Character Ratio Calculator - text feature extraction based on the ratio of alphanumeric (letter and digit) characters
| Input column | Description |
|---|---|
| texts | Text column to process; each element must be a string |
The output is a ratio column whose elements are floats giving the proportion of alphanumeric characters in each text.
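To make the output value concrete, here is a minimal sketch of a character-level ratio in plain Python. It is not the operator's actual implementation, which this page does not show. It assumes that "alphanumeric" means `str.isalnum()` is true for the character, which agrees with the sample output in the usage example in this document (CJK characters count as alphanumeric). The function name `alphanumeric_ratio` and the empty-string behaviour are assumptions made for illustration.

```python
def alphanumeric_ratio(text: str) -> float:
    """Return the share of characters in `text` for which str.isalnum() is true."""
    if not text:
        # Assumption: an empty string yields 0.0 instead of raising ZeroDivisionError.
        return 0.0
    alnum_count = sum(1 for ch in text if ch.isalnum())
    return alnum_count / len(text)


# "Hello, world!" has 10 letters out of 13 characters.
ratios = [alphanumeric_ratio(t) for t in ["HelloWorld123", "Hello, world!", "!!!@@@###$$$"]]
print(ratios)
```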
Any parameter without a default value is required.
| Parameter | Type | Default | Description |
|---|---|---|---|
| tokenization | bool | False | Whether to compute the ratio in tokenization mode |
| model_path | str | /opt/las/models | Directory containing the model files |
| model_name | str | pythia-6.9b-deduped | Name of the model |
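This page does not say how tokenization mode defines the ratio. One plausible reading, which is an assumption and not confirmed here, is that the count of alphanumeric characters is divided by the number of tokens instead of the number of characters. The sketch below illustrates that reading with a hypothetical whitespace tokenizer standing in for the tokenizer loaded from `model_path` and `model_name`; the function name `alphanumeric_token_ratio` is also hypothetical.

```python
from typing import Callable, List


def alphanumeric_token_ratio(text: str, tokenize: Callable[[str], List[str]]) -> float:
    """Alphanumeric character count divided by token count (assumed definition)."""
    tokens = tokenize(text)
    if not tokens:
        # Assumption: text with no tokens yields 0.0.
        return 0.0
    alnum_count = sum(1 for ch in text if ch.isalnum())
    return alnum_count / len(tokens)


# Hypothetical stand-in tokenizer: split on whitespace.
# A real run would use the tokenizer of the configured model instead.
whitespace_tokenize = str.split

# 18 alphanumeric characters over 5 whitespace-separated tokens.
token_ratio = alphanumeric_token_ratio("Test 123! Is it working?", whitespace_tokenize)
print(token_ratio)
```

Under this reading the value can exceed 1.0, so a threshold chosen for the character-level ratio would not carry over to tokenization mode.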
The following code shows how to run the operator with daft to compute the ratio of alphanumeric characters in text.

    from __future__ import annotations

    import os

    import daft
    from daft import col
    from daft.las.functions.text.alphanumeric_ratio_calculator import AlphanumericRatioCalculator
    from daft.las.functions.udf import las_udf

    if __name__ == "__main__":
        # Switch to the Ray runner only when DAFT_RUNNER=ray; the default is the native runner.
        if os.getenv("DAFT_RUNNER", "native") == "ray":
            import logging

            import ray

            def configure_logging():
                # Runs in every Ray worker process at startup.
                logging.basicConfig(
                    level=logging.INFO,
                    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
                    datefmt="%Y-%m-%d %H:%M:%S",
                )
                # Quieten noisy internal loggers.
                logging.getLogger("tracing.span").setLevel(logging.WARNING)
                logging.getLogger("daft_io.stats").setLevel(logging.WARNING)
                logging.getLogger("DaftStatisticsManager").setLevel(logging.WARNING)
                logging.getLogger("DaftFlotillaScheduler").setLevel(logging.WARNING)
                logging.getLogger("DaftFlotillaDispatcher").setLevel(logging.WARNING)

            ray.init(dashboard_host="0.0.0.0", runtime_env={"worker_process_setup_hook": configure_logging})
            daft.context.set_runner_ray()

        daft.set_execution_config(actor_udf_ready_timeout=600)
        daft.set_execution_config(min_cpu_per_task=0)

        samples = {
            "text": [
                "HelloWorld123",
                "Hello, world!",
                "!!!@@@###$$$",
                "Test 123! Is it working?",
                "你好Hello123",
            ]
        }

        ds = daft.from_pydict(samples)
        # Apply the operator to the "text" column in character mode (tokenization disabled).
        ds = ds.with_column(
            "alphanumeric_ratio",
            las_udf(
                AlphanumericRatioCalculator,
                construct_args={"tokenization": False},
            )(col("text")),
        )

        ds.show()
        # ╭─────────────────────────┬─────────────────────╮
        # │ text                    ┆ alphanumeric_ratio  │
        # │ ---                     ┆ ---                 │
        # │ Utf8                    ┆ Float64             │
        # ╞═════════════════════════╪═════════════════════╡
        # │ HelloWorld123           ┆ 1.0                 │
        # │ Hello, world!           ┆ 0.7692307692307693  │
        # │ !!!@@@###$$$            ┆ 0.0                 │
        # │ Test 123! Is it work…   ┆ 0.75                │
        # │ 你好Hello123            ┆ 1.0                 │
        # ╰─────────────────────────┴─────────────────────╯