Text language identification operator: provides multilingual language identification based on the FastText model.
| Input column | Description |
|---|---|
| texts | Raw string column; every element must be a string. |
The operator outputs a struct column holding a language label and a confidence score.
Each element contains the fields `language` and `confidence`.
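For orientation, the sketch below shows one way a raw FastText prediction could be mapped to a record of this shape. FastText language identification models return labels prefixed with `__label__`; the helper name `to_language_record` and the prefix-stripping step are assumptions made for illustration, not the operator's confirmed implementation.

```python
from typing import TypedDict

# Prefix that FastText attaches to every predicted label.
FASTTEXT_LABEL_PREFIX = "__label__"


class LanguageRecord(TypedDict):
    language: str
    confidence: float


def to_language_record(label: str, probability: float) -> LanguageRecord:
    """Hypothetical helper: build a {language, confidence} record from one prediction."""
    if label.startswith(FASTTEXT_LABEL_PREFIX):
        language = label[len(FASTTEXT_LABEL_PREFIX):]
    else:
        # Already a bare language code; keep it unchanged.
        language = label
    return {"language": language, "confidence": float(probability)}


record = to_language_record("__label__zh", 0.6892356276512146)
print(record["language"], record["confidence"])
```

The score is passed through unchanged, so a value marginally above 1.0 from floating-point rounding would also appear as-is in the record.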
Parameters without a default value are required.
| Parameter | Type | Default | Description |
|---|---|---|---|
| model_path | str | /opt/las/models | Directory that contains the model files. |
| model_name | str | fasttext/lid.176.bin | Model file name; supports "fasttext/lid.176.bin" or "fasttext/lid.176.ftz". |
| batch_size | int | 1000 | Batch size for processing; a larger batch_size can improve throughput at the cost of higher memory use. |
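To make the parameters concrete, here is a minimal sketch of how they might be used, assuming that the model file is located by joining `model_path` and `model_name` and that input strings are processed in chunks of `batch_size`. The helpers `resolve_model_file` and `iter_batches` are hypothetical names for illustration; the operator's actual internals are not documented here.

```python
import posixpath
from typing import Iterator, List, Sequence


def resolve_model_file(model_path: str, model_name: str) -> str:
    """Assumed behavior: the model file lives at <model_path>/<model_name>."""
    return posixpath.join(model_path, model_name)


def iter_batches(texts: Sequence[str], batch_size: int = 1000) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most batch_size strings."""
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(texts), batch_size):
        yield list(texts[start:start + batch_size])


model_file = resolve_model_file("/opt/las/models", "fasttext/lid.176.bin")
batch_lengths = [len(batch) for batch in iter_batches(["a", "b", "c", "d", "e"], batch_size=2)]
print(model_file)
print(batch_lengths)
```

Each batch is held in memory at once, which is why a larger `batch_size` trades memory for fewer, larger prediction calls.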
The following code shows how to run the operator with daft to identify the language of text.
```python
from __future__ import annotations

import os

import daft
from daft import col
from daft.las.functions.text.language_recognition import LanguageRecognitionOperator
from daft.las.functions.udf import las_udf

if __name__ == "__main__":
    # When DAFT_RUNNER=ray, start Ray and configure logging on every worker.
    if os.getenv("DAFT_RUNNER", "native") == "ray":
        import logging

        import ray

        def configure_logging():
            logging.basicConfig(
                level=logging.INFO,
                format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
                datefmt="%Y-%m-%d %H:%M:%S",
            )
            # Reduce noise from internal loggers.
            logging.getLogger("tracing.span").setLevel(logging.WARNING)
            logging.getLogger("daft_io.stats").setLevel(logging.WARNING)
            logging.getLogger("DaftStatisticsManager").setLevel(logging.WARNING)
            logging.getLogger("DaftFlotillaScheduler").setLevel(logging.WARNING)
            logging.getLogger("DaftFlotillaDispatcher").setLevel(logging.WARNING)

        ray.init(dashboard_host="0.0.0.0", runtime_env={"worker_process_setup_hook": configure_logging})
        daft.context.set_runner_ray()

    daft.set_execution_config(actor_udf_ready_timeout=600)
    daft.set_execution_config(min_cpu_per_task=0)

    # Sample inputs: Chinese, English, mixed English/Chinese, Japanese, Korean.
    samples = {
        "text": [
            "这是一行测试内容。",
            "This is a test content.",
            "This is a test content.这是一行测试内容。",
            "こんにちは",
            "안녕하세요",
        ]
    }

    ds = daft.from_pydict(samples)
    # Apply the operator to the "text" column; the result is a struct column.
    ds = ds.with_column(
        "language_result",
        las_udf(
            LanguageRecognitionOperator,
            construct_args={
                "model_path": os.getenv("MODEL_PATH", "/opt/las/models"),
                "model_name": "fasttext/lid.176.bin",
                "batch_size": 1000,
            },
        )(col("text")),
    )

    ds.show()
    # ╭──────────────────────────────────────┬────────────────────────────────────────────────╮
    # │ text                                 ┆ language_result                                │
    # │ ---                                  ┆ ---                                            │
    # │ Utf8                                 ┆ Struct[language: Utf8, confidence: Float64]    │
    # ╞══════════════════════════════════════╪════════════════════════════════════════════════╡
    # │ 这是一行测试内容。                   ┆ {language: zh, confidence: 1.000048279762268}  │
    # │ This is a test content.              ┆ {language: en, confidence: 0.9209088683128357} │
    # │ This is a test content.这是一行测试… ┆ {language: zh, confidence: 0.6892356276512146} │
    # │ こんにちは                           ┆ {language: ja, confidence: 1.0000269412994385} │
    # │ 안녕하세요                           ┆ {language: ko, confidence: 0.9996028542518616} │
    # ╰──────────────────────────────────────┴────────────────────────────────────────────────╯
```
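In the sample output, the mixed English/Chinese row receives a noticeably lower confidence than the single-language rows. The sketch below shows one way downstream code might treat such rows as undetermined, using the values from the output above as plain Python dictionaries. The 0.8 threshold and the helper name `confident_language` are assumptions chosen for illustration, not values recommended by the operator.

```python
from typing import Dict, List, Optional, Union

Result = Dict[str, Union[str, float]]


def confident_language(result: Result, threshold: float = 0.8) -> Optional[str]:
    """Return the language label, or None when confidence is below the threshold."""
    if float(result["confidence"]) >= threshold:
        return str(result["language"])
    return None


# Values copied from the sample output above.
results: List[Result] = [
    {"language": "zh", "confidence": 1.000048279762268},
    {"language": "en", "confidence": 0.9209088683128357},
    {"language": "zh", "confidence": 0.6892356276512146},  # mixed-language input
    {"language": "ja", "confidence": 1.0000269412994385},
    {"language": "ko", "confidence": 0.9996028542518616},
]

labels = [confident_language(r) for r in results]
print(labels)  # ['zh', 'en', None, 'ja', 'ko']
```

A suitable threshold depends on the data; short or mixed-language strings tend to need a more forgiving cutoff or separate handling.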