Language Identification and ASR (Whisper Model Family) -- AI Data Lake Service - Volcengine

Language Identification and ASR (Whisper Model Family)

Operator Overview

Description

This operator performs language identification and speech recognition. It is a multilingual LID (Language Identification) + ASR (Automatic Speech Recognition) solution built on Whisper models.

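Core Features

Multilingual recognition: supports about a hundred languages, including Chinese and English, and outputs a language tag (for example en, zh) together with the recognized text.
Punctuation restoration: Chinese and English punctuation restoration can be enabled optionally to make the output text more readable.
Multiple audio input forms: URL, binary data, and others.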

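Supported Models

Whisper model family (LID + ASR)
- openai/whisper-large-v3-turbo
- openai/whisper-large-v3
- openai/whisper-medium (moderate Chinese support)
- openai/whisper-small (Chinese output may be in Traditional characters)
Chinese and English punctuation restoration model
- iic/punc_ct-transformer_cn-en-common-vocab471067-large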

Language Support

Nearly a hundred languages are supported, including Chinese, English, German, and Spanish.

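Operator Parameters

Input

Input column	Description
audios	An array of audio data. Each element is either an `audio_url` (the URL of an audio file or a TOS object storage path, which is downloaded locally and then decoded) or an `audio_binary` (raw audio bytes, either already decoded or the original audio binary).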
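
As a rough illustration of the two input forms, the sketch below builds a single `audios` column either from URLs or from raw bytes read from local files. The helper name `build_audio_column` and the choice of a plain dict are assumptions made for illustration and are not part of the operator's API; a dict of this shape could be passed to `daft.from_pydict` in the same way as `samples` in the call example.

```python
from pathlib import Path
from typing import Dict, List, Union

VALID_SRC_TYPES = ("audio_url", "audio_binary")


def build_audio_column(
    sources: List[str], audio_src_type: str
) -> Dict[str, List[Union[str, bytes]]]:
    """Build a one-column dict matching the chosen audio_src_type.

    Hypothetical helper: for "audio_url" the strings are passed through
    unchanged; for "audio_binary" each string is treated as a local file
    path and the file's raw bytes are read.
    """
    if audio_src_type not in VALID_SRC_TYPES:
        raise ValueError(
            f"audio_src_type must be one of {VALID_SRC_TYPES}, got {audio_src_type!r}"
        )
    if audio_src_type == "audio_url":
        return {"audios": list(sources)}
    # audio_binary: read the undecoded bytes of each local file
    return {"audios": [Path(p).read_bytes() for p in sources]}
```

Whichever form is used, the same value must also be passed as the operator's `audio_src_type` parameter so that the data and the declared source type agree.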

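Output

An array of structured results, where each element is a struct with the following fields:

asr_result (str): The text produced by speech recognition.
language (str): The identified language code (for example en, zh).
asr_result_with_punc (Optional[str]): Optional punctuated text. Returned when a punctuation model was loaded at initialization and is available; otherwise None.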
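
One way to consume this struct is sketched below, under the assumption that each result reaches Python as a dict with the three fields above. It prefers the punctuated text when present, falls back to the plain transcript, and groups the texts by language code. The names `AsrResult`, `best_text`, and `texts_by_language` are illustrative only.

```python
from typing import Dict, Iterable, List, Optional, TypedDict


class AsrResult(TypedDict):
    # asr_result is Optional here because it is None in language-only mode
    asr_result: Optional[str]
    language: str
    asr_result_with_punc: Optional[str]


def best_text(result: AsrResult) -> Optional[str]:
    """Return the punctuated text if available, else the plain transcript."""
    punctuated = result.get("asr_result_with_punc")
    if punctuated is not None:
        return punctuated
    return result.get("asr_result")


def texts_by_language(results: Iterable[AsrResult]) -> Dict[str, List[str]]:
    """Group the best available text of each result under its language code."""
    grouped: Dict[str, List[str]] = {}
    for result in results:
        text = best_text(result)
        if text is not None:  # skip results that carry no transcript
            grouped.setdefault(result["language"], []).append(text)
    return grouped
```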

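Parameters

A parameter with no default value is required.

Parameter	Type	Default	Description
audio_src_type	str		Source type of the input audio. Supported values are `audio_url` (the URL of an audio file or a TOS object storage path) and `audio_binary` (raw audio binary data). Make sure it matches the format of the data passed in `audios`.
model_path	str	`/opt/las/models`	Root directory of the models, which usually contains several model subdirectories.
model_name	str	`openai/whisper-large-v3`	Model name. Supported values: `"openai/whisper-small"` (small model), `"openai/whisper-medium"` (medium model), `"openai/whisper-large-v3"` (latest large model), `"openai/whisper-large-v3-turbo"` (optimized large model).
punc_model_name	Optional[str]	`None`	Optional punctuation restoration model name. `iic/punc_ct-transformer_cn-en-common-vocab471067-large` is supported for Chinese and English punctuation restoration. When provided, the operator punctuates the recognized plain text and returns it in the `asr_result_with_punc` field; when omitted, or when loading fails, that field is `None`.
return_language_only	bool	`False`	Whether to return only the language identification result without transcribing speech to text. When set to `True`, both `asr_result` and `asr_result_with_punc` are `None`.
batch_size	int	`10`	Number of audio items per batch. Larger values give higher throughput but use more GPU memory or RAM.
device	str	`cpu`	Inference device identifier, for example `"cpu"`, `"cuda"`, or `"cuda:0"`.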
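
The table can be mirrored in a small configuration object that fills in the documented defaults and rejects values outside the documented choices before the operator is constructed. This is only a sketch: the class `WhisperOperatorConfig` and its `to_construct_args` method are hypothetical names, the checks reflect the table above rather than the operator's own validation, and the positive `batch_size` check is an added assumption.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, Optional

SUPPORTED_MODELS = (
    "openai/whisper-small",
    "openai/whisper-medium",
    "openai/whisper-large-v3-turbo",
    "openai/whisper-large-v3",
)
SUPPORTED_SRC_TYPES = ("audio_url", "audio_binary")


@dataclass
class WhisperOperatorConfig:
    """Hypothetical mirror of the parameter table, with its documented defaults."""

    audio_src_type: str  # required: the table lists no default
    model_path: str = "/opt/las/models"
    model_name: str = "openai/whisper-large-v3"
    punc_model_name: Optional[str] = None
    return_language_only: bool = False
    batch_size: int = 10
    device: str = "cpu"

    def __post_init__(self) -> None:
        if self.audio_src_type not in SUPPORTED_SRC_TYPES:
            raise ValueError(f"unsupported audio_src_type: {self.audio_src_type!r}")
        if self.model_name not in SUPPORTED_MODELS:
            raise ValueError(f"unsupported model_name: {self.model_name!r}")
        if self.batch_size < 1:  # assumption: batches must hold at least one item
            raise ValueError("batch_size must be a positive integer")

    def to_construct_args(self) -> Dict[str, Any]:
        """Return the fields as a plain dict keyed by parameter name."""
        return asdict(self)
```

The resulting dict has the same keys as the `construct_args` passed to `las_udf` in the call example, so a language-only run could, for instance, start from `WhisperOperatorConfig(audio_src_type="audio_url", return_language_only=True).to_construct_args()`.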

Call Example

The code below shows how to run this operator with daft to identify the language of an audio file and transcribe its speech.

from __future__ import annotations

import logging
import os

import ray

import daft
from daft import col
from daft.las.functions.audio.audio_asr_lid_whisper import AudioAsrLidWhisper
from daft.las.functions.udf import las_udf

if __name__ == "__main__":
    if os.getenv("DAFT_RUNNER", "ray") == "ray":

        def configure_logging():
            logging.basicConfig(
                level=logging.INFO,
                format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
                datefmt="%Y-%m-%d %H:%M:%S",
            )
            logging.getLogger("tracing.span").setLevel(logging.WARNING)
            logging.getLogger("daft_io.stats").setLevel(logging.WARNING)
            logging.getLogger("DaftStatisticsManager").setLevel(logging.WARNING)
            logging.getLogger("DaftFlotillaScheduler").setLevel(logging.WARNING)
            logging.getLogger("DaftFlotillaDispatcher").setLevel(logging.WARNING)

        # Start Ray and install the logging configuration in every worker process.
        ray.init(dashboard_host="0.0.0.0", runtime_env={"worker_process_setup_hook": configure_logging})
        daft.set_runner_ray()

    daft.set_execution_config(actor_udf_ready_timeout=600)
    daft.set_execution_config(min_cpu_per_task=0)

    tos_dir_url = os.getenv("TOS_DIR_URL", "las-cn-beijing-public-online.tos-cn-beijing.volces.com")
    samples = {
        "audio_path": [
            f"https://{tos_dir_url}/public/shared_audio_dataset/sample_normal.wav"
        ]
    }

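    # Operator configuration; see the parameter table for the meaning of each value.
    model_path = os.getenv("MODEL_PATH", "/opt/las/models")
    model_name = "openai/whisper-large-v3"
    audio_src_type = "audio_url"  # the inputs are URLs, not raw bytes
    punc_model_name = "iic/punc_ct-transformer_cn-en-common-vocab471067-large"
    num_gpus = 1
    device = "cuda" if num_gpus > 0 else "cpu"  # use the CPU when no GPU is requested
    batch_size = 1
    return_language_only = False  # run full ASR, not only language identification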

    df = daft.from_pydict(samples)
    df = df.with_column(
        "asr_result_detail",
        las_udf(
            AudioAsrLidWhisper,
            construct_args={
                "audio_src_type": audio_src_type,
                "model_path": model_path,
                "model_name": model_name,
                "punc_model_name": punc_model_name,
                "return_language_only": return_language_only,
                "batch_size": batch_size,
                "device": device,
            },
            num_gpus=num_gpus,
            batch_size=1,
            num_cpus=4,
            concurrency=1,
        )(col("audio_path")),
    )

    df.show()

    # Example output:
    # ╭────────────────────────────────┬──────────────────────────────────────────────────────────────────────╮
    # │ audio_path                     ┆ asr_result_detail                                                    │
    # │ ---                            ┆ ---                                                                  │
    # │ Utf8                           ┆ Struct[asr_result: Utf8, language: Utf8, asr_result_with_punc: Utf8] │
    # ╞════════════════════════════════╪══════════════════════════════════════════════════════════════════════╡
    # │ https://las-cn-beijing-publi-… ┆ {asr_result: 人我保住了金我取到了俺老孙啥功名…                            │
    # ╰────────────────────────────────┴──────────────────────────────────────────────────────────────────────╯

Last updated: 2026.03.30 14:23:36
