Speech-to-Text (Whisper Series Models) - AI Data Lake Service - Volcano Engine

Operator Overview

Description

Speech recognition module: a multilingual speech-to-text solution based on the Whisper model.

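Core Features

Multilingual recognition: supports mainstream languages such as Chinese and English
Speech translation: can translate the recognition result into English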

Supported Models

openai/whisper-large-v3-turbo
openai/whisper-large-v3
openai/whisper-medium (mediocre Chinese support)
openai/whisper-small (Chinese output is in Traditional Chinese characters)

Language Support

For the complete list of languages, see: Supported language list

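Calling from Daft

Operator Parameters

Input

Input column	Description
audios	Array of audio data, in one of the following formats: audio_base64: base64-encoded audio string; audio_url: URL of the audio file; audio_binary: raw audio bytes
languages	Array giving the language of each audio item; for the accepted abbreviations, see the language abbreviation list. If this column is not passed, the source_language parameter is used instead.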
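The three input formats above can be sketched as follows. This is a minimal illustration, not part of the operator: `to_audio_input` is a hypothetical helper, and the byte string is a placeholder rather than a real recording. Only the three format names come from the table.

```python
import base64

# Allowed audio_src_type values, as listed in this document.
AUDIO_SRC_TYPES = ("audio_binary", "audio_url", "audio_base64")


def to_audio_input(raw: bytes, url: str, audio_src_type: str):
    """Return the value to place in the `audios` column for one audio item.

    Hypothetical helper for illustration only.
    """
    if audio_src_type == "audio_binary":
        return raw  # raw audio bytes, passed through unchanged
    if audio_src_type == "audio_base64":
        return base64.b64encode(raw).decode("ascii")  # base64 text
    if audio_src_type == "audio_url":
        return url  # tos/http address of the audio file
    raise ValueError(f"unsupported audio_src_type: {audio_src_type!r}")


fake_wav = b"RIFF\x00\x00\x00\x00WAVE"  # placeholder bytes, not a real recording
encoded = to_audio_input(fake_wav, "", "audio_base64")
```

Whichever format is chosen, the same name must be passed as the audio_src_type parameter so that the operator knows how to interpret the column.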

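Output

A structured array; each element contains the following fields:

asr_result: the recognized text;
timestamps: list of timestamp pairs (start/end time);
segments: list of segmented text results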
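As a sketch of how one output element might be consumed, the snippet below pairs each segment with its timestamp. The sample values are made up for illustration, and the one-to-one alignment between timestamps and segments is an assumption that this document does not state explicitly.

```python
# Hypothetical output element; the values are invented for illustration.
result = {
    "asr_result": "hello world",
    "timestamps": [[0.0, 1.2], [1.2, 2.5]],
    "segments": ["hello", "world"],
}


def format_segments(element: dict) -> list[str]:
    """Render each segment as "[start - end] text".

    Assumes timestamps[i] is the (start, end) pair of segments[i].
    """
    lines = []
    for (start, end), text in zip(element["timestamps"], element["segments"]):
        lines.append(f"[{start:.2f}s - {end:.2f}s] {text}")
    return lines


for line in format_segments(result):
    print(line)
```

If the two lists can differ in length in practice, `zip` silently truncates to the shorter one, so a length check before formatting may be worthwhile.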

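Parameters

A parameter without a default value is required.

Parameter	Type	Default	Description
audio_src_type	str		Audio source type. Supported types: tos/http address (audio_url), base64 encoding (audio_base64), binary stream (audio_binary). Allowed values: ["audio_binary", "audio_url", "audio_base64"]
model_path	str	/opt/las/models	Model storage path.
model_name	str	openai/whisper-large-v3	Model name. Supported Whisper models: whisper-small (small model), whisper-medium (medium model), whisper-large-v3 (latest large model), whisper-large-v3-turbo (optimized large model). Allowed values: ["openai/whisper-small", "openai/whisper-medium", "openai/whisper-large-v3-turbo", "openai/whisper-large-v3"]
batch_size	int	10	Number of audio samples processed per batch.
source_language	str	None	Source language of the audio, e.g. chinese/english/japanese/korean. Set to None to enable automatic detection.
translate_to_english	bool	False	English translation mode. When enabled, the output text is the English translation of the recognition result.
condition_on_prev_tokens	bool	True	History-dependence mode: whether predictions are conditioned on previous tokens. Disabling it reduces the coherence of the result but speeds up processing.
compression_ratio_threshold	float	1.35	Text compression threshold, which controls how compressible the generated text may be (recommended range 1.2-2.0). The larger the value, the more repeated content is retained.
temperature	float	0.5	Temperature, which controls the randomness of the generated text (0.0-1.0). Higher values suit creative scenarios; lower values suit deterministic scenarios.
logprob_threshold	float	-1.0	Log-probability threshold used to filter out low-confidence words. A word whose log probability is below this value may be rejected. With the default of -1.0, filtering is disabled and all words are kept.
dtype	str	bfloat16	Numeric precision used for model inference: bfloat16 balances precision and speed (default), float16 gives faster inference, float32 gives the highest precision. Allowed values: ["bfloat16", "float16", "float32"]
rank	int	0	GPU device ID to use (takes effect in multi-GPU environments). Defaults to the first GPU (ID=0).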
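The table above can be turned into a small pre-flight check before the arguments are handed to the operator. The sketch below is a hypothetical helper, not part of the Daft or LAS API: `build_construct_args` fills in the documented defaults, requires audio_src_type (the only parameter without a default), and rejects values outside the documented option lists and temperature range.

```python
# Option lists and defaults copied from the parameter table.
ALLOWED = {
    "audio_src_type": {"audio_binary", "audio_url", "audio_base64"},
    "model_name": {
        "openai/whisper-small",
        "openai/whisper-medium",
        "openai/whisper-large-v3-turbo",
        "openai/whisper-large-v3",
    },
    "dtype": {"bfloat16", "float16", "float32"},
}

DEFAULTS = {
    "model_path": "/opt/las/models",
    "model_name": "openai/whisper-large-v3",
    "batch_size": 10,
    "source_language": None,
    "translate_to_english": False,
    "condition_on_prev_tokens": True,
    "compression_ratio_threshold": 1.35,
    "temperature": 0.5,
    "logprob_threshold": -1.0,
    "dtype": "bfloat16",
    "rank": 0,
}


def build_construct_args(audio_src_type: str, **overrides) -> dict:
    """Merge overrides into the documented defaults and validate them.

    Hypothetical helper for illustration only.
    """
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"unknown parameters: {sorted(unknown)}")
    args = {"audio_src_type": audio_src_type, **DEFAULTS, **overrides}
    for name, allowed in ALLOWED.items():
        if args[name] not in allowed:
            raise ValueError(f"{name}={args[name]!r} is not one of {sorted(allowed)}")
    if not 0.0 <= args["temperature"] <= 1.0:
        raise ValueError("temperature must be within 0.0-1.0")
    return args


construct_args = build_construct_args("audio_url", batch_size=1, source_language="chinese")
```

The resulting dictionary has the same keys as the construct_args dictionary used in the calling example, so a typo in a parameter name fails locally instead of inside a remote worker.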

Example

The following code shows how to run the operator with Daft to convert speech to text.

from __future__ import annotations

import logging
import os

import ray

import daft
from daft import col
from daft.las.functions.audio.audio_asr_whisper import AudioAsrWhisper
from daft.las.functions.udf import las_udf

if __name__ == "__main__":

    # Use the Ray runner unless DAFT_RUNNER is set to another value.
    if os.getenv("DAFT_RUNNER", "ray") == "ray":

        def configure_logging():
            # Runs in each Ray worker process: apply a common log format
            # and quiet the noisier Daft-internal loggers.
            logging.basicConfig(
                level=logging.INFO,
                format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
                datefmt="%Y-%m-%d %H:%M:%S",
            )
            logging.getLogger("tracing.span").setLevel(logging.WARNING)
            logging.getLogger("daft_io.stats").setLevel(logging.WARNING)
            logging.getLogger("DaftStatisticsManager").setLevel(logging.WARNING)
            logging.getLogger("DaftFlotillaScheduler").setLevel(logging.WARNING)
            logging.getLogger("DaftFlotillaDispatcher").setLevel(logging.WARNING)

        ray.init(dashboard_host="0.0.0.0", runtime_env={"worker_process_setup_hook": configure_logging})
        daft.set_runner_ray()

    daft.set_execution_config(actor_udf_ready_timeout=600)
    daft.set_execution_config(min_cpu_per_task=0)

    tos_dir_url = os.getenv("TOS_DIR_URL", "las-cn-beijing-public-online.tos-cn-beijing.volces.com")
    samples = {
        "audio_path": [
            f"https://{tos_dir_url}/public/shared_audio_dataset/参观八达岭长城。.wav"
        ]
    }

    model_path = os.getenv("MODEL_PATH", "/opt/las/models")
    model_name = "openai/whisper-large-v3"
    audio_src_type = "audio_url"
    dtype = "bfloat16"
    source_language = "chinese"
    translate_to_english = False
    condition_on_prev_tokens = True
    compression_ratio_threshold = 1.35
    temperature = 0.5
    logprob_threshold = -1.0
    batch_size = 1
    rank = 0

    df = daft.from_pydict(samples)
    df = df.with_column(
        "asr_result_detail",
        las_udf(
            AudioAsrWhisper,
            construct_args={
                "audio_src_type": audio_src_type,
                "model_path": model_path,
                "model_name": model_name,
                "dtype": dtype,
                "source_language": source_language,
                "translate_to_english": translate_to_english,
                "condition_on_prev_tokens": condition_on_prev_tokens,
                "compression_ratio_threshold": compression_ratio_threshold,
                "temperature": temperature,
                "logprob_threshold": logprob_threshold,
                "batch_size": batch_size,
                "rank": rank,
            },
            num_gpus=1,
            batch_size=1,
            concurrency=1,
        )(col("audio_path")),
    )
    df.show()

    # ╭────────────────────────────────┬─────────────────────────────────────────────────────────────╮
    # │ audio_path                     ┆ asr_result_detail                                           │
    # │ ---                            ┆ ---                                                         │
    # │ Utf8                           ┆ Struct[asr_result: Utf8, timestamps: List[List[Float32]],   │
    # │                                ┆ segments: List[Utf8]]                                       │
    # ╞════════════════════════════════╪═════════════════════════════════════════════════════════════╡
    # │ tos://las-cn-beijing-publi-…   ┆ {asr_result: 参观八道岭长城,                                  │
    # │                                ┆ timesta…                                                    │
    # ╰────────────────────────────────┴─────────────────────────────────────────────────────────────╯

Last updated: 2026.03.30 14:23:36
