XTTS-v2 inference_stream generates audio chunks too slowly: help finding the cause
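I'm trying to use XTTS-v2 from Coqui-TTS for real-time audio generation, but the model generates audio chunks too slowly. My GPU is an RTX 2070 Super with 8 GB of VRAM, and I'm not sure whether I'm calling the function the wrong way or my GPU simply isn't fast enough.
This is my core script: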
    import os

    import sounddevice as sd
    import torch
    import TTS
    from TTS.tts.configs.xtts_config import XttsConfig
    from TTS.tts.models.xtts import Xtts
    from TTS.utils.generic_utils import get_user_data_dir
    from TTS.utils.manage import ModelManager

    print('Loading TTS config and model')
    # Allow-list the XTTS config classes so torch.load can unpickle the checkpoint
    torch.serialization.add_safe_globals([
        XttsConfig,
        TTS.tts.models.xtts.XttsAudioConfig,
        TTS.config.shared_configs.BaseDatasetConfig,
        TTS.tts.models.xtts.XttsArgs,
    ])
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    # Download the model (if needed) and load it from the local data directory
    tts_model_name = "tts_models/multilingual/multi-dataset/xtts_v2"
    ModelManager().download_model(tts_model_name)
    model_path = os.path.join(get_user_data_dir("tts"), tts_model_name.replace("/", "--"))
    config = XttsConfig()
    config.load_json(os.path.join(model_path, "config.json"))
    model = Xtts.init_from_config(config)
    model.load_checkpoint(config, model_path, use_deepspeed=False)

    # Compute the speaker conditioning from a reference recording
    voice_wav = './audio/output2.wav'
    gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
        audio_path=voice_wav, gpt_cond_len=36, gpt_cond_chunk_len=12, load_sr=22050)
    model.to(device, non_blocking=True)
    print('TTS model ready')

    # running
    text = 'Servus, ich kann dich hören, ist bei dir alles in Ordnung? Kann ich irgendwie helfen? Frag mich wenn du was hast.'
    with sd.OutputStream(24000, channels=1) as stream:
        for chunk in model.inference_stream(
                text,
                language='de',
                gpt_cond_latent=gpt_cond_latent,
                speaker_embedding=speaker_embedding,
                stream_chunk_size=35):
            # Move each chunk to the CPU and play it as soon as it arrives
            chunk_np = chunk.cpu().numpy() if torch.is_tensor(chunk) else chunk
            stream.write(chunk_np)
To measure how fast the model generates, I also wrote this snippet, which times only the first chunk:
    import time

    t0 = time.time()
    chunk = next(model.inference_stream(
        text,
        language='de',
        gpt_cond_latent=gpt_cond_latent,
        speaker_embedding=speaker_embedding
    ))
    gen_time = time.time() - t0

    t1 = time.time()
    chunk_np = chunk.cpu().numpy() if torch.is_tensor(chunk) else chunk
    copy_time = time.time() - t1

    print("gen_time:", gen_time, "copy_time:", copy_time, "samples:", chunk_np.size)
The output I get is:
gen_time: 0.8997743129730225 copy_time: 0.0 samples: 21248
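If I read these numbers right, the first chunk took about as long to generate as it takes to play. Below is a small sketch of that arithmetic. It assumes the output sample rate is 24000 Hz (the value I pass to sd.OutputStream), and real_time_factor is just a helper name I made up for this post. Since this is only the first chunk, the figure may also include one-time start-up costs, so it might not be representative of later chunks.

```python
SAMPLE_RATE = 24000  # assumed output rate, same value passed to sd.OutputStream


def real_time_factor(gen_seconds: float, num_samples: int,
                     sample_rate: int = SAMPLE_RATE) -> float:
    """Generation time divided by the duration of the audio produced.

    Values above 1.0 mean generation is slower than playback.
    """
    audio_seconds = num_samples / sample_rate
    return gen_seconds / audio_seconds


# Numbers taken from the output above
audio_seconds = 21248 / SAMPLE_RATE
rtf = real_time_factor(0.8997743129730225, 21248)
print(f"audio: {audio_seconds:.3f} s, RTF: {rtf:.3f}")
# prints: audio: 0.885 s, RTF: 1.016
```

So the first chunk holds roughly 0.885 s of audio and took roughly 0.900 s to produce, which is just over real time.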
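What could be causing such a long generation time? Is my GPU simply not powerful enough, or is there something else I can optimize? Thanks in advance for any answers!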
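For anyone who wants to reproduce this, here is a rough sketch of how every chunk could be timed instead of only the first one. time_chunks and fake_stream are names I made up for this post, not part of Coqui-TTS, and the 24000 Hz sample rate is again an assumption. The optional sync argument is there because CUDA work is asynchronous, so I assume passing torch.cuda.synchronize gives more trustworthy timings on a GPU. The sketch uses time.perf_counter, which has a finer resolution than time.time.

```python
import time


def time_chunks(chunks, sample_rate=24000, sync=None):
    """Yield (index, gen_seconds, audio_seconds) for each chunk of an iterator.

    chunks: any iterable of audio chunks (tensors, arrays or lists).
    sync:   optional zero-argument callable run before and after each chunk,
            e.g. torch.cuda.synchronize when timing GPU work.
    """
    iterator = iter(chunks)
    index = 0
    while True:
        if sync is not None:
            sync()
        start = time.perf_counter()
        try:
            chunk = next(iterator)
        except StopIteration:
            return
        if sync is not None:
            sync()
        gen_seconds = time.perf_counter() - start
        # torch tensors expose numel(); fall back to len() for 1-D arrays/lists
        num_samples = chunk.numel() if hasattr(chunk, "numel") else len(chunk)
        yield index, gen_seconds, num_samples / sample_rate
        index += 1


def fake_stream():
    """Stand-in generator so the sketch runs without a model."""
    for num_samples in (2400, 4800):
        time.sleep(0.01)  # pretend to generate audio
        yield [0.0] * num_samples


results = list(time_chunks(fake_stream()))
for index, gen_seconds, audio_seconds in results:
    print(f"chunk {index}: generated in {gen_seconds:.3f} s, "
          f"plays for {audio_seconds:.3f} s")
```

With the real model, the idea would be to pass model.inference_stream(...) as chunks and torch.cuda.synchronize as sync, then compare the two numbers per chunk: if generation takes longer than playback for most chunks, the stream will stutter.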




