Help: a real-time voice Q&A app built with Streamlit + Whisper does not pick up audio or return responses

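Hi everyone. I'm building a real-time voice Q&A application with Streamlit, OpenAI Whisper, and SoundDevice. It is supposed to:

  • Listen to microphone input in real time
  • (Optionally) transcribe the audio to text with Whisper
  • Call an OpenAI language model to generate an answer
  • Show the transcribed question and the AI's answer in the Streamlit UI in real time

Right now the app does not work at all. After startup the UI shows "Listening for interview questions...", but nothing happens when I speak into the microphone, and occasionally the app crashes with a "Connection Error" pop-up.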

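I have already tried the following, and none of it solved the problem:

  • Switching the Whisper model from base to tiny to reduce memory usage
  • Handling audio capture and transcription in a separate thread so the UI is not blocked
  • Running on macOS Monterey on a machine with an M2 Pro chip
  • Restarting the Streamlit app and confirming that microphone permission is granted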

My development environment:

  • OS: macOS Monterey 12.6
  • Python version: 3.10
  • Streamlit version: 1.25.0
  • Whisper version: latest
  • SoundDevice version: latest
  • Hardware: 2024 MacBook Pro (M2 Pro chip)

Here is my code:

import os
import openai
import whisper
import streamlit as st
import sounddevice as sd
import numpy as np
import queue
import threading
import time

# Initialize Whisper model
whisper_model = whisper.load_model('tiny')

# Streamlit setup
st.title("AI/ML Interview Assistant")
st.markdown("Listening for interview questions...")

# Real-time audio queue
audio_queue = queue.Queue()

# Audio callback to capture microphone input
def audio_callback(indata, frames, time, status):
    audio_queue.put(indata.copy())

# Transcribe audio and generate responses
def transcribe_and_respond():
    audio_data = []
    while True:
        try:
            if not audio_queue.empty():
                audio_data.append(audio_queue.get())
                if len(audio_data) > 20:
                    audio_segment = np.concatenate(audio_data, axis=0)
                    audio_data.clear()
                    transcription = whisper_model.transcribe(audio_segment)
                    question = transcription['text']
                    st.text(f"You: {question}")
                    response = generate_response(question)
                    st.text(f"Assistant: {response}")
                    time.sleep(1)
        except Exception as e:
            st.error(f"Error during transcription: {str(e)}")

# Generate response using OpenAI API
def generate_response(question):
    try:
        response = openai.Completion.create(
            model="text-davinci-003",
            prompt=f"Q: {question}\nA:",
            max_tokens=150
        )
        return response['choices'][0]['text'].strip()
    except Exception as e:
        return f"Error generating response: {str(e)}"

# Start audio stream in a separate thread
def start_audio_stream():
    try:
        stream = sd.InputStream(callback=audio_callback)
        with stream:
            threading.Thread(target=transcribe_and_respond, daemon=True).start()
            while True:
                time.sleep(0.1)
    except Exception as e:
        st.error(f"Audio stream error: {str(e)}")

# Start the audio stream
start_audio_stream()

Here are the core problems I see in your code, with a fix for each:

1. Streamlit thread safety (the most important one)

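Streamlit UI elements cannot be updated directly from a background thread. You currently call st.text() and st.error() inside transcribe_and_respond, which runs in a thread you started yourself. That thread does not carry Streamlit's script-run context, so those calls are at best ignored and at worst corrupt rendering or crash the app. The same limitation applies to st.session_state when it is accessed from such a thread. The reliable pattern is to have background threads only produce data and hand it over through a thread-safe structure such as queue.Queue; the main script then copies the results into st.session_state and renders them on each rerun.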
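The handoff described above can be sketched without Streamlit at all. This is a minimal illustration rather than the asker's code: `worker`, `drain`, and the message dictionaries are hypothetical names. The background thread only puts results on a `queue.Queue`; the main thread drains the queue and is the only place that would touch the UI.

```python
import queue
import threading


def worker(results: queue.Queue, questions: list[str]) -> None:
    """Background thread: produce data only, never touch the UI."""
    for q in questions:
        # Stand-in for transcription plus the language model call.
        results.put({"user": q, "assistant": f"(answer to: {q})"})


def drain(results: queue.Queue) -> list[dict]:
    """Main thread: collect everything that is ready, without blocking."""
    items = []
    while True:
        try:
            items.append(results.get_nowait())
        except queue.Empty:
            return items


results = queue.Queue()
t = threading.Thread(
    target=worker, args=(results, ["What is overfitting?"]), daemon=True
)
t.start()
t.join()  # only for this demo; a real app would poll instead of joining

# In a Streamlit script this list would live in st.session_state.
conversation = drain(results)
```

In a Streamlit app, the drain step would run near the top of the script on every rerun, followed by the rendering code.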

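2. Incomplete audio capture logic

You only process audio once len(audio_data) > 20, but that count says nothing about how much audio you actually have: the duration of each block depends on the sample rate and the block size, and your code sets neither. Whisper also expects 16 kHz mono audio, whereas sd.InputStream() called without arguments uses the device's default settings. Finally, there is no silence filtering, so the trigger can fire after an unpredictable amount of audio, and much of what gets transcribed may be silence.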
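To make the relationship between block count and duration concrete, here is a small sketch. It assumes a fixed block size (1024 is an illustrative choice, not a library default) and uses hypothetical helper names. The silence check uses mean absolute amplitude with a threshold of 0.01, which is a tunable assumption.

```python
import math


def blocks_for_duration(seconds: float, samplerate: int, blocksize: int) -> int:
    """Number of fixed-size blocks needed to cover at least `seconds` of audio."""
    return math.ceil(seconds * samplerate / blocksize)


def mean_abs_level(samples: list[float]) -> float:
    """Mean absolute amplitude of one block of float samples in [-1.0, 1.0]."""
    if not samples:
        return 0.0
    return sum(abs(s) for s in samples) / len(samples)


def is_silent(samples: list[float], threshold: float = 0.01) -> bool:
    """Treat a block as silence when its level stays below the threshold."""
    return mean_abs_level(samples) < threshold


# The same two seconds of audio need very different block counts:
blocks_at_16k = blocks_for_duration(2.0, 16000, 1024)  # 32 blocks
blocks_at_48k = blocks_for_duration(2.0, 48000, 1024)  # 94 blocks
```

A hard-coded count such as 20 therefore corresponds to a different amount of audio on every device unless both the sample rate and the block size are pinned.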

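3. OpenAI API key is not configured

Your code never sets openai.api_key, so every API call will fail. A missing key usually surfaces as an authentication error rather than a connection error, so the "Connection Error" pop-up may have an additional cause, but the key has to be set in any case.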
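One way to catch a missing key before any API call is made is to check the environment at startup. The sketch below is an assumption about how you might structure this; `require_api_key` is a hypothetical helper, not part of the openai package.

```python
import os


def require_api_key(env=os.environ, name: str = "OPENAI_API_KEY") -> str:
    """Return the key from the environment, or fail early with a clear message."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before starting the app")
    return key
```

The result could then be assigned once at the top of the script, for example openai.api_key = require_api_key(), so a missing key stops the app immediately instead of showing up later as a failed request.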

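4. SoundDevice device selection

On an M2 Mac the default audio device may not be the microphone, or SoundDevice may not detect the input device correctly. In either case no audio is captured.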


Complete corrected code example

import os
import queue
import threading
import time

import numpy as np
import openai
import sounddevice as sd
import streamlit as st
import whisper

# Read the OpenAI API key from an environment variable (do not hard-code it)
openai.api_key = os.getenv("OPENAI_API_KEY")

# Audio settings matching what Whisper expects: 16 kHz, mono, float32
SAMPLERATE = 16000
CHANNELS = 1
# Fix the block size so that a block count maps to a known duration
BLOCKSIZE = 1024
# About 1.9 seconds of audio (30 * 1024 / 16000)
MAX_BUFFER_BLOCKS = 30
# Blocks whose mean absolute amplitude is below this are treated as silence
SILENCE_THRESHOLD = 0.01

# st.rerun() replaced st.experimental_rerun() in newer Streamlit releases
rerun = getattr(st, "rerun", None) or st.experimental_rerun


# Generate a response using the OpenAI API.
# This is the legacy Completions call (openai<1.0). text-davinci-003 has since
# been retired, so substitute a model that is available to your account.
def generate_response(question):
    try:
        response = openai.Completion.create(
            model="text-davinci-003",
            prompt=f"Q: {question}\nA:",
            max_tokens=150,
            temperature=0.7
        )
        return response['choices'][0]['text'].strip()
    except Exception as e:
        return f"Failed to generate response: {str(e)}"


# Background thread: capture microphone input and queue the non-silent blocks
def capture_audio(audio_queue, result_queue):
    def audio_callback(indata, frames, time_info, status):
        if status:
            print(status)
        # Keep only blocks that exceed the silence threshold
        if np.abs(indata).mean() > SILENCE_THRESHOLD:
            audio_queue.put(indata.copy())

    try:
        stream = sd.InputStream(
            samplerate=SAMPLERATE,
            channels=CHANNELS,
            blocksize=BLOCKSIZE,
            dtype="float32",
            callback=audio_callback
        )
        with stream:
            # Keep the stream open for the lifetime of the app
            while True:
                time.sleep(0.1)
    except Exception as e:
        result_queue.put({"error": f"Audio stream error: {str(e)}"})


# Background thread: transcribe and answer. It only produces data and never
# calls st.* or touches st.session_state.
def transcribe_and_respond(model, audio_queue, result_queue):
    audio_buffer = []
    while True:
        try:
            # Blocks until the callback delivers the next audio block
            audio_buffer.append(audio_queue.get())
            if len(audio_buffer) < MAX_BUFFER_BLOCKS:
                continue
            # Concatenate into the 1-D float32 array Whisper expects
            audio_segment = np.concatenate(audio_buffer, axis=0).flatten()
            audio_buffer.clear()
            transcription = model.transcribe(audio_segment, language="en")
            question = transcription['text'].strip()
            if question:  # skip empty transcriptions
                response = generate_response(question)
                # Hand the result to the main script through the queue
                result_queue.put({"user": question, "assistant": response})
        except Exception as e:
            result_queue.put({"error": f"Error: {str(e)}"})
            time.sleep(1)


# Load the model and start both threads once per server process.
# st.cache_resource keeps the result across reruns; it is shared by all
# sessions, which is acceptable for a single-user local app.
@st.cache_resource
def start_background_workers():
    audio_queue = queue.Queue()
    result_queue = queue.Queue()
    model = whisper.load_model('tiny')
    threading.Thread(target=capture_audio, args=(audio_queue, result_queue), daemon=True).start()
    threading.Thread(target=transcribe_and_respond, args=(model, audio_queue, result_queue), daemon=True).start()
    return result_queue


# Main script: runs top to bottom on every rerun and is the only place that
# touches the UI
st.title("AI/ML Interview Assistant")
st.markdown("Listening for interview questions...")

# Conversation history for this browser session
if "conversation" not in st.session_state:
    st.session_state.conversation = []

result_queue = start_background_workers()

# Move finished results from the worker threads into the session history
while True:
    try:
        st.session_state.conversation.append(result_queue.get_nowait())
    except queue.Empty:
        break

for msg in st.session_state.conversation:
    if "error" in msg:
        st.error(msg['error'])
    else:
        st.markdown(f"**You:** {msg['user']}")
        st.markdown(f"**Assistant:** {msg['assistant']}")

# Poll for new results by rerunning the script every half second
time.sleep(0.5)
rerun()

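Additional notes:

  • Make sure OPENAI_API_KEY is set in your environment variables. Assigning the key directly in code also works, but hard-coding a key is not recommended
  • Before running, confirm that microphone access is fully granted (System Settings → Privacy & Security → Microphone). On macOS the permission applies to the application that launches Streamlit, such as your terminal or IDE
  • If the microphone still is not detected, call sd.query_devices() to list all available devices, then pass the device ID through the device parameter of InputStream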
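The device lookup in the last note can be sketched as a small pure function. It assumes the input looks like the list of dictionaries returned by sd.query_devices(), where each entry has 'name' and 'max_input_channels' keys. `pick_input_device` and the sample device list are hypothetical and only for illustration.

```python
def pick_input_device(devices: list[dict], name_hint: str = "") -> int:
    """Return the index of the first device that can record audio.

    An empty name_hint matches any input-capable device.
    """
    for index, dev in enumerate(devices):
        if dev["max_input_channels"] < 1:
            continue  # output-only device, cannot record
        if name_hint.lower() in dev["name"].lower():
            return index
    raise LookupError("no matching input device found")


# Made-up device list; real entries come from sounddevice.query_devices().
sample_devices = [
    {"name": "MacBook Pro Speakers", "max_input_channels": 0},
    {"name": "MacBook Pro Microphone", "max_input_channels": 1},
]
mic_index = pick_input_device(sample_devices, "microphone")
```

With the real library, something like pick_input_device(list(sd.query_devices())) would supply the index to pass as device=... when creating the InputStream.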

Note: this content comes from Stack Exchange; the question was asked by Melissa.
