Help needed: real-time voice Q&A app built on Streamlit + Whisper does not pick up audio or return responses
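Hi everyone, I'm building a real-time voice Q&A app with Streamlit, OpenAI Whisper, and SoundDevice. It is supposed to:
- Listen to microphone input in real time
- (Optionally) transcribe the audio to text with Whisper
- Call an OpenAI language model to generate an answer
- Show the transcribed question and the AI's answer live in the Streamlit UI
Right now the app does not work as intended at all: after startup the UI shows "Listening for interview questions...", but nothing happens when I speak into the microphone, and occasionally it crashes outright with a "Connection Error" message.
Here is what I have tried so far, none of which fixed the problem:
- Switching the Whisper model from base to tiny to reduce memory usage
- Handling audio capture and transcription in a separate thread so the UI is not blocked
- Running on macOS Monterey on an M2 Pro machine
- Restarting the Streamlit app and confirming that microphone permission is enabled
My environment:
- OS: macOS Monterey 12.6
- Python: 3.10
- Streamlit: 1.25.0
- Whisper: latest
- SoundDevice: latest
- Hardware: 2024 MacBook Pro (M2 Pro)
Here is my code: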

    import os
    import openai
    import whisper
    import streamlit as st
    import sounddevice as sd
    import numpy as np
    import queue
    import threading
    import time

    # Initialize Whisper model
    whisper_model = whisper.load_model('tiny')

    # Streamlit setup
    st.title("AI/ML Interview Assistant")
    st.markdown("Listening for interview questions...")

    # Real-time audio queue
    audio_queue = queue.Queue()

    # Audio callback to capture microphone input
    def audio_callback(indata, frames, time, status):
        audio_queue.put(indata.copy())

    # Transcribe audio and generate responses
    def transcribe_and_respond():
        audio_data = []
        while True:
            try:
                if not audio_queue.empty():
                    audio_data.append(audio_queue.get())
                if len(audio_data) > 20:
                    audio_segment = np.concatenate(audio_data, axis=0)
                    audio_data.clear()
                    transcription = whisper_model.transcribe(audio_segment)
                    question = transcription['text']
                    st.text(f"You: {question}")
                    response = generate_response(question)
                    st.text(f"Assistant: {response}")
                time.sleep(1)
            except Exception as e:
                st.error(f"Error during transcription: {str(e)}")

    # Generate response using OpenAI API
    def generate_response(question):
        try:
            response = openai.Completion.create(
                model="text-davinci-003",
                prompt=f"Q: {question}\nA:",
                max_tokens=150
            )
            return response['choices'][0]['text'].strip()
        except Exception as e:
            return f"Error generating response: {str(e)}"

    # Start audio stream in a separate thread
    def start_audio_stream():
        try:
            stream = sd.InputStream(callback=audio_callback)
            with stream:
                threading.Thread(target=transcribe_and_respond, daemon=True).start()
                while True:
                    time.sleep(0.1)
        except Exception as e:
            st.error(f"Audio stream error: {str(e)}")

    # Start the audio stream
    start_audio_stream()

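Looking at your code, I see a few core problems. Here they are, each with a corresponding fix:
1. Streamlit thread safety (the most critical one)
Streamlit UI elements cannot be updated directly from a background thread. You call st.text() and st.error() inside transcribe_and_respond, which runs on a thread without Streamlit's script-run context, so the output is at best dropped and at worst crashes the app. In addition, start_audio_stream() is called at the top level and loops forever, so the script run never finishes. The usual fix is to have the background thread only produce data, for example by pushing results onto a thread-safe queue.Queue, and to let the main script thread copy those results into st.session_state and render them on a periodic rerun. Note that st.session_state is itself tied to the script-run context, so it should be written from the script thread rather than from the worker.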
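As a minimal sketch of that hand-off, here is the producer/consumer pattern using only the standard library, so it can be tried outside Streamlit. The names `worker`, `drain`, and `results` are illustrative and not part of any Streamlit API; in the app, the drained items would be appended to `st.session_state` on the script thread.

```python
import queue
import threading


def worker(results: queue.Queue, items: list[str]) -> None:
    """Stand-in for the transcription thread: produce data, touch no UI."""
    for item in items:
        results.put({"user": item, "assistant": item.upper()})


def drain(results: queue.Queue) -> list[dict]:
    """Collect everything currently queued without blocking the caller."""
    drained = []
    while True:
        try:
            drained.append(results.get_nowait())
        except queue.Empty:
            return drained


results: queue.Queue = queue.Queue()
thread = threading.Thread(
    target=worker, args=(results, ["hello", "world"]), daemon=True
)
thread.start()
thread.join()  # only for this demo; a real worker keeps running

conversation = drain(results)
```

Because `drain` never blocks, the script thread can call it on every rerun and simply gets an empty list when nothing new has arrived.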
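2. Incomplete audio capture logic
You only process audio once len(audio_data) > 20, but that count says nothing about duration: each queue item is one callback block, and how many seconds a block covers depends on the sample rate and block size, neither of which you set. There is also no silence filtering, so the buffer may take a long time to reach the threshold, or you may end up transcribing large amounts of silence. On top of that, Whisper's transcribe() expects a one-dimensional float32 waveform sampled at 16 kHz, whereas a default InputStream delivers arrays shaped (frames, channels) at the device's default sample rate.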
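To make the relationship between block count and duration concrete, here is a small sketch in plain Python. The helper names `blocks_for_duration` and `is_silent` are hypothetical, and the 0.01 threshold is only a starting value that would need tuning for a given microphone.

```python
import math


def blocks_for_duration(seconds: float, samplerate: int, blocksize: int) -> int:
    """Number of callback blocks needed to cover at least `seconds` of audio."""
    return math.ceil(seconds * samplerate / blocksize)


def is_silent(samples: list[float], threshold: float = 0.01) -> bool:
    """True when the mean absolute amplitude of a block is below the threshold."""
    if not samples:
        return True
    return sum(abs(s) for s in samples) / len(samples) < threshold


# Two seconds of audio needs a very different block count at different rates
at_16k = blocks_for_duration(2.0, samplerate=16000, blocksize=1024)
at_44k = blocks_for_duration(2.0, samplerate=44100, blocksize=1024)
```

The same fixed count of 20 blocks would therefore cover well under a second at 44.1 kHz but more than a second at 16 kHz, which is why the buffer size should be derived from the stream settings.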
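3. OpenAI API key not configured
Your code never sets openai.api_key, so unless the OPENAI_API_KEY environment variable is already set, every API call will fail. Because generate_response() catches the exception, this would show up as an error string in place of the answer. The "Connection Error" banner itself more likely means the browser lost contact with the Streamlit server process, which points back to the threading problem in item 1, but the key still has to be configured for the answers to work.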
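One way to surface a missing key early, rather than on the first API call, is a small check at startup. This is only a sketch: `require_api_key` is a hypothetical helper, and the `env` parameter exists purely so the function can be exercised without touching the real environment.

```python
import os


def require_api_key(env=None, name="OPENAI_API_KEY") -> str:
    """Return the key from the environment mapping or raise a clear error."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before starting the app")
    return key
```

In the app this could be called once near the top of the script, for example `openai.api_key = require_api_key()`, so a misconfiguration fails loudly before any audio is captured.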
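4. SoundDevice device selection
On an M2 Mac the default audio device may not be the microphone you expect, or SoundDevice may not detect the input device correctly, in which case no audio is captured at all.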
Complete fixed code example

    import os
    import queue
    import threading
    import time

    import numpy as np
    import openai
    import sounddevice as sd
    import streamlit as st
    import whisper

    # Read the OpenAI API key from the environment (do not hard-code it)
    openai.api_key = os.getenv("OPENAI_API_KEY")

    # Audio settings matching what Whisper expects: 16 kHz, mono, float32
    SAMPLERATE = 16000
    CHANNELS = 1
    BLOCKSIZE = 1024          # frames per callback, set explicitly
    SILENCE_THRESHOLD = 0.01  # mean absolute amplitude below this is silence
    BUFFER_BLOCKS = 31        # about 2 s of audio: 31 * 1024 / 16000

    # Generate response using the OpenAI API
    def generate_response(question):
        try:
            # Legacy Completion API: needs openai<1.0. text-davinci-003 has
            # been retired by OpenAI, so substitute a model you can access.
            response = openai.Completion.create(
                model="text-davinci-003",
                prompt=f"Q: {question}\nA:",
                max_tokens=150,
                temperature=0.7,
            )
            return response["choices"][0]["text"].strip()
        except Exception as e:
            return f"Failed to generate response: {e}"

    @st.cache_resource  # runs once per server process, not on every rerun
    def start_pipeline():
        whisper_model = whisper.load_model("tiny")
        audio_queue = queue.Queue()   # audio callback -> worker thread
        result_queue = queue.Queue()  # worker thread -> script thread

        # Audio callback: keep only blocks above the silence threshold
        def audio_callback(indata, frames, time_info, status):
            if status:
                print(status)
            if np.abs(indata).mean() > SILENCE_THRESHOLD:
                audio_queue.put(indata.copy())

        # Background worker: transcribes and answers, never touches Streamlit
        def transcribe_and_respond():
            audio_buffer = []
            while True:
                try:
                    audio_buffer.append(audio_queue.get())  # blocks until audio arrives
                    if len(audio_buffer) < BUFFER_BLOCKS:
                        continue
                    # Whisper wants a 1-D float32 array
                    audio_segment = np.concatenate(audio_buffer, axis=0).flatten()
                    audio_buffer.clear()
                    transcription = whisper_model.transcribe(
                        audio_segment, language="en", fp16=False
                    )
                    question = transcription["text"].strip()
                    if question:  # skip empty transcriptions
                        result_queue.put(
                            {"user": question, "assistant": generate_response(question)}
                        )
                except Exception as e:
                    result_queue.put({"error": f"Error: {e}"})
                    time.sleep(1)

        stream = sd.InputStream(
            samplerate=SAMPLERATE,
            channels=CHANNELS,
            blocksize=BLOCKSIZE,
            dtype="float32",
            callback=audio_callback,
        )
        stream.start()
        threading.Thread(target=transcribe_and_respond, daemon=True).start()
        return stream, result_queue  # the cache keeps the stream alive

    st.title("AI/ML Interview Assistant")
    st.markdown("Listening for interview questions...")

    if "conversation" not in st.session_state:
        st.session_state.conversation = []

    try:
        _stream, result_queue = start_pipeline()
    except Exception as e:
        st.error(f"Audio stream error: {e}")
        st.stop()

    # Script thread: move finished results into session_state, then render
    # (single-user sketch: all browser sessions share one result queue)
    while not result_queue.empty():
        st.session_state.conversation.append(result_queue.get_nowait())

    for msg in st.session_state.conversation:
        if "error" in msg:
            st.error(msg["error"])
        else:
            st.markdown(f"**You:** {msg['user']}")
            st.markdown(f"**Assistant:** {msg['assistant']}")

    # Poll for new results once a second (use st.rerun() on Streamlit >= 1.27)
    time.sleep(1)
    st.experimental_rerun()

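Additional notes:
- Make sure OPENAI_API_KEY is set in your environment before launching the app. Assigning the key directly in code also works, but hard-coding keys is not recommended
- Before running, confirm that microphone access is fully enabled (System Preferences → Security & Privacy → Microphone on Monterey; System Settings → Privacy & Security → Microphone on later versions). On macOS the permission is typically granted to the app that launches Streamlit, such as your terminal or IDE
- If the microphone is still not detected, call sd.query_devices() to list all available devices, then pass the device ID to the device parameter of InputStream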
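The device lookup from the last note could be sketched as follows. `pick_input_device` and `open_stream` are hypothetical helpers; the only SoundDevice features assumed are that `sd.query_devices()` yields dict-like entries with `name` and `max_input_channels`, and that `sd.InputStream` accepts a `device` index.

```python
def pick_input_device(devices, name_hint=None):
    """Return the index of the first device that can record, optionally by name."""
    for index, dev in enumerate(devices):
        if dev["max_input_channels"] < 1:
            continue  # output-only device, cannot be used as a microphone
        if name_hint is None or name_hint.lower() in dev["name"].lower():
            return index
    return None


def open_stream(callback, name_hint=None):
    """Open an InputStream on the chosen device (needs sounddevice and a mic)."""
    import sounddevice as sd  # imported here so the helper above has no dependencies

    device = pick_input_device(sd.query_devices(), name_hint)
    if device is None:
        raise RuntimeError("No input device found")
    return sd.InputStream(
        device=device, channels=1, samplerate=16000, callback=callback
    )
```

For example, `open_stream(audio_callback, name_hint="microphone")` would prefer a device whose name contains "microphone" and raise a clear error if no input device exists.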
Note: content sourced from Stack Exchange; the question was asked by Melissa.




