基于Streamlit+Whisper的实时语音问答APP无法监听音频并返回响应的问题求助

阿华AIGC实验室

2026-4-13

大家好，我正在开发一个实时语音问答应用，用到了Streamlit、OpenAI Whisper和SoundDevice，原本期望实现这些功能：

实时监听麦克风的音频输入
（可选）用Whisper把音频转成文字
调用OpenAI语言模型生成回答
在Streamlit界面实时显示转录的问题和AI的回答

但目前应用完全没法按预期工作——启动后UI显示“Listening for interview questions...”，但对着麦克风说话完全没反应，偶尔还会直接崩溃弹出“Connection Error”提示。

我已经尝试了这些方法，但都没解决问题：

把Whisper模型从base换成tiny，减少内存占用
用单独的线程处理音频捕获和转录，避免阻塞UI
在搭载M2 Pro芯片的macOS Monterey上运行
重启Streamlit应用，并且确认麦克风权限已经开启

我的开发环境：

OS：macOS Monterey 12.6
Python版本：3.10
Streamlit版本：1.25.0
Whisper版本：最新版
SoundDevice版本：最新版
硬件：2024款MacBook Pro（M2 Pro芯片）

以下是我的代码：

import os
import openai
import whisper
import streamlit as st
import sounddevice as sd
import numpy as np
import queue
import threading
import time

# Initialize Whisper model
whisper_model = whisper.load_model('tiny')

# Streamlit setup
st.title("AI/ML Interview Assistant")
st.markdown("Listening for interview questions...")

# Real-time audio queue
audio_queue = queue.Queue()

# Audio callback to capture microphone input
def audio_callback(indata, frames, time, status):
    audio_queue.put(indata.copy())

# Transcribe audio and generate responses
def transcribe_and_respond():
    audio_data = []
    while True:
        try:
            if not audio_queue.empty():
                audio_data.append(audio_queue.get())
                if len(audio_data) > 20:
                    audio_segment = np.concatenate(audio_data, axis=0)
                    audio_data.clear()
                    transcription = whisper_model.transcribe(audio_segment)
                    question = transcription['text']
                    st.text(f"You: {question}")
                    response = generate_response(question)
                    st.text(f"Assistant: {response}")
                    time.sleep(1)
        except Exception as e:
            st.error(f"Error during transcription: {str(e)}")

# Generate response using OpenAI API
def generate_response(question):
    try:
        response = openai.Completion.create(
            model="text-davinci-003",
            prompt=f"Q: {question}\nA:",
            max_tokens=150
        )
        return response['choices'][0]['text'].strip()
    except Exception as e:
        return f"Error generating response: {str(e)}"

# Start audio stream in a separate thread
def start_audio_stream():
    try:
        stream = sd.InputStream(callback=audio_callback)
        with stream:
            threading.Thread(target=transcribe_and_respond, daemon=True).start()
            while True:
                time.sleep(0.1)
    except Exception as e:
        st.error(f"Audio stream error: {str(e)}")

# Start the audio stream
start_audio_stream()

针对你的问题，我梳理了几个核心问题点和对应的修复方案：

1. Streamlit线程安全问题（最关键）

Streamlit的UI组件不能在后台线程直接更新，你现在在transcribe_and_respond里直接调用st.text()和st.error()，这会导致UI渲染混乱，甚至触发崩溃。正确的做法是用st.session_state作为数据桥梁，后台线程只负责生成数据，主线程定期读取并更新UI。

2. 音频捕获逻辑不完善

你现在判断len(audio_data) > 20才处理音频，但这个数值和实际的音频时长无关（不同采样率下，每帧的时间长度不同），而且没有过滤静音数据，可能导致一直积累不到触发条件，或者处理大量无效静音。

3. OpenAI API密钥未配置

你的代码里完全没有设置openai.api_key，这会直接导致API调用失败，出现Connection Error。

4. SoundDevice设备适配问题

M2 Mac上可能默认音频设备不是麦克风，或者SoundDevice没有正确识别到输入设备，导致无法捕获音频。

修复后的完整代码示例

import os
import openai
import whisper
import streamlit as st
import sounddevice as sd
import numpy as np
import queue
import threading
import time

# 配置OpenAI API密钥（建议从环境变量读取，不要硬编码）
openai.api_key = os.getenv("OPENAI_API_KEY")

# Initialize Whisper model
whisper_model = whisper.load_model('tiny')

# Streamlit初始化
st.title("AI/ML Interview Assistant")
status_text = st.markdown("Listening for interview questions...")

# 用session_state存储对话历史，避免线程安全问题
if "conversation" not in st.session_state:
    st.session_state.conversation = []
conversation_container = st.container()

# Real-time audio queue
audio_queue = queue.Queue()
# 音频参数配置，和Whisper默认一致
SAMPLERATE = 16000
CHANNELS = 1

# 静音检测阈值，过滤无效静音
SILENCE_THRESHOLD = 0.01

# Audio callback to capture microphone input
def audio_callback(indata, frames, time, status):
    if status:
        print(status)
    # 只保留超过静音阈值的音频数据
    if np.abs(indata).mean() > SILENCE_THRESHOLD:
        audio_queue.put(indata.copy())

# Transcribe audio and generate responses（后台线程，只处理数据不更新UI）
def transcribe_and_respond():
    audio_buffer = []
    # 累计约2秒的音频再处理（16000采样率下，每帧默认是1024，2秒约31帧）
    MAX_BUFFER_FRAMES = 30
    while True:
        try:
            if not audio_queue.empty():
                audio_buffer.append(audio_queue.get())
                if len(audio_buffer) >= MAX_BUFFER_FRAMES:
                    # 拼接音频并转换为Whisper需要的格式
                    audio_segment = np.concatenate(audio_buffer, axis=0).flatten()
                    audio_buffer.clear()
                    # Whisper处理
                    transcription = whisper_model.transcribe(audio_segment, language="en")
                    question = transcription['text'].strip()
                    if question:  # 确保转录到有效文本
                        # 生成回答
                        response = generate_response(question)
                        # 把结果放到session_state，交给主线程更新UI
                        st.session_state.conversation.append({"user": question, "assistant": response})
                    time.sleep(0.5)
        except Exception as e:
            # 把错误信息放到session_state
            st.session_state.conversation.append({"error": f"Error: {str(e)}"})
            time.sleep(1)

# Generate response using OpenAI API
def generate_response(question):
    try:
        response = openai.Completion.create(
            model="text-davinci-003",
            prompt=f"Q: {question}\nA:",
            max_tokens=150,
            temperature=0.7
        )
        return response['choices'][0]['text'].strip()
    except Exception as e:
        return f"Failed to generate response: {str(e)}"

# Start audio stream in a separate thread
def start_audio_stream():
    try:
        # 指定采样率和通道，和Whisper匹配
        stream = sd.InputStream(
            samplerate=SAMPLERATE,
            channels=CHANNELS,
            callback=audio_callback
        )
        with stream:
            # 启动转录线程
            threading.Thread(target=transcribe_and_respond, daemon=True).start()
            # 保持主线程运行
            while True:
                time.sleep(0.1)
    except Exception as e:
        st.session_state.conversation.append({"error": f"Audio stream error: {str(e)}"})

# 主线程：定期更新UI
def update_ui():
    while True:
        with conversation_container:
            # 清空容器再重新渲染对话
            st.empty()
            for msg in st.session_state.conversation:
                if "user" in msg:
                    st.markdown(f"**You:** {msg['user']}")
                elif "assistant" in msg:
                    st.markdown(f"**Assistant:** {msg['assistant']}")
                elif "error" in msg:
                    st.error(msg['error'])
        time.sleep(0.5)

# 启动音频流和UI更新线程
if not st.session_state.get("started", False):
    st.session_state["started"] = True
    threading.Thread(target=start_audio_stream, daemon=True).start()
    threading.Thread(target=update_ui, daemon=True).start()