How do I integrate Mozilla DeepSpeech into a PHP web application for speech-to-text and wake word detection?

阿华AIGC实验室

2026-5-29

Can Mozilla DeepSpeech Handle Firefox-Recorded WAV Audio?

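Yes, with one caveat. The released DeepSpeech English models expect 16 kHz, 16-bit, mono PCM audio in a WAV container. Firefox’s MediaRecorder does not produce WAV (it records compressed audio such as Ogg/Opus), and the sampleRate, channelCount and sampleSize values you pass to getUserMedia are requests that the browser may not honor. So you either build the WAV yourself in the browser from raw Web Audio samples, or upload whatever the browser records and convert it on the server before handing it to DeepSpeech.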
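To make the target format concrete, here’s a minimal sketch of packing raw mono samples into a 16-bit PCM WAV file in the browser. The helper name encodeWav is my own (it is not part of DeepSpeech or any browser API), and it assumes you already have Float32 samples at the right sample rate, for example collected through the Web Audio API.

```javascript
// Pack mono Float32 samples (range -1..1) into a 16-bit PCM WAV file.
// encodeWav is a hypothetical helper name used for illustration only.
function encodeWav(samples, sampleRate = 16000) {
  const bytesPerSample = 2;
  const dataSize = samples.length * bytesPerSample;
  const buffer = new ArrayBuffer(44 + dataSize);
  const view = new DataView(buffer);

  const writeAscii = (offset, text) => {
    for (let i = 0; i < text.length; i++) {
      view.setUint8(offset + i, text.charCodeAt(i));
    }
  };

  writeAscii(0, 'RIFF');
  view.setUint32(4, 36 + dataSize, true);                // file size minus the first 8 bytes
  writeAscii(8, 'WAVE');
  writeAscii(12, 'fmt ');
  view.setUint32(16, 16, true);                          // fmt chunk size for PCM
  view.setUint16(20, 1, true);                           // audio format 1 = linear PCM
  view.setUint16(22, 1, true);                           // channels: mono
  view.setUint32(24, sampleRate, true);                  // sample rate
  view.setUint32(28, sampleRate * bytesPerSample, true); // byte rate
  view.setUint16(32, bytesPerSample, true);              // block align
  view.setUint16(34, 16, true);                          // bits per sample
  writeAscii(36, 'data');
  view.setUint32(40, dataSize, true);                    // size of the sample data

  // Clamp each sample and scale it to the signed 16-bit range
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i]));
    view.setInt16(44 + i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return buffer;
}
```

In the browser you could wrap the result for upload with new Blob([encodeWav(samples)], { type: 'audio/wav' }).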

Step-by-Step Workflow: From Firefox Mic to DeepSpeech Transcription

Here’s a detailed, actionable breakdown of the end-to-end process:

1. Capture Audio in Firefox (Frontend)

Use getUserMedia and MediaRecorder to record from the user’s microphone, asking for specs as close to DeepSpeech’s as the browser will give you. Here’s a practical code snippet to kick things off:

// Request mic access with DeepSpeech-friendly constraints (browsers treat these as hints, not guarantees)
async function startCommandRecording() {
  try {
    const stream = await navigator.mediaDevices.getUserMedia({
      audio: {
        sampleRate: 16000,
        channelCount: 1,
        sampleSize: 16,
        echoCancellation: true // Optional but improves audio quality
      }
    });

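    // Firefox's MediaRecorder does not emit WAV, so only request it where it is
    // actually supported; otherwise fall back to the browser's default format
    // (the server must then convert the upload to 16 kHz mono 16-bit WAV).
    const wavType = 'audio/wav; codecs=pcm_s16le';
    const options = MediaRecorder.isTypeSupported(wavType) ? { mimeType: wavType } : {};
    const mediaRecorder = new MediaRecorder(stream, options);

    const audioChunks = [];
    mediaRecorder.ondataavailable = (e) => audioChunks.push(e.data);
    mediaRecorder.onstop = async () => {
      stream.getTracks().forEach(track => track.stop()); // Release the mic before uploading
      const audioBlob = new Blob(audioChunks, { type: mediaRecorder.mimeType });
      await sendAudioToBackend(audioBlob);
    };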

    // Start recording (stop after 5 seconds or let the user trigger stop)
    mediaRecorder.start();
    setTimeout(() => mediaRecorder.stop(), 5000);
  } catch (err) {
    console.error("Mic access denied or error:", err);
  }
}

// Send recorded audio to PHP backend for transcription
async function sendAudioToBackend(blob) {
  const formData = new FormData();
  formData.append('command_audio', blob, 'voice_command.wav');
  
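  const response = await fetch('/process-voice-command.php', {
    method: 'POST',
    body: formData
  });
  if (!response.ok) {
    console.error('Upload failed with HTTP status', response.status);
    return;
  }

  const result = await response.json();
  if (result.error) {
    console.error('Transcription error:', result.error);
    return;
  }
  // Map transcribed text to page navigation
  handleCommand(result.transcript);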
}

function handleCommand(transcript) {
  // DeepSpeech's English models output lowercase words with no hyphens,
  // so match on plain words ("end of day", not "end-of-day")
  const lowerTranscript = (transcript || '').toLowerCase();
  if (lowerTranscript.includes('make sales')) {
    window.location.href = '/create-sales.php';
  } else if (lowerTranscript.includes('make purchase order')) {
    window.location.href = '/create-purchase.php';
  } else if (lowerTranscript.includes('open end of day reports')) {
    window.location.href = '/eod-reports.php';
  }
  // Add more command mappings here
}
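If the browser ignores the 16 kHz request and you end up collecting raw samples at the device rate (48 kHz is common), you’ll need to resample before building the WAV. Here’s a rough sketch that averages groups of input samples; downsampleTo16k is a name I made up, and plain averaging is a crude filter, so treat it as a starting point rather than a production-quality resampler.

```javascript
// Downsample mono Float32 audio by averaging the input samples that map to each output sample.
// downsampleTo16k is a hypothetical helper; averaging is only a rough low-pass filter.
function downsampleTo16k(input, inputRate, targetRate = 16000) {
  if (inputRate === targetRate) {
    return Float32Array.from(input); // nothing to do, return a copy
  }
  if (inputRate < targetRate) {
    throw new RangeError('Upsampling is not supported by this sketch');
  }

  const ratio = inputRate / targetRate;
  const outLength = Math.floor(input.length / ratio);
  const output = new Float32Array(outLength);

  for (let i = 0; i < outLength; i++) {
    const start = Math.floor(i * ratio);
    const end = Math.min(input.length, Math.floor((i + 1) * ratio));
    let sum = 0;
    for (let j = start; j < end; j++) {
      sum += input[j];
    }
    output[i] = sum / (end - start);
  }
  return output;
}
```

For example, downsampleTo16k(samples, audioContext.sampleRate) would take samples captured from an AudioContext down to the rate DeepSpeech expects.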

2. Process Audio in PHP Backend

PHP doesn’t have an official DeepSpeech binding, so you’ll interface with DeepSpeech via its command-line tool (most straightforward for PHP setups). First, install DeepSpeech on your server (follow their official docs for your OS), then create a process-voice-command.php endpoint:

<?php
header('Content-Type: application/json');

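if (!isset($_FILES['command_audio']) || $_FILES['command_audio']['error'] !== UPLOAD_ERR_OK) {
  http_response_code(400);
  echo json_encode(['error' => 'Failed to upload audio file']);
  exit;
}

$tempFilePath = $_FILES['command_audio']['tmp_name'];
$modelPath = '/path/to/deepspeech-0.9.3-models.pbmm';
$scorerPath = '/path/to/deepspeech-0.9.3-models.scorer';

// Assumption: ffmpeg is installed on the server. Browsers often upload Ogg/WebM
// rather than WAV, so normalize every upload to 16 kHz mono 16-bit PCM first.
$wavPath = sys_get_temp_dir() . '/cmd_' . bin2hex(random_bytes(8)) . '.wav';
$convert = sprintf(
  'ffmpeg -y -loglevel error -i %s -ar 16000 -ac 1 -c:a pcm_s16le %s',
  escapeshellarg($tempFilePath),
  escapeshellarg($wavPath)
);
exec($convert, $convertOutput, $convertExit);

if ($convertExit !== 0) {
  http_response_code(500);
  echo json_encode(['error' => 'Audio conversion failed']);
  exit;
}

// Run the DeepSpeech CLI; the audio file is passed with --audio and the
// transcript is printed to stdout
$command = sprintf(
  'deepspeech --model %s --scorer %s --audio %s',
  escapeshellarg($modelPath),
  escapeshellarg($scorerPath),
  escapeshellarg($wavPath)
);

exec($command, $output, $exitCode);
@unlink($wavPath); // Remove the converted file whether or not transcription worked

if ($exitCode !== 0) {
  http_response_code(500);
  echo json_encode(['error' => 'Transcription failed']);
  exit;
}

$transcript = trim(implode(' ', $output));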
echo json_encode(['transcript' => $transcript]);
?>

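Note: Make sure the web server user is allowed to run the deepspeech command (exec must not be listed in PHP’s disable_functions, and the binary must be on that user’s PATH or called by absolute path), and that the model and scorer files are in an accessible path.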

3. Map Transcript to Actions

As shown in the frontend handleCommand function, match the transcribed text to your desired page navigation. Use case-insensitive checks and partial matches to make the system more robust (users won’t always speak commands perfectly).
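One way to keep that matching manageable as the command list grows is a small lookup table plus a normalization step. The sketch below is just an illustration: COMMANDS, normalize and matchCommand are names I’ve introduced, and the phrases and URLs are the same ones used in handleCommand.

```javascript
// Hypothetical command table; phrases and URLs mirror the handleCommand example.
const COMMANDS = [
  { phrase: 'make sales', url: '/create-sales.php' },
  { phrase: 'make purchase order', url: '/create-purchase.php' },
  { phrase: 'open end-of-day reports', url: '/eod-reports.php' }
];

// Lowercase, replace punctuation (including hyphens) with spaces, collapse whitespace.
function normalize(text) {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, ' ')
    .replace(/\s+/g, ' ')
    .trim();
}

// Return the URL of the first command whose phrase appears in the transcript, or null.
function matchCommand(transcript) {
  const spoken = normalize(transcript ?? '');
  for (const command of COMMANDS) {
    if (spoken.includes(normalize(command.phrase))) {
      return command.url;
    }
  }
  return null;
}
```

Because both sides go through normalize, "open end-of-day reports" in the table still matches a transcript that reads "open end of day reports". A null result is your cue to show an "I didn’t catch that" message instead of navigating.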

Implementing a Wake Word Feature (Like "OK Google")

DeepSpeech doesn’t include native wake word detection, but you can pair it with an open source solution for an always-listening experience. Here are two practical approaches:

Option 1: Client-Side Wake Word Detector (Recommended)

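For better performance, run wake word detection directly in the browser with a lightweight library such as Picovoice Porcupine (its SDKs are published on GitHub and it runs in Firefox, but it needs an access key, and custom wake words are trained through Picovoice’s own tooling) or Vosk (a small offline recognizer that you can restrict to a few keywords). Once the wake word is detected, trigger the audio recording/transcription flow outlined earlier.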

Simplified example with Porcupine:

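async function initWakeWordDetector() {
  // Placeholder call: the real Porcupine Web SDK entry point and its arguments
  // differ between SDK versions, so check the Picovoice docs for yours.
  // A custom phrase such as "OK WEBAPP" needs its own trained keyword model.
  const porcupine = await Porcupine.fromKeywords(['ok webapp'], 'your-access-key');

  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const audioContext = new AudioContext({ sampleRate: porcupine.sampleRate });
  const source = audioContext.createMediaStreamSource(stream);
  // ScriptProcessorNode is deprecated; an AudioWorklet is the modern replacement
  const processor = audioContext.createScriptProcessor(4096, 1, 1);

  source.connect(processor);
  processor.connect(audioContext.destination);

  processor.onaudioprocess = (e) => {
    // Web Audio delivers Float32 samples; if the engine expects 16-bit frames of a
    // fixed length, convert and re-chunk the buffer before calling process()
    const audioFrame = e.inputBuffer.getChannelData(0);
    const keywordIndex = porcupine.process(audioFrame);

    if (keywordIndex >= 0) {
      // Wake word detected: shut the detector down so the command cannot re-trigger it
      processor.onaudioprocess = null;
      processor.disconnect();
      source.disconnect();
      stream.getTracks().forEach(track => track.stop());
      audioContext.close();

      startCommandRecording();
      // Re-arm once the 5-second command recording has finished
      setTimeout(() => initWakeWordDetector().catch(console.error), 6000);
    }
  };
}

// Initialize on page load
initWakeWordDetector().catch(console.error);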
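Wake word engines generally want 16-bit integer frames of a fixed length rather than the Float32 buffers the Web Audio API hands you. Here’s a hedged sketch of that conversion; toInt16Frames is my own helper name, and the frame length you pass (512 is a typical value, but that is an assumption) should come from whatever your engine reports.

```javascript
// Convert Float32 samples (-1..1) into fixed-length Int16 frames.
// Samples that do not fill a whole frame are returned as `remainder` so the
// caller can prepend them to the next audio buffer.
// toInt16Frames is a hypothetical helper, not part of any wake word SDK.
function toInt16Frames(samples, frameLength) {
  const frames = [];
  const fullFrames = Math.floor(samples.length / frameLength);

  for (let f = 0; f < fullFrames; f++) {
    const frame = new Int16Array(frameLength);
    for (let i = 0; i < frameLength; i++) {
      const s = Math.max(-1, Math.min(1, samples[f * frameLength + i]));
      frame[i] = s < 0 ? s * 0x8000 : s * 0x7fff; // scale to the signed 16-bit range
    }
    frames.push(frame);
  }

  const remainder = samples.slice(fullFrames * frameLength);
  return { frames, remainder };
}
```

Inside onaudioprocess you would join the previous remainder with the new buffer, call toInt16Frames, and feed each frame to the engine in turn.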

Option 2: Continuous DeepSpeech Transcription (Less Efficient)

If you want to use only DeepSpeech, run continuous background transcription and check each segment for your wake word. This uses more CPU but avoids adding another library. Optimize by running shorter inference windows and looking for the wake phrase in the output.
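The "check each segment" part can be sketched like this, assuming you already receive a transcript string for every short audio window. createWakePhraseDetector is a hypothetical helper; it keeps the last few words around so a wake phrase that straddles two windows is still caught.

```javascript
// Build a detector that is fed one transcript per audio window and reports
// whether the wake phrase has appeared in the most recent words.
// createWakePhraseDetector is a hypothetical helper for illustration.
function createWakePhraseDetector(wakePhrase, maxWords = 12) {
  const target = wakePhrase.toLowerCase().trim().split(/\s+/).join(' ');
  let recentWords = [];

  return function feed(segmentTranscript) {
    const words = segmentTranscript.toLowerCase().split(/\s+/).filter(Boolean);
    // Keep only the newest words so the window stays small
    recentWords = recentWords.concat(words).slice(-maxWords);

    // Pad with spaces so only whole words match ("book webapps" must not trigger)
    const haystack = ' ' + recentWords.join(' ') + ' ';
    if (haystack.includes(' ' + target + ' ')) {
      recentWords = []; // reset so one utterance triggers only once
      return true;
    }
    return false;
  };
}
```

You’d create the detector once, call it with each new transcript, and start command recording the first time it returns true. This only handles phrases split between windows at word boundaries; a word cut in half by the window edge would need overlapping windows.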

Key Tips

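Model Optimization: Use the memory-mapped DeepSpeech model (.pbmm format), which loads faster and uses less memory than the plain .pb graph; for low-power devices, the smaller .tflite model is the other option.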
Error Handling: Add user-friendly messages for mic access denial, upload failures, and unrecognized commands.
Performance: Offload transcription to the backend to keep the frontend responsive, especially on low-end devices.
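For the error-handling tip, a small mapping from failure type to message goes a long way. In this sketch, NotAllowedError and NotFoundError are the standard error names getUserMedia rejects with, while UploadError and UnknownCommandError are names I’ve invented for your own code to throw; describeVoiceError is likewise hypothetical.

```javascript
// Translate an error into a message that is safe to show to the user.
// 'UploadError' and 'UnknownCommandError' are hypothetical application-level names.
function describeVoiceError(err) {
  const name = err && err.name;
  switch (name) {
    case 'NotAllowedError':
      return 'Microphone access was blocked. Please allow it in your browser and try again.';
    case 'NotFoundError':
      return 'No microphone was found on this device.';
    case 'UploadError':
      return 'Your recording could not be sent. Check your connection and try again.';
    case 'UnknownCommandError':
      return 'Sorry, that command was not recognized. Please try again.';
    default:
      return 'Something went wrong with voice control. Please try again.';
  }
}
```

You could call it from the catch block around getUserMedia, or anywhere else the recording and upload flow can fail, and show the returned string in the page instead of only logging to the console.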

The question comes from Stack Exchange; it was asked by Priyesh.