Using Google Cloud APIs: Timestamped Speech-to-Text, Then Text-to-Speech That Preserves Pause Durations

Ahua AIGC Lab

2026-5-8

Great question! I’ve worked through this exact use case with Google Cloud’s speech APIs. The key is to use the word-level timestamps from your Speech-to-Text output to build a precisely timed SSML script. That script reproduces the original audio’s timing, including individual word durations and the pauses between words, so your new voice tracks the original’s runtime closely. Let’s break this down:

Core Approach

Instead of feeding the raw transcript directly to Text-to-Speech, we’ll:

Extract every word’s start/end time from your Speech-to-Text JSON response
Calculate the exact pause length between each word (and before the first word, if there’s leading silence)
Build an SSML document that uses <prosody> to lock each word’s duration to match the original, and <break> to replicate silences
Pass this SSML to Text-to-Speech to generate the time-aligned replacement audio

Step 1: Parse Your Speech-to-Text Results

First, pull the word-level timing data from the JSON response you already have. Let’s assume your response looks something like this (trimmed for brevity):

{
  "results": [
    {
      "alternatives": [
        {
          "words": [
            {"word": "Hello", "start_time": "0.0s", "end_time": "0.5s"},
            {"word": "world", "start_time": "0.7s", "end_time": "1.2s"},
            {"word": "this", "start_time": "1.5s", "end_time": "1.8s"}
          ]
        }
      ]
    }
  ]
}

In code (Python example), you’d extract the words list like so:

# Assume `speech_to_text_response` is your parsed JSON result
words = speech_to_text_response["results"][0]["alternatives"][0]["words"]
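The exact field names and time format depend on how your response was serialized, so it can help to normalize them first. Here is a minimal sketch, assuming times arrive either as strings like "0.5s" or as plain numbers, and that keys may be snake_case (as in the sample above) or camelCase (startTime/endTime, as in the REST JSON). The helper names parse_seconds and word_times are hypothetical, and falling back to "0s" for a missing time is my own assumption:

```python
def parse_seconds(value):
    """Convert '0.5s', '0.5', or a plain number to float seconds."""
    if isinstance(value, (int, float)):
        return float(value)
    return float(str(value).strip().rstrip("s"))


def word_times(word_entry):
    """Return (word, start, end) for one entry, accepting either key style."""
    # Assumption: a missing time field is treated as 0 seconds
    start = word_entry.get("start_time", word_entry.get("startTime", "0s"))
    end = word_entry.get("end_time", word_entry.get("endTime", "0s"))
    return word_entry["word"], parse_seconds(start), parse_seconds(end)


sample = {"word": "world", "startTime": "0.7s", "endTime": "1.2s"}
print(word_times(sample))
```

If your data already matches the sample shape exactly, you can skip this and index the fields directly as in the steps that follow.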

Step 2: Calculate Timing Details

Next, we’ll loop through each word to compute:

How long to pause before the word starts (compared to the end of the previous word)
How long the word itself should be spoken (to match the original’s duration)

timing_data = []
previous_end_time = 0.0

for word_entry in words:
    # Convert time strings to float values (strip only the trailing "s" suffix)
    start_time = float(word_entry["start_time"].rstrip("s"))
    end_time = float(word_entry["end_time"].rstrip("s"))
    
    # Calculate pause before this word (avoid negative values from minor rounding errors)
    pause_before = max(0.0, start_time - previous_end_time)
    # Calculate how long the word should be spoken
    word_duration = end_time - start_time
    
    timing_data.append({
        "word": word_entry["word"],
        "pause_before": pause_before,
        "duration": word_duration
    })
    
    # Update previous end time for the next iteration
    previous_end_time = end_time
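A quick sanity check on the result: if no words overlap, the pauses and word durations should add up to the end time of the last word. The sketch below assumes entries shaped like the timing_data items above. expected_runtime is a hypothetical helper name, and the sample values are typed in by hand from the three example words:

```python
import math


def expected_runtime(timing_data):
    """Sum every pause and word duration to get the target runtime in seconds."""
    return sum(entry["pause_before"] + entry["duration"] for entry in timing_data)


# Hand-written values for the sample words "Hello", "world", "this"
sample_timing = [
    {"word": "Hello", "pause_before": 0.0, "duration": 0.5},
    {"word": "world", "pause_before": 0.2, "duration": 0.5},
    {"word": "this", "pause_before": 0.3, "duration": 0.3},
]

total = expected_runtime(sample_timing)
# The last sample word ends at 1.8s, so the total should match that
print(math.isclose(total, 1.8))
```

If the total comes out shorter than the last end time, some words probably overlapped and their negative gaps were clamped to zero.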

Step 3: Build the Precision SSML

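Now we’ll construct an SSML string that uses <break> for pauses and <prosody duration> to set each word’s speaking time, so the synthesized audio follows the original’s timing. One caveat: duration is a standard SSML prosody attribute, but Google’s SSML reference lists only rate, pitch, and volume as supported on <prosody>, so confirm that your chosen voice actually honors duration before relying on this step: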

from xml.sax.saxutils import escape

ssml_parts = ["<speak>"]

for entry in timing_data:
    # Add pause before the word (ignore tiny pauses under 0.01s to avoid noise)
    if entry["pause_before"] > 0.01:
        ssml_parts.append(f'<break time="{entry["pause_before"]:.2f}s"/>')

    # Escape XML special characters (e.g. "&") so the SSML stays well-formed,
    # then wrap the word in prosody to request its original duration
    word = escape(entry["word"])
    ssml_parts.append(f'<prosody duration="{entry["duration"]:.2f}s">{word}</prosody>')

ssml_parts.append("</speak>")
# Join with newlines so adjacent words stay separated by whitespace
final_ssml = "\n".join(ssml_parts)

The resulting SSML will look like this for our sample words:

<speak>
<prosody duration="0.50s">Hello</prosody>
<break time="0.20s"/>
<prosody duration="0.50s">world</prosody>
<break time="0.30s"/>
<prosody duration="0.30s">this</prosody>
</speak>
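Before sending the SSML anywhere, it is worth checking that it parses as XML and that the timing it encodes adds up to what you expect. This is a minimal sketch using only the standard library. ssml_total_seconds is a hypothetical helper, and it assumes every time value is written in seconds with an "s" suffix, as in the sample above (values in "ms" would need extra handling):

```python
import xml.etree.ElementTree as ET


def ssml_total_seconds(ssml):
    """Parse SSML and sum all <break time> and <prosody duration> values."""
    root = ET.fromstring(ssml)  # raises ET.ParseError if the SSML is malformed
    total = 0.0
    for element in root.iter():
        if element.tag == "break":
            total += float(element.get("time", "0s").rstrip("s"))
        elif element.tag == "prosody":
            total += float(element.get("duration", "0s").rstrip("s"))
    return total


sample_ssml = (
    "<speak>"
    '<prosody duration="0.50s">Hello</prosody>'
    '<break time="0.20s"/>'
    '<prosody duration="0.50s">world</prosody>'
    '<break time="0.30s"/>'
    '<prosody duration="0.30s">this</prosody>'
    "</speak>"
)

print(f"{ssml_total_seconds(sample_ssml):.2f}")
```

The total should match the end time of the last word in your transcript. A mismatch usually points to a dropped pause or a parsing slip earlier in the pipeline.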

Step 4: Synthesize the Time-Aligned Audio

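Finally, pass this SSML to the Text-to-Speech API. A WaveNet voice is a reasonable choice here: these voices generally sound more natural than the standard ones, though how well any voice copes with forced timing is worth testing on your own audio: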

from google.cloud import texttospeech

# Initialize the client
client = texttospeech.TextToSpeechClient()

# Set up the input with our SSML
synthesis_input = texttospeech.SynthesisInput(ssml=final_ssml)

# Choose your target voice (adjust name/language_code as needed)
voice = texttospeech.VoiceSelectionParams(
    language_code="en-US",
    name="en-US-Wavenet-D"  # Pick any Wavenet voice you prefer
)

# Configure audio output (MP3 is standard)
audio_config = texttospeech.AudioConfig(
    audio_encoding=texttospeech.AudioEncoding.MP3
)

# Generate the audio
response = client.synthesize_speech(
    input=synthesis_input, voice=voice, audio_config=audio_config
)

# Save the output to a file
with open("timing_matched_audio.mp3", "wb") as out_file:
    out_file.write(response.audio_content)
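To confirm that the result really matches the original runtime, you can measure the synthesized audio. MP3 is awkward to measure with the standard library alone, so this sketch assumes you request AudioEncoding.LINEAR16 instead and that the returned bytes carry a WAV header (my understanding of the API, so verify it on your side). wav_duration_seconds is a hypothetical helper, and the demo builds one second of silence locally in place of a real API response:

```python
import io
import wave


def wav_duration_seconds(wav_bytes):
    """Return the playback length of in-memory WAV data in seconds."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


# Stand-in for response.audio_content: 1 second of 16-bit mono silence at 24 kHz
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav_file:
    wav_file.setnchannels(1)
    wav_file.setsampwidth(2)
    wav_file.setframerate(24000)
    wav_file.writeframes(b"\x00\x00" * 24000)
demo_bytes = buffer.getvalue()

print(wav_duration_seconds(demo_bytes))
```

In practice you would pass response.audio_content to the helper and compare the result against the end time of the last word in the original transcript.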

Key Notes & Troubleshooting

Timing Precision: Speech-to-Text reports word time offsets in 100 ms increments, so each word boundary can be off by up to about 0.1 seconds. Expect the synthesized audio to track the original closely rather than exactly, and compare total runtimes after synthesis, since small per-word differences can add up over a long recording.
Punctuation Handling: If your original transcript includes punctuation (like commas or periods), cross-reference the full transcript with the word list to insert appropriate <break> tags for natural pauses. Keep in mind that any break you add beyond the measured pauses lengthens the total runtime.
Long Silences: For extended silences (e.g., 2+ seconds), the <break> tag still applies. Pass the full silence duration as the time value, then listen to the result to confirm that the pause came through at full length.
Voice Selection: WaveNet voices are a good default here because they generally sound more natural than standard voices. You can browse all available voices in the Google Cloud Console or via the API’s voice list endpoint (list_voices in the Python client).

The question behind this post comes from Stack Exchange, asked by CtrlAltSkills.