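How can I compare index values in Python without writing repeated if statements?
A Python approach to generating subtitles from IBM Watson transcription data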
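Hey, let me help you get subtitles out of that IBM Watson transcription data! The list you have, with each word plus its start and end timestamps, is exactly what you need to stitch together subtitle segments that read naturally. Here are a few practical ideas and code examples:
Merging adjacent words into subtitle blocks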
Subtitles don't show one word at a time, so you can set a duration or word-count threshold and merge consecutive words into suitably sized blocks. For example, cap each subtitle at 2 seconds and 8 words:

    def generate_subtitles(transcript_section, max_duration=2.0, max_words=8):
        subtitles = []
        current_words = []
        current_start = None
        current_end = None

        for word, start, end in transcript_section:
            if not current_words:
                # First word of a new block
                current_start = start
                current_end = end
                current_words.append(word)
            else:
                duration = end - current_start
                # Check whether adding this word stays within the duration and word limits
                if duration <= max_duration and len(current_words) < max_words:
                    current_words.append(word)
                    current_end = end
                else:
                    # Emit the finished subtitle
                    subtitles.append({
                        'start': current_start,
                        'end': current_end,
                        'text': ' '.join(current_words)
                    })
                    # Start a new block with the current word
                    current_words = [word]
                    current_start = start
                    current_end = end

        # Flush the last group of words
        if current_words:
            subtitles.append({
                'start': current_start,
                'end': current_end,
                'text': ' '.join(current_words)
            })

        return subtitles

    # Example call
    section = [['for', 5.77, 5.92], ['example', 5.93, 6.21], ['this', 6.22, 6.35],
               ['is', 6.36, 6.42], ['a', 6.43, 6.48], ['test', 6.49, 6.75]]
    subtitles = generate_subtitles(section)
    for sub in subtitles:
        print(f"[{sub['start']:.2f} - {sub['end']:.2f}] {sub['text']}")

Exporting to the standard SRT subtitle format
Once you have the subtitle blocks, you can convert them to SRT, which most players understand:

    def format_time(seconds):
        # Convert seconds to the SRT timestamp format HH:MM:SS,mmm.
        # Working in whole milliseconds avoids rounding up to "60,000" seconds.
        total_ms = int(round(seconds * 1000))
        hours, rem = divmod(total_ms, 3_600_000)
        minutes, rem = divmod(rem, 60_000)
        secs, ms = divmod(rem, 1000)
        return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

    def subtitles_to_srt(subtitles):
        entries = []
        for idx, sub in enumerate(subtitles, 1):
            start_time = format_time(sub['start'])
            end_time = format_time(sub['end'])
            entries.append(f"{idx}\n{start_time} --> {end_time}\n{sub['text']}\n\n")
        return "".join(entries)

    # Example call: save the result as a local SRT file
    srt = subtitles_to_srt(subtitles)
    with open('output.srt', 'w', encoding='utf-8') as f:
        f.write(srt)

Keeping each subtitle semantically complete
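To avoid splitting a complete phrase across two subtitles: if Watson's full response includes phrase or punctuation information, split on those markers first. If all you have is individual words, you can run a lightweight NLP tool (spaCy, for example) to do some simple syntactic analysis, which helps keep each subtitle a complete semantic unit.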
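As a rough illustration of the punctuation-first idea, here is a minimal sketch. It assumes (this is not confirmed by the question) that some words in your list carry trailing sentence punctuation such as "works."; if your data has no punctuation, this reduces to a plain word-count split. The names `split_on_punctuation`, `SENTENCE_END`, and `_make_block` are made up for this example.

```python
# Hypothetical helper: closes a subtitle block at sentence-ending punctuation,
# or when the block reaches max_words. Assumes punctuation is attached to words.
SENTENCE_END = ('.', '?', '!')

def _make_block(items):
    # items is a non-empty list of (word, start, end) tuples
    return {
        'start': items[0][1],
        'end': items[-1][2],
        'text': ' '.join(word for word, _, _ in items),
    }

def split_on_punctuation(words, max_words=8):
    blocks = []
    current = []
    for word, start, end in words:
        current.append((word, start, end))
        # Close the block on a sentence boundary or when it gets too long
        if word.endswith(SENTENCE_END) or len(current) >= max_words:
            blocks.append(_make_block(current))
            current = []
    # Flush any trailing words that never hit a boundary
    if current:
        blocks.append(_make_block(current))
    return blocks

# Example call with made-up, punctuated data
punctuated = [['for', 5.77, 5.92], ['example,', 5.93, 6.21], ['this', 6.22, 6.35],
              ['works.', 6.36, 6.75], ['Next', 7.10, 7.30], ['one', 7.31, 7.50]]
for block in split_on_punctuation(punctuated):
    print(f"[{block['start']:.2f} - {block['end']:.2f}] {block['text']}")
```

The returned dictionaries use the same 'start', 'end', and 'text' keys as the blocks above, so they can be passed to the SRT export step unchanged. You would probably still want a duration cap on top of this for long sentences.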
The question originally comes from Stack Exchange, asked by Brendan Carlin.




