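How do I parse an .srt subtitle file containing multi-line text with Scanner in Swift?
Guide to parsing SRT files with multi-line text
Let me take a look at this SRT parsing problem. You are using Swift's Scanner to parse an SRT file whose subtitle text spans multiple lines, and the core issue is that the original logic only handles a single line of text, while your subtitles actually run across several lines. First, a recap of your setup: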
Your SRT file content

    1
    00:00:00,000 --> 00:00:01,000
    This is the first line:
    and it has a secondary line,
    it may have more lines

    2
    00:00:01,000 --> 00:00:02,000
    This is the second line
    it may have more lines

    3
    00:00:02,000 --> 00:00:03,000
    This is the last line
    and it has a secondary line too,
    it may have more lines
Your existing parsing code

    var indexString: NSString?
    scanner.scanUpToCharacters(from: CharacterSet.newlines, into: &indexString)

    var startTimeString: NSString?
    scanner.scanUpTo(" --> ", into: &startTimeString)
    scanner.scanString("-->", into: nil)

    var endTimeString: NSString?
    scanner.scanUpToCharacters(from: CharacterSet.newlines, into: &endTimeString)

    var textString: NSString?
    scanner.scanUpTo("\n", into: &textString)

    if textString != nil {
        textString = (textString?.replacingOccurrences(of: "\r\n", with: " "))! as NSString
        textString = (textString?.trimmingCharacters(in: CharacterSet.whitespaces))! as NSString
    }
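Problem analysis
In the original code, scanner.scanUpTo("\n", into: &textString) stops at the first line break. In the SRT format, however, the text of a single subtitle can span several lines and only ends at a blank line or at **the next subtitle's index (a line that starts with a number)**. The text-scanning logic therefore has to change so that it collects every line belonging to the current subtitle.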
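To see the difference in isolation, here is a minimal sketch. It assumes "\n" line endings and the String-returning Scanner methods available on macOS 10.15 / iOS 13 and later; cueText is a made-up sample for illustration, not your real file.

```swift
import Foundation

// Made-up sample: the text block of one subtitle, followed by the blank line that ends it.
let cueText = "This is the second line\nit may have more lines\n\n"

// Same idea as the original code: scanning up to "\n" stops at the first line break.
let lineScanner = Scanner(string: cueText)
lineScanner.charactersToBeSkipped = nil
let firstLineOnly = lineScanner.scanUpToString("\n") ?? ""
print(firstLineOnly) // This is the second line

// Scanning up to the blank line ("\n\n") captures the whole text block instead.
let blockScanner = Scanner(string: cueText)
blockScanner.charactersToBeSkipped = nil
let wholeBlock = blockScanner.scanUpToString("\n\n") ?? ""
print(wholeBlock.replacingOccurrences(of: "\n", with: " "))
// This is the second line it may have more lines
```

Scanning up to "\n\n" only works when every subtitle is followed by a blank line and the file uses "\n" line endings, which is why a line-by-line loop is the more forgiving approach.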
Revised parsing code
Below is the complete revised parsing function, which handles multi-line subtitle text correctly:

    import Foundation

    // Uses the String-returning Scanner methods (macOS 10.15 / iOS 13 and later).
    func parseSRT(_ content: String) -> [(index: Int, startTime: String, endTime: String, text: String)] {
        var subtitles = [(index: Int, startTime: String, endTime: String, text: String)]()
        let scanner = Scanner(string: content)
        scanner.charactersToBeSkipped = nil // no automatic whitespace skipping, so line breaks and blank lines stay visible

        // Reads the rest of the current line, then consumes exactly one line terminator.
        // With charactersToBeSkipped = nil the scanner would otherwise stay stuck on the newline.
        func scanLine() -> String {
            let line = scanner.scanUpToCharacters(from: .newlines) ?? ""
            _ = scanner.scanCharacter() // "\n" and "\r\n" are each a single Character
            return line.trimmingCharacters(in: .whitespaces)
        }

        while !scanner.isAtEnd {
            // Skip the blank lines that separate subtitles
            _ = scanner.scanCharacters(from: .whitespacesAndNewlines)
            if scanner.isAtEnd { break }

            // 1. Subtitle index
            guard let index = Int(scanLine()) else {
                continue // not a valid index, skip this line
            }

            // 2. Timestamp line: read the whole line, then split it on " --> "
            let timeLine = scanLine()
            guard let timeSeparatorRange = timeLine.range(of: " --> ") else {
                continue // malformed timestamp line, skip it
            }
            let startTime = String(timeLine[..<timeSeparatorRange.lowerBound]).trimmingCharacters(in: .whitespaces)
            let endTime = String(timeLine[timeSeparatorRange.upperBound...]).trimmingCharacters(in: .whitespaces)

            // 3. Text lines, up to a blank line or the next subtitle's index
            var textLines = [String]()
            while !scanner.isAtEnd {
                let lineStart = scanner.currentIndex
                let line = scanLine()
                if line.isEmpty { break } // a blank line ends this subtitle

                // A bare number followed by a timestamp line is the next subtitle's index:
                // rewind so the next pass of the outer loop can parse it.
                if Int(line) != nil {
                    let afterNumber = scanner.currentIndex
                    if scanLine().contains("-->") {
                        scanner.currentIndex = lineStart
                        break
                    }
                    scanner.currentIndex = afterNumber // just text that happens to be a number
                }
                textLines.append(line)
            }

            // Join the lines with a space (use "\n" instead if you want to keep the line breaks)
            let subtitleText = textLines.joined(separator: " ")
            subtitles.append((index: index, startTime: startTime, endTime: endTime, text: subtitleText))
        }
        return subtitles
    }

    // Test case: your SRT content
    let testSRT = """
    1
    00:00:00,000 --> 00:00:01,000
    This is the first line:
    and it has a secondary line,
    it may have more lines

    2
    00:00:01,000 --> 00:00:02,000
    This is the second line
    it may have more lines

    3
    00:00:02,000 --> 00:00:03,000
    This is the last line
    and it has a secondary line too,
    it may have more lines
    """

    // Call the parser and print the result
    let parsedResult = parseSRT(testSRT)
    for subtitle in parsedResult {
        print("Index: \(subtitle.index)")
        print("Start time: \(subtitle.startTime)")
        print("End time: \(subtitle.endTime)")
        print("Subtitle text: \(subtitle.text)\n")
    }
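The function above returns the start and end times as plain strings. If you need numeric values, for example to compare them with a player's current position, a small helper along these lines could convert them. This is only a sketch: seconds(fromSRTTimestamp:) is a hypothetical name that is not part of the original answer, and it assumes the usual HH:MM:SS,mmm layout with a three-digit millisecond field.

```swift
import Foundation

// Hypothetical helper (not part of the original answer): converts an SRT
// timestamp such as "00:00:01,500" into seconds. Returns nil if malformed.
// Assumes the HH:MM:SS,mmm layout; a "." before the milliseconds is also accepted.
func seconds(fromSRTTimestamp timestamp: String) -> TimeInterval? {
    let fields = timestamp
        .trimmingCharacters(in: .whitespaces)
        .components(separatedBy: CharacterSet(charactersIn: ":,."))
    let numbers = fields.compactMap { Int($0) }
    // Exactly four fields, and every one of them must be an integer.
    guard fields.count == 4, numbers.count == 4 else { return nil }

    let wholeSeconds = numbers[0] * 3600 + numbers[1] * 60 + numbers[2]
    return TimeInterval(wholeSeconds) + TimeInterval(numbers[3]) / 1000
}

let start = seconds(fromSRTTimestamp: "00:00:01,500")
let later = seconds(fromSRTTimestamp: "01:02:03,250")
let broken = seconds(fromSRTTimestamp: "not a timestamp")
print(start as Any, later as Any, broken as Any)
```

Returning an optional keeps the behavior in line with the parser, which skips malformed entries instead of crashing on them.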
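Key improvements
- Automatic whitespace skipping is turned off: setting scanner.charactersToBeSkipped = nil stops Scanner from silently ignoring line breaks and spaces, so the blank lines and line breaks in the SRT file can be handled precisely.
- Separator blank lines are handled: before each subtitle is parsed, any blank lines in front of it are skipped, so a blank line is never mistaken for an index or for text.
- Multi-line text scanning: the text is read line by line. If a line turns out to be the next subtitle's index (a numeric line), the scan position is rewound and the loop stops. If the line is blank, the loop stops as well. All non-empty lines are joined into the complete subtitle text.
- More reliable timestamp parsing: the whole timestamp line is read in one go and then split on the " --> " separator into start and end times, which is more robust than the original step-by-step scanning.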
The question comes from Stack Exchange; asked by Rama Mahapatra.




