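How do I parse an .srt subtitle file with multi-line text using Scanner in Swift?

Guide to parsing SRT files with multi-line text

Let me take a look at this SRT parsing problem. You are using Swift's Scanner to parse an SRT file whose subtitle text spans multiple lines, and the core issue is that the original logic handles only a single line of text per cue, while your actual subtitles run over several lines. First, the scenario:

Your SRT file content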

1
00:00:00,000 --> 00:00:01,000
This is the first line:
and it has a secondary line,
it may have more lines
2
00:00:01,000 --> 00:00:02,000
This is the second line
it may have more lines
3
00:00:02,000 --> 00:00:03,000
This is the last line
and it has a secondary line too,
it may have more lines

Your existing parsing code

var indexString: NSString?
scanner.scanUpToCharacters(from: CharacterSet.newlines, into: &indexString)
var startTimeString: NSString?
scanner.scanUpTo(" --> ", into: &startTimeString)
scanner.scanString("-->", into: nil)
var endTimeString: NSString?
scanner.scanUpToCharacters(from: CharacterSet.newlines, into: &endTimeString)
var textString: NSString?
scanner.scanUpTo("\n", into: &textString)
if textString != nil {
    textString = (textString?.replacingOccurrences(of: "\r\n", with: " "))! as NSString
    textString = (textString?.trimmingCharacters(in: CharacterSet.whitespaces))! as NSString
}

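Problem analysis

In the original code, scanner.scanUpTo("\n", into: &textString) stops at the first line break. In the SRT format, however, the text of a single cue can span several lines and ends only at a blank line or at the next cue's index (a line that starts with a number). The text-scanning logic therefore has to be changed so that it collects every line belonging to the current cue.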
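The single-line behaviour described above can be reproduced in isolation. The following is a minimal sketch, assuming the newer Scanner API (scanUpToString(_:), available on macOS 10.15 / iOS 13 and later) rather than the NSString-based calls used in the question; the variable names are illustrative only.

```swift
import Foundation

// Text of the first cue from the sample file, three lines long.
let cueText = "This is the first line:\nand it has a secondary line,\nit may have more lines"

let demoScanner = Scanner(string: cueText)
demoScanner.charactersToBeSkipped = nil

// A single scan up to "\n" returns only the first line of the cue.
let firstLineOnly = demoScanner.scanUpToString("\n")

// The scanner has not reached the end: the remaining two lines are still unread.
let hasUnreadText = !demoScanner.isAtEnd

print(firstLineOnly ?? "nil", hasUnreadText)
```

Everything after the first line break is left in the scanner, which is why the text has to be read in a loop.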

Revised parsing code

Below is the adjusted, complete parsing function, which handles multi-line subtitle text correctly:

import Foundation

func parseSRT(_ content: String) -> [(index: Int, startTime: String, endTime: String, text: String)] {
    var subtitles = [(index: Int, startTime: String, endTime: String, text: String)]()
    // Normalize Windows line endings so that every line ends in a single "\n"
    let scanner = Scanner(string: content.replacingOccurrences(of: "\r\n", with: "\n"))
    scanner.charactersToBeSkipped = nil // disable automatic whitespace skipping so line breaks and blank lines stay visible

    // Reads the rest of the current line and consumes its trailing "\n".
    // Uses the String-returning Scanner API (macOS 10.15 / iOS 13 or later).
    func scanLine() -> String {
        let line = scanner.scanUpToCharacters(from: .newlines) ?? ""
        _ = scanner.scanString("\n")
        return line.trimmingCharacters(in: .whitespaces)
    }

    while !scanner.isAtEnd {
        // Skip any blank separator lines between cues
        _ = scanner.scanCharacters(from: .whitespacesAndNewlines)
        if scanner.isAtEnd { break }

        // 1. Scan the cue index; if it is not a number, skip this line
        guard let index = Int(scanLine()) else { continue }

        // 2. Scan the timestamp line as a whole, then split it at the arrow
        let timeLine = scanLine()
        guard let separatorRange = timeLine.range(of: "-->") else {
            continue // malformed timestamp line, skip it
        }
        let startTime = String(timeLine[..<separatorRange.lowerBound]).trimmingCharacters(in: .whitespaces)
        let endTime = String(timeLine[separatorRange.upperBound...]).trimmingCharacters(in: .whitespaces)

        // 3. Scan the text lines until a blank line or the next cue's index
        var textLines = [String]()
        while !scanner.isAtEnd {
            let lineStart = scanner.currentIndex
            let line = scanLine()
            if line.isEmpty { break } // a blank line ends the current cue

            if Int(line) != nil {
                // A numeric line followed by a timestamp line starts the next cue:
                // rewind so the outer loop can parse it as an index
                let afterNumber = scanner.currentIndex
                if scanLine().contains("-->") {
                    scanner.currentIndex = lineStart
                    break
                }
                scanner.currentIndex = afterNumber // just a number inside the text
            }
            textLines.append(line)
        }

        // Join the lines into one string (use "\n" to keep the line breaks, or " " to flatten them, as needed)
        let subtitleText = textLines.joined(separator: " ")
        subtitles.append((index: index, startTime: startTime, endTime: endTime, text: subtitleText))
    }

    return subtitles
}

// Test case: pass in your SRT content
let testSRT = """
1
00:00:00,000 --> 00:00:01,000
This is the first line:
and it has a secondary line,
it may have more lines
2
00:00:01,000 --> 00:00:02,000
This is the second line
it may have more lines
3
00:00:02,000 --> 00:00:03,000
This is the last line
and it has a secondary line too,
it may have more lines
"""

// Call the parsing function and print the result
let parsedResult = parseSRT(testSRT)
for subtitle in parsedResult {
    print("Index: \(subtitle.index)")
    print("Start time: \(subtitle.startTime)")
    print("End time: \(subtitle.endTime)")
    print("Subtitle text: \(subtitle.text)\n")
}
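The startTime and endTime fields in the result are still plain strings. If numeric times are needed, one possible follow-up is a small Scanner-based converter. The sketch below is an illustration only and not part of the original answer: the function name seconds(fromSRTTimestamp:) is hypothetical, it assumes the HH:MM:SS,mmm layout shown in the sample file, and it uses the newer Scanner API (macOS 10.15 / iOS 13 or later).

```swift
import Foundation

// Hypothetical helper: converts an SRT timestamp such as "00:00:01,000"
// into seconds. Returns nil if the string does not match HH:MM:SS,mmm.
func seconds(fromSRTTimestamp timestamp: String) -> TimeInterval? {
    let scanner = Scanner(string: timestamp)
    scanner.charactersToBeSkipped = nil

    guard let hours = scanner.scanInt(), scanner.scanString(":") != nil,
          let minutes = scanner.scanInt(), scanner.scanString(":") != nil,
          let wholeSeconds = scanner.scanInt(), scanner.scanString(",") != nil,
          let milliseconds = scanner.scanInt(), scanner.isAtEnd else {
        return nil
    }
    return TimeInterval(hours * 3600 + minutes * 60 + wholeSeconds)
        + TimeInterval(milliseconds) / 1000
}

print(seconds(fromSRTTimestamp: "00:00:01,000") ?? -1)
```

Returning an optional keeps malformed timestamps from turning into a silent zero.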

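Key improvements

  • Automatic whitespace skipping is disabled: with scanner.charactersToBeSkipped = nil, the Scanner no longer silently ignores line breaks and spaces, so the blank lines and line breaks in the SRT file can be handled precisely.
  • Separator blank lines are handled: before each cue is parsed, any blank lines in front of it are skipped, so they are never mistaken for an index or for text.
  • Multi-line text scanning: the text is read line by line in a loop. If a line turns out to be the next cue's index, the scan position is rewound so the next iteration can parse it; a blank line ends the scan; all non-empty lines are joined into the complete subtitle text.
  • More reliable timestamp parsing: the whole timestamp line is read first and then split at the "-->" separator into start and end times, which is more stable than the original step-by-step scanning.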
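The effect of the first point can be seen in a small side-by-side comparison. This is a minimal sketch with made-up sample text, again assuming the newer String-returning Scanner API; by default a Scanner skips whitespace and newlines before each scan, which hides blank separator lines.

```swift
import Foundation

let sample = "first\n\nsecond"

// Default behaviour: whitespace and newlines are skipped before each scan,
// so the blank line between the two words is never seen.
let skippingScanner = Scanner(string: sample)
let skippedFirst = skippingScanner.scanUpToCharacters(from: .newlines)
let skippedSecond = skippingScanner.scanUpToCharacters(from: .newlines)

// With skipping disabled, the scanner stops on the line break itself,
// so the second scan finds no characters before the next newline.
let exactScanner = Scanner(string: sample)
exactScanner.charactersToBeSkipped = nil
let exactFirst = exactScanner.scanUpToCharacters(from: .newlines)
let exactSecond = exactScanner.scanUpToCharacters(from: .newlines)

print(skippedFirst ?? "nil", skippedSecond ?? "nil", exactFirst ?? "nil", exactSecond ?? "nil")
```

With skipping disabled, the parser has to consume each line break explicitly, and in return it can tell a blank line apart from the end of an ordinary line.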

This question comes from Stack Exchange; it was asked by Rama Mahapatra.
