How to remove extra line breaks from text that Tesseract extracts from PDFs/images
Fixing extra blank lines in Tesseract-extracted text
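Let's sort out the extra blank lines that show up after Tesseract extracts your text!
Your core requirement is clear: drop the meaningless blank lines from the extracted result, but keep the normal line breaks between lines of actual content. Your earlier code failed because it joined the lines with "", which removed every newline and ran all the content together.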
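To make the failure concrete, here is a minimal sketch. The `raw` string is made-up sample text standing in for Tesseract output, and the join expression is only an assumption about what the earlier attempt looked like, since the original code is not shown here.

```python
# Hypothetical sample standing in for Tesseract output (not real OCR data).
raw = "Dear All,\n\n\nTurning the corner\n\n"

# Assumed shape of the earlier attempt: joining with "" drops every line break,
# so the surviving lines are glued together with nothing between them.
squashed = "".join(line for line in raw.splitlines() if line.strip())

print(repr(squashed))  # 'Dear All,Turning the corner'
```

The blank lines are gone, but so is the separator between the two real lines, which is why everything ends up on one line.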
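The correct solution
You only need to filter out the lines that are purely whitespace while keeping a newline between the remaining lines. Try this code: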
cleaned_result = "\n".join(line for line in result.splitlines() if line.strip())
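Code explanation
- result.splitlines(): splits the text returned by Tesseract into lines; each line comes back without its trailing newline character.
- if line.strip(): filters out lines that are empty or contain only whitespace (spaces, tabs, and so on), which are exactly the blank lines you want to remove.
- "\n".join(...): joins the remaining lines with a newline, so the normal line separation is kept and the extra blank lines are gone.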
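The three steps above can be tried on a small string without running OCR at all. This is a minimal sketch: the helper name `remove_blank_lines` and the `sample` text are illustrative assumptions, not part of the original answer or of any Tesseract API.

```python
def remove_blank_lines(text: str) -> str:
    """Drop empty and whitespace-only lines; keep one newline between the rest."""
    return "\n".join(line for line in text.splitlines() if line.strip())


# Made-up sample with empty lines, a whitespace-only line, and Windows-style
# "\r\n" line endings, to mimic messy OCR output.
sample = (
    "Review\n\n\n"
    "Dear All,\n"
    " \t \n"
    "Turning The Corner\r\n\r\n"
    "However, the past year was\n"
)

cleaned = remove_blank_lines(sample)
print(cleaned)
```

Because `splitlines()` recognizes both "\n" and "\r\n", the result uses plain "\n" separators throughout. Note that the joined string has no trailing newline, so add one yourself if you write the text to a file that needs it.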
Before and after
- Raw output extracted by Tesseract:
Review Dear All, Turning The Corner On A Challenge full Year (This is a heading) We are executing and gaining mobile share in our markets in Australia against intense competition. However, the past year was
- Result after processing:
Review Dear All, Turning The Corner On A Challenge full Year (This is a heading) We are executing and gaining mobile share in our markets in Australia against intense competition. However, the past year was
That is exactly the result you were expecting!
The question comes from Stack Exchange and was asked by Rahul jain.




