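Help needed: PyMuPDF extracts two-column PDF resumes out of order (plus code improvements)
Extracting text from two-column PDF resumes: a solution
I wrote PDF resume extraction code with PyMuPDF. Single-column resumes work fine, but for two-column resumes generated by platforms such as Canva the extracted text comes out scrambled and I cannot recover the information correctly. I am looking for a way to fix this.
Example two-column resume: a typical left/right split, with personal details and a skills list on the left and work experience, project experience, and similar content on the right.
Existing code
extract.py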
    import fitz  # PyMuPDF
    import re
    import sys
    from pathlib import Path


    def extract_text_from_pdf(pdf_path: str) -> str:
        doc = fitz.open(pdf_path)
        text = ""
        for page in doc:
            blocks = page.get_text("blocks")  # Retrieve the text blocks
            # Sorting by vertical position to maintain the order
            blocks.sort(key=lambda b: b[1])
            for b in blocks:
                text += b[4].strip() + "\n"  # b[4] Contains the text of the block
            text += "\n"  # Page separator
        return text


    def clean_text(text: str) -> str:
        # Remove multiple spaces but keep line breaks
        text = re.sub(r"[ \t]+", " ", text)  # Multiple spaces → single space
        text = re.sub(r"\n{2,}", "\n\n", text)  # 2+ empty lines → 2 lines
        text = text.strip()
        return text


    def save_text_to_file(text: str, output_path: str):
        with open(output_path, "w", encoding="utf-8") as f:
            f.write(text)
main.py
    import sys
    from pathlib import Path

    from extract import extract_text_from_pdf, clean_text, save_text_to_file


    def main(pdf_path: str):
        raw_text = extract_text_from_pdf(pdf_path)
        cleaned_text = clean_text(raw_text)
        # Write the result next to the input file, with a .txt extension
        output_path = Path(pdf_path).with_suffix(".txt")
        save_text_to_file(cleaned_text, output_path)
        print(f"[OK] Extracted and cleaned text → {output_path}")


    if __name__ == "__main__":
        if len(sys.argv) < 2:
            # This script is main.py; extract.py only holds the helpers
            print("Usage: python main.py path/to/file.pdf")
        else:
            main(sys.argv[1])
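Solution
The core problem is that the existing code sorts text blocks by vertical position only. In a two-column layout, blocks from the left and right columns therefore get concatenated alternately, which scrambles the reading order. The two methods below fix this.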
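To see the interleaving concretely, here is a minimal sketch that runs without a PDF. The tuples are made-up data that mimic the first five fields of what page.get_text("blocks") returns, and naive_order is a hypothetical helper name that reproduces the sort used in the original code.

```python
# Synthetic blocks shaped like (x0, y0, x1, y1, text); the coordinates are
# invented for illustration and do not come from a real resume.
blocks = [
    (40, 100, 200, 120, "Skills"),                      # left column
    (40, 140, 200, 160, "Python, SQL"),                 # left column
    (320, 90, 560, 110, "Work Experience"),             # right column
    (320, 130, 560, 150, "Data Engineer, 2020-2023"),   # right column
]


def naive_order(blocks):
    """Order blocks by their top edge only, as the original code does."""
    return [b[4] for b in sorted(blocks, key=lambda b: b[1])]


print(naive_order(blocks))
# ['Work Experience', 'Skills', 'Data Engineer, 2020-2023', 'Python, SQL']
```

The two columns alternate in the output because their blocks sit at alternating heights on the page.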
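Method 1: split into columns by block coordinates
Use each text block's horizontal position to split the page into a left and a right column, then order the text within each column separately: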
    def extract_text_from_pdf(pdf_path: str) -> str:
        doc = fitz.open(pdf_path)
        text = ""
        for page in doc:
            blocks = page.get_text("blocks")
            # Use the horizontal midpoint of the page as the column threshold
            page_width = page.rect.width
            column_threshold = page_width / 2
            # Split the blocks into left and right columns
            left_blocks = []
            right_blocks = []
            for b in blocks:
                block_x0 = b[0]  # x coordinate of the block's top-left corner
                if block_x0 < column_threshold:
                    left_blocks.append(b)
                else:
                    right_blocks.append(b)
            # Sort each column by vertical position
            left_blocks.sort(key=lambda x: x[1])
            right_blocks.sort(key=lambda x: x[1])
            # Emit the left column first, then the right column
            for b in left_blocks:
                text += b[4].strip() + "\n"
            text += "\n"  # column separator
            for b in right_blocks:
                text += b[4].strip() + "\n"
            text += "\n"  # page separator
        return text
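The column logic above can also be pulled out into a pure function, which makes it testable without a PDF. The sketch below is one possible variant, not part of the original answer: order_two_column_blocks and split_ratio are hypothetical names, and it additionally skips image blocks, relying on the seventh field of a PyMuPDF block tuple being the block type (0 for text, 1 for image). The sample data is invented.

```python
def order_two_column_blocks(blocks, page_width, split_ratio=0.5):
    """Return block texts in reading order: left column first, then right.

    `blocks` holds tuples shaped like page.get_text("blocks") output:
    (x0, y0, x1, y1, text, block_no, block_type).
    """
    threshold = page_width * split_ratio
    left, right = [], []
    for b in blocks:
        if b[6] != 0 or not b[4].strip():
            continue  # skip image blocks (block_type 1) and empty text
        (left if b[0] < threshold else right).append(b)

    def top_then_left(b):
        return (b[1], b[0])

    ordered = sorted(left, key=top_then_left) + sorted(right, key=top_then_left)
    return [b[4].strip() for b in ordered]


# Invented sample page: two text blocks per column plus one image block
sample = [
    (320, 90, 560, 110, "Work Experience\n", 0, 0),
    (40, 100, 200, 120, "Skills\n", 1, 0),
    (40, 20, 120, 80, "<image: DeviceRGB, width: 80, height: 60, bpc: 8>", 2, 1),
    (320, 130, 560, 150, "Data Engineer\n", 3, 0),
    (40, 140, 200, 160, "Python, SQL\n", 4, 0),
]
print(order_two_column_blocks(sample, page_width=595))
# ['Skills', 'Python, SQL', 'Work Experience', 'Data Engineer']
```

Inside extract_text_from_pdf, the same function could be called with page.get_text("blocks") and page.rect.width, joining the returned strings with newlines.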
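Method 2: fine-grained sorting at word level
If the block segmentation is unreliable, use page.get_text("words") to get the coordinates of every word, group the words into lines, sort the words within each line by horizontal position, and join them back into text. Grouping by y coordinate alone would merge words from both columns that sit at the same height into one line, so the words also need to be split by column, as in Method 1, before they are grouped: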
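    def extract_text_from_pdf(pdf_path: str) -> str:
        doc = fitz.open(pdf_path)
        text = ""
        for page in doc:
            # Each item: (x0, y0, x1, y1, word, block_no, line_no, word_no)
            words = page.get_text("words")
            column_threshold = page.rect.width / 2
            # Split the words by column first; otherwise words from both
            # columns that share a y coordinate end up on the same line
            columns = ([], [])
            for word in words:
                columns[0 if word[0] < column_threshold else 1].append(word)
            for column_words in columns:
                # Group by y coordinate: words on one line have close y values
                lines = {}
                for word in column_words:
                    y_coord = round(word[1], 1)  # 1 decimal place absorbs float noise
                    lines.setdefault(y_coord, []).append(word)
                # Sort lines top to bottom, then words left to right in each line
                for y, line_words in sorted(lines.items(), key=lambda x: x[0]):
                    line_words.sort(key=lambda w: w[0])
                    text += " ".join(w[4] for w in line_words) + "\n"
                text += "\n"  # column separator
            text += "\n"  # page separator
        return text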
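One fragile spot in the word-level approach is the rounding of y coordinates: two words on the same visual line can round to different keys (100.04 rounds to 100.0, while 100.06 rounds to 100.1). A possible alternative, sketched here under assumptions, is to group by a tolerance instead. group_words_into_lines and y_tolerance are hypothetical names, the 3-point default is a guess that would need tuning per font size, and the function is meant to be applied to the words of one column at a time.

```python
def group_words_into_lines(words, y_tolerance=3.0):
    """Group word tuples (x0, y0, x1, y1, text, ...) into lines of text.

    A word joins the current line when its top edge lies within
    `y_tolerance` points of the first word of that line.
    """
    lines = []  # each entry is [reference_y, [words on that line]]
    for w in sorted(words, key=lambda w: (w[1], w[0])):
        if lines and abs(w[1] - lines[-1][0]) <= y_tolerance:
            lines[-1][1].append(w)
        else:
            lines.append([w[1], [w]])
    # Within each line, order the words left to right and join with spaces
    return [
        " ".join(w[4] for w in sorted(line_words, key=lambda w: w[0]))
        for _, line_words in lines
    ]


# Invented words: the first two sit on one visual line with slightly
# different y0 values, the third is on the next line.
words = [
    (90, 100.06, 130, 112, "Engineer"),
    (40, 100.04, 80, 112, "Data"),
    (40, 120.0, 70, 132, "2020-2023"),
]
print(group_words_into_lines(words))
# ['Data Engineer', '2020-2023']
```

The tolerance trades one failure mode for another: too small a value splits a line, too large a value merges tightly spaced lines.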
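Notes
- Method 1 suits regular two-column layouts. If the column boundary is not exactly centered, adjust the column_threshold ratio to match the actual resume (for example 0.45 or 0.55 of the page width).
- Method 2 is more robust and suits complex layouts, but it is slightly slower, and you should check that joining words with single spaces matches the original formatting.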
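Instead of hand-tuning the ratio, the threshold can be estimated from the blocks themselves. The sketch below is a simple heuristic and not part of the original answer: estimate_column_threshold is a hypothetical name, and it assumes the widest gap between distinct left edges (x0 values) marks the jump from the left column to the right one. It fits the x0-based classification used in Method 1 and can misfire on pages with heavy indentation or a single column.

```python
def estimate_column_threshold(blocks, page_width):
    """Guess an x threshold that separates left-column from right-column blocks.

    Takes the widest gap between consecutive distinct left edges (x0) and
    returns its midpoint. Falls back to the page center when fewer than two
    distinct left edges exist.
    """
    xs = sorted({round(b[0]) for b in blocks})
    if len(xs) < 2:
        return page_width / 2
    # Each candidate is (gap width, left edge, right edge); keep the widest gap
    _, left_x, right_x = max((b - a, a, b) for a, b in zip(xs, xs[1:]))
    return (left_x + right_x) / 2


# Invented blocks: a narrow sidebar starting at x=40 (one indented bullet at
# x=55) and a main column starting at x=250.
sample_blocks = [
    (40, 100, 230, 120, "Skills"),
    (55, 125, 230, 140, "Python"),
    (40, 150, 230, 170, "Languages"),
    (250, 90, 560, 110, "Work Experience"),
    (250, 130, 560, 150, "Data Engineer"),
]
threshold = estimate_column_threshold(sample_blocks, page_width=595)
print(threshold)  # 152.5
```

The returned value can replace page_width / 2 as column_threshold; dividing it by the page width gives the equivalent ratio.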
The question originally comes from Stack Exchange; asked by Nasser 23.




