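Help needed: PyMuPDF scrambles text order when extracting two-column PDF resumes, plus code improvements

Text extraction for two-column PDF resumes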

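I wrote a resume extractor with PyMuPDF. Single-column resumes are extracted correctly, but two-column resumes generated by platforms such as Canva come out scrambled, so the information cannot be recovered reliably. I am looking for a way to fix this.

Example two-column resume: a typical left/right layout, with personal details and a skills list on the left and work experience, projects, and similar content on the right.

Existing code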

extract.py

import fitz  # PyMuPDF
import re
import sys
from pathlib import Path

def extract_text_from_pdf(pdf_path: str) -> str:
    doc = fitz.open(pdf_path)
    text = ""
    for page in doc:
        blocks = page.get_text("blocks")  # Retrieve the text blocks
        # Sorting by vertical position to maintain the order
        blocks.sort(key=lambda b: b[1])
        for b in blocks:
            text += b[4].strip() + "\n"  # b[4] Contains the text of the block
        text += "\n"  # Page separator
    return text


def clean_text(text: str) -> str:
    # Collapse runs of spaces and tabs, but keep line breaks
    text = re.sub(r"[ \t]+", " ", text)      # Runs of spaces/tabs → single space
    text = re.sub(r"\n{2,}", "\n\n", text)  # 2+ newlines → one blank line
    text = text.strip()
    return text

def save_text_to_file(text: str, output_path: str):
    with open(output_path, "w", encoding="utf-8") as f:
        f.write(text)

main.py

import sys
from pathlib import Path

from extract import extract_text_from_pdf, clean_text, save_text_to_file

def main(pdf_path: str):
    raw_text = extract_text_from_pdf(pdf_path)
    cleaned_text = clean_text(raw_text)

    output_path = Path(pdf_path).with_suffix(".txt")
    save_text_to_file(cleaned_text, output_path)

    print(f"[OK] Extracted and cleaned text → {output_path}")

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python main.py path/to/file.pdf")
    else:
        main(sys.argv[1])

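Solution

The root cause is that the existing code sorts text blocks by vertical position only. In a two-column layout, blocks from the left and right columns are then concatenated alternately, which scrambles the reading order. Two approaches can fix this: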
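
To make the interleaving concrete, here is a minimal sketch that needs no PDF at all. The block tuples follow the layout returned by page.get_text("blocks"), but the coordinates, the text, the threshold of 300, and the helper names order_by_y and order_by_column are all made up for illustration.

```python
# Synthetic blocks in the "blocks" tuple layout:
# (x0, y0, x1, y1, text, block_no, block_type). All values are invented.
blocks = [
    (40, 100, 200, 120, "Contact", 0, 0),
    (320, 90, 560, 110, "Work Experience", 1, 0),
    (40, 130, 200, 150, "Skills", 2, 0),
    (320, 125, 560, 145, "Projects", 3, 0),
]

def order_by_y(blocks):
    """Reproduce the original ordering: sort on y0 only."""
    return [b[4] for b in sorted(blocks, key=lambda b: b[1])]

def order_by_column(blocks, threshold):
    """Sort on (column, y0) so the left column is emitted before the right."""
    return [b[4] for b in sorted(blocks, key=lambda b: (b[0] >= threshold, b[1]))]

print(order_by_y(blocks))                      # left and right entries alternate
print(order_by_column(blocks, threshold=300))  # left column first, then right
```

Sorting on y0 alone alternates between the two columns, while adding the column as the first sort key keeps each column together.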

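Method 1: Split blocks into columns by coordinate, then sort

Use each block's horizontal position to assign it to the left or right column, then order the text within each column separately: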

import fitz  # PyMuPDF

def extract_text_from_pdf(pdf_path: str) -> str:
    doc = fitz.open(pdf_path)
    text = ""
    for page in doc:
        blocks = page.get_text("blocks")
        # Use the horizontal midpoint of the page as the column threshold
        page_width = page.rect.width
        column_threshold = page_width / 2

        # Split the blocks into left and right columns
        left_blocks = []
        right_blocks = []
        for b in blocks:
            block_x0 = b[0]  # x coordinate of the block's left edge
            if block_x0 < column_threshold:
                left_blocks.append(b)
            else:
                right_blocks.append(b)

        # Sort each column by vertical position
        left_blocks.sort(key=lambda x: x[1])
        right_blocks.sort(key=lambda x: x[1])

        # Emit the left column first, then the right column
        for b in left_blocks:
            text += b[4].strip() + "\n"
        text += "\n"  # column separator
        for b in right_blocks:
            text += b[4].strip() + "\n"
        text += "\n"  # page separator
    return text
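
Two cases are not covered by a plain left/right split: image blocks (block type 1 in the "blocks" tuples) and blocks that span both columns, such as a name banner at the top of a resume. The sketch below is one possible way to handle them. The helper name split_columns, the 0.6 width cutoff, and the choice to emit full-width blocks first are assumptions for illustration, not part of the original answer.

```python
def split_columns(blocks, page_width, threshold_ratio=0.5, full_width_ratio=0.6):
    """Split block tuples into (full_width, left, right), each sorted top to bottom.

    blocks use the layout (x0, y0, x1, y1, text, block_no, block_type).
    full_width_ratio is an assumed cutoff: a block that starts in the left
    half and is wider than this fraction of the page counts as spanning both columns.
    """
    threshold = page_width * threshold_ratio
    full_width, left, right = [], [], []
    for b in blocks:
        x0, y0, x1, y1, text, block_no, block_type = b[:7]
        if block_type != 0 or not text.strip():
            continue  # skip image blocks and blocks without text
        if x0 < threshold and (x1 - x0) > full_width_ratio * page_width:
            full_width.append(b)
        elif x0 < threshold:
            left.append(b)
        else:
            right.append(b)
    for group in (full_width, left, right):
        group.sort(key=lambda b: b[1])  # top to bottom within each group
    return full_width, left, right


# Usage with invented coordinates on a 600 pt wide page
sample = [
    (40, 30, 560, 60, "Jane Doe\n", 0, 0),        # banner spanning both columns
    (40, 100, 200, 120, "Skills\n", 1, 0),        # left column
    (320, 90, 560, 110, "Experience\n", 2, 0),    # right column
    (40, 200, 200, 300, "<image>", 3, 1),         # image block, ignored
]
header, left, right = split_columns(sample, page_width=600)
```

Emitting the full-width group first works for a top banner; a full-width footer would need to be placed by its y position instead.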

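Method 2: Fine-grained sorting at word level

If the block boundaries are unreliable, use page.get_text("words") to get the coordinates of every word, group the words into lines, sort the words in each line by horizontal position, and join them into text. Note that grouping by y coordinate alone would merge a left-column line and a right-column line at the same height into one output line, so the words also need to be separated by column: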

import fitz  # PyMuPDF

def extract_text_from_pdf(pdf_path: str) -> str:
    doc = fitz.open(pdf_path)
    text = ""
    for page in doc:
        # Each item: (x0, y0, x1, y1, word, block_no, line_no, word_no)
        words = page.get_text("words")
        column_threshold = page.rect.width / 2

        # Group words into lines keyed by (column, y). Without the column key,
        # lines at the same height in both columns would be merged into one.
        lines = {}
        for word in words:
            column = 0 if word[0] < column_threshold else 1
            y_coord = round(word[1], 1)  # round to 1 decimal to absorb float noise
            lines.setdefault((column, y_coord), []).append(word)

        # Order lines by column, then top to bottom; sort words left to right
        for _, line_words in sorted(lines.items(), key=lambda item: item[0]):
            line_words.sort(key=lambda w: w[0])
            text += " ".join(w[4] for w in line_words) + "\n"
        text += "\n"  # page separator
    return text
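
Rounding y0 to one decimal place can still split a visual line when two words sit a fraction of a point apart (for example, because of different font sizes). A tolerance-based grouping is one alternative. The sketch below assumes the words of a single column are passed in; the helper name group_words_into_lines and the default tolerance of 3 points are illustrative assumptions that would need tuning on real resumes.

```python
def group_words_into_lines(words, y_tolerance=3.0):
    """Group word tuples (x0, y0, x1, y1, text, ...) into lines of text.

    A word joins the current line if its y0 is within y_tolerance of the
    y0 of the first word in that line; otherwise it starts a new line.
    """
    lines = []  # each entry: [reference_y0, [word tuples]]
    for w in sorted(words, key=lambda w: (w[1], w[0])):
        if lines and abs(w[1] - lines[-1][0]) <= y_tolerance:
            lines[-1][1].append(w)
        else:
            lines.append([w[1], [w]])
    # Within each line, order words left to right before joining
    return [" ".join(x[4] for x in sorted(ws, key=lambda w: w[0])) for _, ws in lines]


# Invented words: the first two are on the same visual line but their y0
# values round to different keys (100.0 and 100.1).
sample_words = [
    (45, 100.06, 80, 112, "developer", 0, 0, 1),
    (10, 100.04, 40, 112, "Python", 0, 0, 0),
    (10, 120.0, 30, 132, "SQL", 0, 1, 0),
]
result = group_words_into_lines(sample_words)
```

The trade-off is that a tolerance that is too large merges tightly spaced lines, so it should stay below the smallest line spacing in the document.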

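Notes

  • Method 1 suits regular two-column layouts. If the column divider is not exactly centered, adjust the ratio used for column_threshold to match the actual resume (for example 0.45 or 0.55 of the page width).
  • Method 2 does not depend on how PyMuPDF groups text into blocks, so it copes better with complex layouts, but it is somewhat slower. Check that joining words with single spaces matches the spacing of the original text.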
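
Instead of hand-tuning the ratio per resume, the threshold could be estimated from the blocks themselves by looking for the widest gap between block left edges. This is only a heuristic sketch: the helper name estimate_column_threshold and the min_gap_ratio default of 0.2 are assumptions, and it returns None when no gap is wide enough to suggest a second column.

```python
def estimate_column_threshold(blocks, page_width, min_gap_ratio=0.2):
    """Guess the x position that separates two columns.

    Looks at the left edge (x0) of every block and returns the midpoint of
    the widest gap between consecutive edges, or None if that gap is smaller
    than min_gap_ratio * page_width (likely a single-column page).
    """
    xs = sorted(b[0] for b in blocks)
    best_gap, best_mid = 0.0, None
    for a, b in zip(xs, xs[1:]):
        if b - a > best_gap:
            best_gap, best_mid = b - a, (a + b) / 2
    if best_gap < min_gap_ratio * page_width:
        return None
    return best_mid


# Invented left edges: a left column near x=40..55 and a right column at x=320
two_col = [(40, 0), (55, 20), (40, 40), (320, 0), (320, 30)]
one_col = [(40, 0), (55, 20), (40, 40)]
threshold = estimate_column_threshold(two_col, page_width=600)
single = estimate_column_threshold(one_col, page_width=600)
```

The returned value is an absolute x coordinate, so it can be compared against a block's x0 in the same way as column_threshold in Method 1; when it is None, the page can be processed as a single column.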

This question originates from Stack Exchange; it was asked by Nasser 23.
