PyMuPDF extracts two-column PDF resumes out of order: request for help and code optimization

阿华AIGC实验室

2026-6-12

Text extraction for two-column PDF resumes

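I wrote PDF resume extraction code with PyMuPDF. It handles single-column resumes correctly, but for two-column resumes generated by platforms such as Canva the extracted text comes out scrambled and the information cannot be recovered correctly. I am looking for a fix.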

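Example two-column resume: a typical left/right layout, with basic personal information and a skills list on the left, and work experience, project experience, and similar content on the right.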

Existing code

extract.py

import fitz  # PyMuPDF
import re

def extract_text_from_pdf(pdf_path: str) -> str:
    doc = fitz.open(pdf_path)
    text = ""
    for page in doc:
        blocks = page.get_text("blocks")  # Retrieve the text blocks
        # Sorting by vertical position to maintain the order
        blocks.sort(key=lambda b: b[1])
        for b in blocks:
            text += b[4].strip() + "\n"  # b[4] Contains the text of the block
        text += "\n"  # Page separator
    return text


def clean_text(text: str) -> str:
    # Remove multiple spaces but keep line breaks
    text = re.sub(r"[ \t]+", " ", text)      # Multiple spaces → single space
    text = re.sub(r"\n{2,}", "\n\n", text)  # Runs of newlines → one blank line
    text = text.strip()
    return text

def save_text_to_file(text: str, output_path: str):
    with open(output_path, "w", encoding="utf-8") as f:
        f.write(text)

main.py

import sys
from pathlib import Path

from extract import extract_text_from_pdf, clean_text, save_text_to_file

def main(pdf_path: str):
    raw_text = extract_text_from_pdf(pdf_path)
    cleaned_text = clean_text(raw_text)

    output_path = Path(pdf_path).with_suffix(".txt")
    save_text_to_file(cleaned_text, output_path)

    print(f"[OK] Extracted and cleaned text → {output_path}")

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python main.py path/to/file.pdf")
    else:
        main(sys.argv[1])

Solution

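The root cause is that the existing code sorts text blocks only by vertical position. On a two-column layout, blocks from the left and right columns are therefore interleaved, which scrambles the reading order. The following two methods address this: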
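To make the interleaving concrete, here is a minimal sketch using made-up block tuples in the `(x0, y0, x1, y1, text, block_no, block_type)` layout that `page.get_text("blocks")` returns. The coordinates and the `order_by_top` helper are illustrative assumptions, not data from a real resume.

```python
# Hypothetical blocks: two in the left column (x0 = 40), two in the right (x0 = 320).
blocks = [
    (40, 50, 200, 70, "Contact", 0, 0),
    (40, 90, 200, 110, "Skills", 1, 0),
    (320, 55, 560, 75, "Work experience", 2, 0),
    (320, 95, 560, 115, "Projects", 3, 0),
]


def order_by_top(blocks):
    """Reproduce the original ordering: sort by the top edge (y0) only."""
    return [b[4] for b in sorted(blocks, key=lambda b: b[1])]


print(order_by_top(blocks))
# ['Contact', 'Work experience', 'Skills', 'Projects']
```

The left and right columns alternate in the output because their blocks sit at nearly the same heights.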

Method 1: Split into columns by block coordinates, then sort

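Use each text block's horizontal position to split the page into a left and a right column, then order the text within each column separately: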

def extract_text_from_pdf(pdf_path: str) -> str:
    doc = fitz.open(pdf_path)
    text = ""
    for page in doc:
        blocks = page.get_text("blocks")
        # Use the horizontal middle of the page as the column threshold
        page_width = page.rect.width
        column_threshold = page_width / 2

        # Split the blocks into left and right columns
        left_blocks = []
        right_blocks = []
        for b in blocks:
            block_x0 = b[0]  # x-coordinate of the block's left edge
            if block_x0 < column_threshold:
                left_blocks.append(b)
            else:
                right_blocks.append(b)

        # Sort each column by vertical position
        left_blocks.sort(key=lambda x: x[1])
        right_blocks.sort(key=lambda x: x[1])

        # Emit the left column first, then the right column
        for b in left_blocks:
            text += b[4].strip() + "\n"
        text += "\n"  # Column separator
        for b in right_blocks:
            text += b[4].strip() + "\n"
        text += "\n"  # Page separator
    return text
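One case the version above does not cover: `get_text("blocks")` also returns image blocks (`block_type` 1 in the last tuple field), whose text field holds an image placeholder rather than resume text. This matters for resumes that include a photo. The sketch below is one possible way to filter them out while keeping the same left-then-right ordering. `order_two_column_blocks` and `split_ratio` are hypothetical names, and the function works on plain tuples so it can be tried without a PDF.

```python
def order_two_column_blocks(blocks, page_width, split_ratio=0.5):
    """Return block texts in left-column-then-right-column reading order.

    blocks: tuples shaped like page.get_text("blocks") items, i.e.
            (x0, y0, x1, y1, text, block_no, block_type).
    split_ratio: assumed column boundary as a fraction of the page width.
    """
    split_x = page_width * split_ratio
    # Keep text blocks only (block_type 0) and drop empty ones
    text_blocks = [b for b in blocks if b[6] == 0 and b[4].strip()]
    left = sorted((b for b in text_blocks if b[0] < split_x), key=lambda b: b[1])
    right = sorted((b for b in text_blocks if b[0] >= split_x), key=lambda b: b[1])
    return [b[4].strip() for b in left + right]


# Made-up sample data: one image block plus two text blocks per column
sample = [
    (320, 55, 560, 75, "Work experience\n", 2, 0),
    (40, 90, 200, 110, "Skills\n", 1, 0),
    (40, 10, 120, 45, "<image: placeholder>", 4, 1),
    (320, 95, 560, 115, "Projects\n", 3, 0),
    (40, 50, 200, 70, "Contact\n", 0, 0),
]
print(order_two_column_blocks(sample, page_width=595))
# ['Contact', 'Skills', 'Work experience', 'Projects']
```

Inside the page loop this would be called as `order_two_column_blocks(page.get_text("blocks"), page.rect.width)`.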

Method 2: Fine-grained sorting at the word level

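If the block segmentation is unreliable, use page.get_text("words") to get the coordinates of every word. Group the words into lines, sort the words in each line by horizontal position, then join them into the full text: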

def extract_text_from_pdf(pdf_path: str) -> str:
    doc = fitz.open(pdf_path)
    text = ""
    for page in doc:
        words = page.get_text("words")  # Each item: (x0, y0, x1, y1, word, ...)
        # Group by y-coordinate; words on the same line have similar y values
        lines = {}
        for word in words:
            y_coord = round(word[1], 1)  # Round to 1 decimal place to absorb floating-point noise
            if y_coord not in lines:
                lines[y_coord] = []
            lines[y_coord].append(word)

        # Sort the lines by vertical position, then the words in each line by horizontal position
        sorted_lines = sorted(lines.items(), key=lambda x: x[0])
        for y, line_words in sorted_lines:
            line_words.sort(key=lambda w: w[0])
            line_text = " ".join([w[4] for w in line_words])
            text += line_text + "\n"
        text += "\n"  # Page separator
    return text
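Two details of the word-level version are worth a hedged sketch. Rounding y0 to one decimal can still put words from the same visual line into different groups, and grouping lines across the whole page joins words from both columns into one line. The sketch below groups words with a vertical tolerance and splits them by column first. `group_words_into_lines`, `words_to_text_two_columns`, `y_tol`, and `split_ratio` are hypothetical names, and the 3-point tolerance is an assumption to tune per document.

```python
def group_words_into_lines(words, y_tol=3.0):
    """Group word tuples into text lines.

    A word joins the current line when its top edge (y0) is within
    y_tol points of the top edge of that line's first word.
    """
    lines = []
    for w in sorted(words, key=lambda w: (w[1], w[0])):
        if lines and abs(w[1] - lines[-1][0][1]) <= y_tol:
            lines[-1].append(w)
        else:
            lines.append([w])
    # Within each line, restore left-to-right order before joining
    return [" ".join(x[4] for x in sorted(line, key=lambda w: w[0])) for line in lines]


def words_to_text_two_columns(words, page_width, split_ratio=0.5, y_tol=3.0):
    """Emit the left column's lines first, then the right column's lines."""
    split_x = page_width * split_ratio
    left = [w for w in words if w[0] < split_x]
    right = [w for w in words if w[0] >= split_x]
    return "\n".join(
        group_words_into_lines(left, y_tol) + group_words_into_lines(right, y_tol)
    )


# Made-up words shaped like page.get_text("words") items:
# (x0, y0, x1, y1, word, block_no, line_no, word_no)
sample_words = [
    (40, 50.0, 70, 62, "Python", 0, 0, 0),
    (75, 50.4, 100, 62, "SQL", 0, 0, 1),
    (320, 50.2, 380, 62, "Senior", 1, 0, 0),
    (385, 50.0, 450, 62, "Engineer", 1, 0, 1),
    (40, 70.0, 90, 82, "Docker", 0, 1, 0),
]
print(words_to_text_two_columns(sample_words, page_width=595))
# Python SQL
# Docker
# Senior Engineer
```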

Notes

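Method 1 suits regular two-column layouts. If the column boundary is not exactly centered, adjust the ratio used for column_threshold to match the actual resume (for example 0.45 or 0.55 of the page width).
Method 2 does not depend on block segmentation, so it tolerates more complex layouts, but it is somewhat slower, and you should check that joining words with single spaces matches the original formatting. As written it also reads each line across the full page width, so on a two-column page it still needs a column split like the one in Method 1.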
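Rather than tuning the ratio by hand, one option is to estimate it from the blocks themselves. The sketch below is a heuristic, not a PyMuPDF feature: it collects the left edges (x0) of the text blocks, finds the widest gap between neighbouring distinct values, and uses the midpoint of that gap as the threshold. Because it is a threshold on left edges, the result usually lies further left than the visual gutter. `estimate_split_ratio` is a hypothetical helper, and it assumes the page really has two columns.

```python
def estimate_split_ratio(blocks, page_width, default=0.5):
    """Guess a column threshold as a fraction of the page width.

    blocks: tuples shaped like page.get_text("blocks") items, i.e.
            (x0, y0, x1, y1, text, block_no, block_type).
    Returns `default` when there are fewer than two distinct left edges.
    """
    xs = sorted({round(b[0]) for b in blocks if b[6] == 0})
    if len(xs) < 2:
        return default
    # The widest gap between neighbouring left edges separates the columns
    _, left_edge, right_edge = max(
        (right - left, left, right) for left, right in zip(xs, xs[1:])
    )
    return (left_edge + right_edge) / 2 / page_width


# Made-up blocks: left column starts near x = 40-60, right column at x = 320
sample_blocks = [
    (40, 50, 200, 70, "Contact", 0, 0),
    (60, 90, 200, 110, "Skills", 1, 0),
    (320, 55, 560, 75, "Work experience", 2, 0),
    (320, 95, 560, 115, "Projects", 3, 0),
]
ratio = estimate_split_ratio(sample_blocks, page_width=600)
print(round(ratio, 3))
# 0.317
```

In Method 1 the result would replace the fixed midpoint, for example `column_threshold = page_width * estimate_split_ratio(blocks, page_width)`. A single-column page would still be split somewhere, so a minimum-gap check may be worth adding before trusting the estimate.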

The question originates from Stack Exchange; asked by Nasser 23.
