
Splitting a PDF by font size in Python: help with the coordinate mismatch between pdfplumber and pyPdf

Solution for Splitting PDFs by Font Size in Python

Let's tackle your problem from two practical angles: using an all-in-one library that avoids coordinate headaches, and fixing the coordinate mismatch if you want to stick with pdfplumber + PyPDF.

Option 1: Use PyMuPDF (fitz) – The All-in-One Solution

PyMuPDF is well suited to this job. It lets you extract text with full font properties (size, position, font name) and copy page regions into new pages, all within a single coordinate system, so no conversion between libraries is needed.

Here's a quick example that splits a PDF by extracting regions containing text of a specific font size:

import fitz  # PyMuPDF

def split_pdf_by_font_size(input_path, output_path, target_font_size):
    doc = fitz.open(input_path)
    output_doc = fitz.open()

    for page_num in range(len(doc)):
        page = doc[page_num]
        # Extract all text blocks with detailed font info
        text_blocks = page.get_text("dict")["blocks"]
        
        for block in text_blocks:
            if "lines" not in block:
                continue
            # Check if any span in the block uses our target font size.
            # Sizes are floats, so compare with a small tolerance instead of ==
            has_target_size = any(
                any(abs(span["size"] - target_font_size) < 0.01 for span in line["spans"])
                for line in block["lines"]
            )
            if has_target_size:
                # Get the bounding box of the matching block
                rect = fitz.Rect(block["bbox"])
                # Create a new page sized to the block
                new_page = output_doc.new_page(width=rect.width, height=rect.height)
                # Copy the target region from original to new page
                new_page.show_pdf_page(new_page.rect, doc, page_num, clip=rect)
    
    # PyMuPDF refuses to save a document with zero pages
    if output_doc.page_count > 0:
        output_doc.save(output_path)
    output_doc.close()
    doc.close()

# Usage example
split_pdf_by_font_size("input.pdf", "output.pdf", 12.0)

This code pulls out every block of text that includes your target font size, creates dedicated pages for those blocks, and saves them into a clean new PDF. No coordinate confusion at all!
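
Font sizes reported by PyMuPDF are floats and are often not round numbers (a nominal 12 pt font may come back as something like 11.96), so it can help to see which sizes a page actually contains before picking a target. Below is a minimal sketch, assuming the block/line/span layout that page.get_text("dict") returns. The helper name count_font_sizes and the sample data are hypothetical, made up for illustration:

```python
from collections import Counter

def count_font_sizes(blocks, ndigits=1):
    """Count characters per rounded font size in a PyMuPDF-style block list."""
    sizes = Counter()
    for block in blocks:
        # Image blocks carry no "lines" key, so they contribute nothing
        for line in block.get("lines", []):
            for span in line["spans"]:
                sizes[round(span["size"], ndigits)] += len(span["text"])
    return sizes

# Hypothetical sample shaped like page.get_text("dict")["blocks"]
sample_blocks = [
    {"lines": [{"spans": [{"size": 11.96, "text": "Body text"}]}]},
    {"lines": [{"spans": [{"size": 18.0, "text": "Title"}]}]},
    {"image": b""},  # an image block, skipped
]
font_size_counts = count_font_sizes(sample_blocks)
for size, n_chars in font_size_counts.most_common():
    print(size, n_chars)
```

With a real document you would pass page.get_text("dict")["blocks"] for each page and then choose the target size from the printed list.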

Option 2: Fix Coordinate Mismatch Between pdfplumber and PyPDF

If you prefer sticking with pdfplumber and PyPDF, the mismatch usually comes from two things: where the y-axis starts, and page rotation. pdfplumber reports coordinates relative to the top-left corner of the page as you see it (top and bottom are distances from the top edge, and any page rotation has already been applied), while PyPDF works in the PDF's native coordinate system, whose origin is the bottom-left corner of the unrotated page.

Here's how to convert pdfplumber's coordinates to PyPDF's system:

  1. First, grab the page rotation and dimensions from pdfplumber
  2. Flip the y-axis, then adjust the bounding box for the rotation angle so it matches PyPDF's raw coordinates
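
For an unrotated page, the conversion reduces to flipping the vertical axis. Here is a small sketch of that one case, assuming pdfplumber's (x0, top, x1, bottom) order with distances measured from the top edge, and a page whose MediaBox starts at (0, 0). The function name plumber_to_pdf_bbox and the sample numbers are illustrative only:

```python
def plumber_to_pdf_bbox(bbox, page_height):
    """Convert (x0, top, x1, bottom) with a top-left origin to
    (left, bottom, right, top) with a bottom-left origin."""
    x0, top, x1, bottom = bbox
    return (x0, page_height - bottom, x1, page_height - top)

# A word whose top edge sits 100 pt below the top of a 792 pt tall page
pdf_bbox = plumber_to_pdf_bbox((72, 100, 200, 112), page_height=792)
print(pdf_bbox)  # (72, 680, 200, 692)
```

Rotated pages need an extra axis swap on top of this flip.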

Here's a full working example:

import pdfplumber
from PyPDF2 import PdfReader, PdfWriter
from PyPDF2.generic import RectangleObject

def convert_coords(pdfplumber_bbox, page_width, page_height, rotation):
    """Map a pdfplumber bbox (top-left origin, rotation applied) to the PDF's
    native space (bottom-left origin, unrotated) as (left, bottom, right, top).

    page_width and page_height are the dimensions pdfplumber reports.
    The rotated cases assume the MediaBox starts at (0, 0) and follow the way
    pdfminer (which pdfplumber builds on) applies /Rotate; verify on your files.
    """
    x0, top, x1, bottom = pdfplumber_bbox
    if rotation == 90:
        return (top, x0, bottom, x1)
    elif rotation == 180:
        return (page_width - x1, top, page_width - x0, bottom)
    elif rotation == 270:
        return (page_height - bottom, page_width - x1,
                page_height - top, page_width - x0)
    # No rotation (or an unexpected value): only flip the y-axis
    return (x0, page_height - bottom, x1, page_height - top)

def split_with_pdfplumber_pypdf(input_path, output_path, target_font_size, tol=0.01):
    reader = PdfReader(input_path)
    writer = PdfWriter()

    with pdfplumber.open(input_path) as pdf:
        for page_num, pdf_page in enumerate(pdf.pages):
            pypdf_page = reader.pages[page_num]
            rotation = pdf_page.rotation
            page_height = pdf_page.height
            page_width = pdf_page.width

            # Extract all characters matching our target font size
            # (sizes are floats, so compare within a small tolerance)
            matching_chars = [
                char for char in pdf_page.chars
                if abs(char["size"] - target_font_size) <= tol
            ]
            if not matching_chars:
                continue

            # Calculate the bounding box that covers all matching characters.
            # In pdfplumber, top is the smaller y value and bottom the larger one
            x0 = min(c["x0"] for c in matching_chars)
            top = min(c["top"] for c in matching_chars)
            x1 = max(c["x1"] for c in matching_chars)
            bottom = max(c["bottom"] for c in matching_chars)
            pdfplumber_bbox = (x0, top, x1, bottom)

            # Convert to PyPDF's coordinate system
            crop_box = convert_coords(pdfplumber_bbox, page_width, page_height, rotation)
            # The box must be a PDF object, not a plain tuple
            pypdf_page.cropbox = RectangleObject(crop_box)

            # Add the cropped page to our output
            writer.add_page(pypdf_page)

    with open(output_path, "wb") as out_file:
        writer.write(out_file)

# Usage example
split_with_pdfplumber_pypdf("input.pdf", "output.pdf", 12.0)

Once the coordinates are converted correctly, the cropped region should line up with the character positions pdfplumber reports. It is worth checking the result on at least one rotated page.

This question comes from Stack Exchange; asked by Suyog Chadawar
