Splitting a PDF by font size in Python: help with the coordinate mismatch between pdfplumber and pyPdf
Let's tackle your problem from two practical angles: using an all-in-one library that avoids coordinate headaches, and fixing the coordinate mismatch if you want to stick with pdfplumber + PyPDF.
Option 1: Use PyMuPDF (fitz) – The All-in-One Solution
PyMuPDF is well suited to this job. It extracts text with its font properties (size, position, family) and can copy pages or page regions, all within one library and one coordinate system, so there is no conversion between libraries to get wrong.
Here's a quick example that splits a PDF by extracting regions containing text of a specific font size:
```python
import fitz  # PyMuPDF


def split_pdf_by_font_size(input_path, output_path, target_font_size, tol=0.1):
    doc = fitz.open(input_path)
    output_doc = fitz.open()  # new, empty PDF

    for page_num in range(len(doc)):
        page = doc[page_num]
        # Extract all text blocks with detailed font info
        text_blocks = page.get_text("dict")["blocks"]

        for block in text_blocks:
            # Image blocks have no "lines" key
            if "lines" not in block:
                continue

            # Check if any span in the block uses the target font size.
            # Sizes are floats, so compare within a tolerance instead of ==.
            has_target_size = any(
                abs(span["size"] - target_font_size) <= tol
                for line in block["lines"]
                for span in line["spans"]
            )
            if not has_target_size:
                continue

            # Get the bounding box of the matching block
            rect = fitz.Rect(block["bbox"])
            # Create a new page sized to the block
            new_page = output_doc.new_page(width=rect.width, height=rect.height)
            # Copy the target region from the original page to the new page
            new_page.show_pdf_page(new_page.rect, doc, page_num, clip=rect)

    # Only save if something matched; an empty document cannot be saved
    if len(output_doc) > 0:
        output_doc.save(output_path)
    output_doc.close()
    doc.close()


# Usage example
split_pdf_by_font_size("input.pdf", "output.pdf", 12.0)
```
This code pulls out every text block that contains a span in your target font size, gives each block its own page sized to the block, and saves the result to a new PDF. Because extraction and cropping happen in the same library, no coordinate conversion is needed.
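Font sizes stored in a PDF are often not round numbers (a nominal 12 pt font may show up as 11.98), so it can help to check which sizes actually occur before picking a target. Here's a minimal sketch, assuming the input has the same shape as the dictionary returned by page.get_text("dict"). The helper name font_size_histogram and the sample data are made up for illustration:

```python
from collections import Counter


def font_size_histogram(page_dict, ndigits=1):
    """Count characters per rounded font size in a PyMuPDF-style text dict."""
    counts = Counter()
    for block in page_dict.get("blocks", []):
        # Image blocks have no "lines" key, so fall back to an empty list
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                counts[round(span["size"], ndigits)] += len(span["text"])
    return counts


# Hand-built sample in the same shape as page.get_text("dict")
sample_page = {
    "blocks": [
        {"lines": [{"spans": [{"size": 11.98, "text": "Heading"}]}]},
        {"lines": [{"spans": [{"size": 9.0, "text": "Body text"},
                              {"size": 12.02, "text": "Note"}]}]},
        {"image": b""},  # image block without text
    ]
}

histogram = font_size_histogram(sample_page)
```

With a real document you would pass doc[0].get_text("dict") instead of sample_page, then choose the target size from the keys of the result.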
Option 2: Fix Coordinate Mismatch Between pdfplumber and PyPDF
If you prefer sticking with pdfplumber and PyPDF, the mismatch comes from two differences in coordinate conventions. pdfplumber reports positions in the page's displayed orientation with the origin at the top-left (top and bottom are distances from the top edge of the page as you see it), while PyPDF works in the page's raw PDF coordinate system: origin at the bottom-left, before any page rotation is applied.
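For an unrotated page the conversion is just a vertical flip. Here's a minimal sketch, assuming pdfplumber's documented convention that top and bottom are measured from the top edge, and assuming the page's MediaBox starts at (0, 0). The function name plumber_bbox_to_pdf_box is hypothetical:

```python
def plumber_bbox_to_pdf_box(bbox, page_height):
    """Convert (x0, top, x1, bottom), measured from the top-left corner,
    to (x_left, y_lower, x_right, y_upper), measured from the bottom-left."""
    x0, top, x1, bottom = bbox
    # The lower edge in PDF space comes from pdfplumber's "bottom"
    return (x0, page_height - bottom, x1, page_height - top)


# A word that starts 100 pt below the top edge of a 792 pt tall page
box = plumber_bbox_to_pdf_box((72, 100, 200, 112), page_height=792)
```

Here box comes out as (72, 680, 200, 692): the horizontal extent is unchanged, and the vertical extent is now measured up from the bottom edge.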
Here's how to convert pdfplumber's coordinates to PyPDF's system:
- First, grab the page rotation and dimensions from pdfplumber
- Flip the vertical axis and adjust the bounding box for the rotation angle to match PyPDF's raw coordinates
Here's a full working example:
```python
import pdfplumber
from PyPDF2 import PdfReader, PdfWriter
from PyPDF2.generic import RectangleObject


def convert_coords(pdfplumber_bbox, page_width, page_height, rotation):
    """Map a pdfplumber bbox (x0, top, x1, bottom) to a PDF box
    (x_left, y_lower, x_right, y_upper).

    Assumes page_width/page_height are the displayed (rotated) dimensions
    reported by pdfplumber and that the MediaBox origin is (0, 0).
    Check the result on a rotated sample page before relying on it.
    """
    x0, top, x1, bottom = pdfplumber_bbox
    # Convert based on page rotation
    if rotation == 90:
        return (top, x0, bottom, x1)
    elif rotation == 180:
        return (page_width - x1, top, page_width - x0, bottom)
    elif rotation == 270:
        return (page_height - bottom, page_width - x1,
                page_height - top, page_width - x0)
    # rotation == 0 (or an unexpected value): only flip the vertical axis
    return (x0, page_height - bottom, x1, page_height - top)


def split_with_pdfplumber_pypdf(input_path, output_path, target_font_size, tol=0.1):
    reader = PdfReader(input_path)
    writer = PdfWriter()

    with pdfplumber.open(input_path) as pdf:
        for page_num in range(len(reader.pages)):
            pdf_page = pdf.pages[page_num]
            pypdf_page = reader.pages[page_num]

            # Extract all characters matching the target font size (within tol)
            matching_chars = [
                char for char in pdf_page.chars
                if abs(char["size"] - target_font_size) <= tol
            ]
            if not matching_chars:
                continue

            # Bounding box covering all matching characters.
            # top/bottom are distances from the top edge, so top < bottom.
            x0 = min(c["x0"] for c in matching_chars)
            top = min(c["top"] for c in matching_chars)
            x1 = max(c["x1"] for c in matching_chars)
            bottom = max(c["bottom"] for c in matching_chars)

            # Convert to PyPDF's coordinate system
            crop_box = convert_coords(
                (x0, top, x1, bottom),
                pdf_page.width, pdf_page.height, pdf_page.rotation,
            )
            # Wrap in RectangleObject, the type PyPDF uses for page boxes
            pypdf_page.cropbox = RectangleObject([float(v) for v in crop_box])

            # Add the cropped page to the output
            writer.add_page(pypdf_page)

    with open(output_path, "wb") as out_file:
        writer.write(out_file)


# Usage example
split_with_pdfplumber_pypdf("input.pdf", "output.pdf", 12.0)
```
This code accounts for page rotation, so the cropped regions should line up with the character positions pdfplumber reports. Note that each output page is cropped to a single box covering all matching characters on that page.
The question comes from Stack Exchange; asked by Suyog Chadawar.




