
How do I extract specific text from a PDF by its coordinates? Help needed with a Python script

Extract Text from Specific Coordinates in PDF with PyMuPDF

Got it, let's tackle this reverse problem. You need to pull text out of a specific coordinate area in a PDF, and the tools you've tried so far only go the other way (finding the coordinates of a given text). Good news: PyMuPDF (fitz) has exactly what you need; you just have to use a different method than searchFor() (named search_for() in current PyMuPDF releases).

The Fix: Use page.get_text("words")

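This method returns every word on the page together with its bounding box. Each entry is a tuple of the form:
(x0, y0, x1, y1, text, block_no, line_no, word_no)
Where:

  • x0, y0: Top-left corner of the word's bounding box
  • x1, y1: Bottom-right corner
  • text: The word itself
  • block_no, line_no, word_no: Index of the text block, of the line within that block, and of the word within that line

Coordinates are in points, measured from the top-left corner of the page, with y increasing downward.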
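To make the tuple layout concrete, here is a minimal sketch that unpacks one entry. The sample tuples and the describe() helper are made up for illustration; real values come from page.get_text("words").

```python
# Hypothetical sample in the shape returned by page.get_text("words")
sample_words = [
    (90.0, 145.85, 142.13, 156.50, "Invoice", 0, 0, 0),
    (146.0, 145.85, 180.40, 156.50, "Number", 0, 0, 1),
]

def describe(word):
    """Unpack one word tuple into named fields (illustrative helper)."""
    x0, y0, x1, y1, text, block_no, line_no, word_no = word
    return {
        "text": text,
        "width": x1 - x0,    # horizontal extent of the bounding box
        "height": y1 - y0,   # vertical extent (y grows downward)
        "position": (block_no, line_no, word_no),
    }

for w in sample_words:
    print(describe(w))
```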

Example Implementation

Here's how to use it to extract text from a target coordinate region:

import fitz  # PyMuPDF

# Load your PDF
doc = fitz.Document("pdf_name.pdf")
page = doc.load_page(0)  # Load first page (index 0)

# Define your target coordinate area (adjust these values to your needs)
# Let's use the sample coordinates you mentioned: (90.0, 145.85) to (142.13, 156.50)
target_rect = fitz.Rect(90.0, 145.85, 142.13, 156.50)

# Get all words with their coordinates
all_words = page.get_text("words")

# Collect words that lie inside or overlap with the target rectangle
extracted_text = []
for word in all_words:
    word_rect = fitz.Rect(word[0], word[1], word[2], word[3])
    # Check if the word's rectangle intersects with our target area
    if word_rect.intersects(target_rect):
        extracted_text.append(word[4])

# Join the words into a single string
final_text = " ".join(extracted_text)
print(final_text)
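If your target area spans more than one line, joining everything with spaces flattens the line breaks. One possible refinement, sketched here under the assumption that you pass in the already filtered word tuples, is to group them by (block_no, line_no) and sort by word_no. The words_to_lines() helper and the sample data are hypothetical, not part of PyMuPDF.

```python
from collections import defaultdict

def words_to_lines(words):
    """Rebuild text line by line from get_text("words")-style tuples."""
    lines = defaultdict(list)
    for x0, y0, x1, y1, text, block_no, line_no, word_no in words:
        lines[(block_no, line_no)].append((word_no, text))
    # Order lines by (block, line), then words by their index within the line
    return "\n".join(
        " ".join(text for _, text in sorted(entries))
        for _, entries in sorted(lines.items())
    )

# Hypothetical words that matched a target rectangle, deliberately out of order
matched = [
    (150.0, 160.0, 175.0, 171.0, "2024", 0, 1, 1),
    (90.0, 145.0, 142.0, 156.0, "Invoice", 0, 0, 0),
    (90.0, 160.0, 145.0, 171.0, "January", 0, 1, 0),
    (146.0, 145.0, 180.0, 156.0, "Number", 0, 0, 1),
]
print(words_to_lines(matched))
```

In the script above you would collect the whole word tuple instead of only word[4], then call words_to_lines() on the collected list.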

Customize the Matching Logic

Depending on your needs, you can tweak how you check the word's position:

  • Use word_rect in target_rect if you want only words fully contained within the target area
  • Use target_rect in word_rect if you want words that fully contain the target area (useful if your target is a small point inside a larger text block)
  • intersects() works for partial overlaps, which is often the most flexible option
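To see how these options differ, here is a plain-Python sketch of the geometry on (x0, y0, x1, y1) tuples. It only illustrates the idea; contains() and overlaps() are hypothetical helpers, not PyMuPDF's implementation, and edge cases such as rectangles that merely touch may be handled differently by the library.

```python
def contains(outer, inner):
    """True if rectangle `inner` lies entirely within rectangle `outer`."""
    ox0, oy0, ox1, oy1 = outer
    ix0, iy0, ix1, iy1 = inner
    return ox0 <= ix0 and oy0 <= iy0 and ix1 <= ox1 and iy1 <= oy1

def overlaps(a, b):
    """True if the two rectangles share an area larger than zero."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    return ax0 < bx1 and bx0 < ax1 and ay0 < by1 and by0 < ay1

target = (90.0, 145.85, 142.13, 156.50)
inside = (95.0, 146.0, 120.0, 156.0)       # fully within the target
straddling = (130.0, 146.0, 160.0, 156.0)  # crosses the right edge
outside = (200.0, 146.0, 240.0, 156.0)     # nowhere near the target

for name, box in [("inside", inside), ("straddling", straddling), ("outside", outside)]:
    print(name, contains(target, box), overlaps(target, box))
```

Here contains(target, box) corresponds to the "word fully inside the target" check, contains(box, target) to the "word fully contains the target" check, and overlaps() to the partial-overlap check.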

Why This Works

PyMuPDF exposes the bounding box of every word directly through get_text("words"), so you don't need any extra layout analysis. By iterating through all words and comparing each bounding box to your target rectangle, you can extract exactly the text that sits in that area.

This question originates from Stack Exchange; asked by Damiano Shehaj.
