
Stuck extracting text from a specific PDF in Python: looking for an extraction approach or a way to detect problem documents

Solution 1: Extract Text from the Problematic PDF in Python

Your PDF's issue most likely comes from non-standard font embedding or broken character-encoding mappings on the later pages (for example, a missing or incorrect ToUnicode table); pdfminer and pdftotext both struggle with these edge cases. Here are two approaches to try in Python:

Approach 1: Use PyMuPDF (fitz)

PyMuPDF builds its text extraction on the MuPDF engine, and it often copes with unusual font setups that trip up other tools.

First, install it:

pip install pymupdf

Then use this script to extract text:

import fitz

def extract_pdf_text(pdf_path):
    doc = fitz.open(pdf_path)
    full_text = []
    for page_num in range(len(doc)):
        page = doc.load_page(page_num)
        # Extract text with PyMuPDF's default method
        text = page.get_text()
        # Clean up any non-printable garbage characters
        cleaned_text = ''.join([c for c in text if c.isprintable() or c in '\n\t'])
        full_text.append(cleaned_text)
    return '\n'.join(full_text)

# Test it with your PDF
pdf_path = "GNC 2013 Final Program.pdf"
extracted_text = extract_pdf_text(pdf_path)
print(extracted_text[:500])  # Verify the first few hundred characters

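This often pulls clean text even from the pages that pdftotext garbled, though a PDF whose character mappings are genuinely broken can defeat any text-layer extractor. If some pages still come out messy or empty, they may be scanned images rather than selectable text; in that case, move to the next approach.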
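To see which pages would need the fallback, one option is to keep the per-page text in a list and look for pages that yielded almost nothing. The helper below is a minimal sketch: `pages_needing_ocr` is a hypothetical name, and the 100-character threshold is simply the cutoff the OCR script in the next approach uses, not a tuned value.

```python
def pages_needing_ocr(page_texts, min_chars=100):
    """Return 1-based page numbers whose extracted text is shorter than min_chars.

    page_texts is a list with one string per page, e.g. the result of
    calling page.get_text() on every page of the document.
    """
    return [
        page_number
        for page_number, text in enumerate(page_texts, start=1)
        if len(text.strip()) < min_chars
    ]
```

Pages reported here are candidates for OCR. Pages that contain plenty of text but the wrong characters will not show up and need a content-based check instead.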

Approach 2: Combine PyMuPDF with OCR for Image Pages

If parts of the PDF are scanned images (even if other pages are text), use OCR to recover content from those pages. First install the required libraries:

pip install pymupdf pytesseract pillow

Then use this hybrid extraction script:

import fitz
import pytesseract
from PIL import Image

def extract_text_with_ocr(pdf_path, min_chars=100, zoom=3, force_ocr=False):
    full_text = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            text = page.get_text()
            # If the page has almost no selectable text, it's likely a scanned image.
            # Set force_ocr=True for PDFs whose text layer exists but is garbled.
            if force_ocr or len(text.strip()) < min_chars:
                # Render above the 72 dpi default, which is usually too coarse for OCR
                pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom), alpha=False)
                img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
                # Run OCR to extract text from the rendered image
                ocr_text = pytesseract.image_to_string(img)
                full_text.append(ocr_text)
            else:
                # Clean up and add the selectable text
                cleaned_text = ''.join([c for c in text if c.isprintable() or c in '\n\t'])
                full_text.append(cleaned_text)
    return '\n'.join(full_text)
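One practical detail: pytesseract is only a wrapper, so the Tesseract executable has to be installed separately and be discoverable. A quick pre-flight check, sketched below with the hypothetical helper name `tesseract_available`, avoids finding this out halfway through a long document.

```python
import shutil

def tesseract_available(binary_name: str = "tesseract") -> bool:
    """Return True if an executable with this name can be found on PATH."""
    return shutil.which(binary_name) is not None

if __name__ == "__main__":
    if tesseract_available():
        print("Tesseract found; the OCR fallback can run.")
    else:
        print("Tesseract not found on PATH; install it or skip the OCR branch.")
```

If Tesseract is installed outside PATH, pytesseract can be pointed at it by setting `pytesseract.pytesseract.tesseract_cmd` to the full path of the executable.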

Solution 2: Detect Problematic PDFs Before Parsing

To avoid wasting time on PDFs that will fail extraction, implement these quick checks to flag problematic files upfront:

Check 1: Scan for Garbled Text in Sample Pages

Extract text from a few key pages (start, middle, end) and check for excessive non-printable characters:

import fitz

def is_pdf_parseable(pdf_path, max_bad_ratio=0.1):
    with fitz.open(pdf_path) as doc:
        if len(doc) == 0:
            return False
        # Check first, middle, and last pages (deduplicated for short documents)
        sample_pages = sorted({0, len(doc) // 2, len(doc) - 1})

        for page_num in sample_pages:
            text = doc.load_page(page_num).get_text()
            if not text:
                continue
            # Count unprintable characters plus U+FFFD replacement characters.
            # str.isprintable() is Unicode-aware, so accented or CJK text is not penalized.
            bad = sum(1 for c in text
                      if c == '\ufffd' or not (c.isprintable() or c in '\n\t\r'))
            # If more than 10% of characters are bad, flag as problematic
            if bad / len(text) > max_bad_ratio:
                return False
    return True
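A finer-grained score can also be computed on any extracted string, whichever library produced it. The sketch below is an assumption-laden heuristic rather than a standard metric: `garble_score` is a hypothetical name, and it counts Unicode control, private-use, and unassigned characters, the U+FFFD replacement character, and the `(cid:NN)` placeholders that pdfminer emits for glyphs it cannot map.

```python
import re
import unicodedata

# pdfminer-style placeholder for a glyph with no Unicode mapping, e.g. "(cid:42)"
CID_MARKER = re.compile(r"\(cid:\d+\)")

def garble_score(text: str) -> float:
    """Return the fraction of text (0.0 to 1.0) that looks like extraction debris."""
    if not text:
        return 0.0
    # Number of characters taken up by "(cid:NN)" placeholders
    cid_chars = sum(len(match.group()) for match in CID_MARKER.finditer(text))
    bad = 0
    for ch in text:
        if ch in "\n\t\r":
            continue  # ordinary whitespace controls are fine
        category = unicodedata.category(ch)
        # Cc = control, Co = private use, Cn = unassigned
        if ch == "\ufffd" or category in ("Cc", "Co", "Cn"):
            bad += 1
    return min(1.0, (bad + cid_chars) / len(text))
```

Because the score is based on Unicode categories rather than an ASCII whitelist, legitimate non-English text scores 0.0. A cutoff such as 0.1 mirrors the threshold used above and should be tuned on your own documents.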

Check 2: Inspect for Non-Embedded Fonts

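Non-embedded or unusual fonts are a common warning sign in problematic PDFs. Many files with non-embedded standard fonts still extract fine, so treat the result as a hint rather than a verdict. Check for this: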

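import fitz

def has_suspect_fonts(pdf_path):
    with fitz.open(pdf_path) as doc:
        for page in doc:
            # Each entry is a tuple: (xref, ext, type, basefont, name, encoding)
            for xref, ext, font_type, basefont, name, encoding in page.get_fonts():
                # ext == "n/a" means there is no extractable font file,
                # which typically covers non-embedded and Type 3 fonts
                if ext == "n/a":
                    return True
                # Type 3 fonts draw glyphs as graphics and often lack usable mappings
                if font_type == "Type3":
                    return True
                # Flag custom/unknown font names (a rough heuristic)
                if "Custom" in basefont or "Unknown" in basefont:
                    return True
    return False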

Use these checks together: if is_pdf_parseable returns False or has_suspect_fonts returns True, you can skip full extraction, route the file to OCR, or flag the PDF for manual review.
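One way to wire the two checks together is a small triage function that takes them as arguments. This is a sketch only: `triage_pdf` and the `"extract"` / `"review"` labels are hypothetical names, and treating any exception as "review" is a design assumption, not something the checks themselves require.

```python
from typing import Callable

def triage_pdf(
    pdf_path: str,
    parse_check: Callable[[str], bool],
    font_check: Callable[[str], bool],
) -> str:
    """Return "extract" when both checks pass, otherwise "review".

    parse_check should return True for a readable file (like is_pdf_parseable);
    font_check should return True for a suspicious file (like has_suspect_fonts).
    """
    try:
        parseable = parse_check(pdf_path)
        suspect = font_check(pdf_path)
    except Exception:
        # A file that cannot even be opened or inspected is treated as problematic
        return "review"
    return "extract" if parseable and not suspect else "review"
```

With the functions defined earlier, the call would be `triage_pdf(pdf_path, is_pdf_parseable, has_suspect_fonts)`. Passing the checks in as parameters keeps the triage logic testable without a real PDF.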


The question originally comes from Stack Exchange; asked by blackfireize.
