Python text extraction fails on a specific PDF: looking for an extraction approach or a way to detect problematic documents
Your PDF's issue is most likely tied to non-standard font embedding or broken character-encoding mappings on the later pages. These are edge cases that tools like pdfminer and pdftotext handle poorly. Here are two fixes using Python:
Approach 1: Use PyMuPDF (fitz)
PyMuPDF has a far more robust text extraction engine than many popular libraries, and it often handles wonky font setups that trip up other tools.
First, install it:
pip install pymupdf
Then use this script to extract text:
    import fitz  # PyMuPDF

    def extract_pdf_text(pdf_path):
        doc = fitz.open(pdf_path)
        full_text = []
        for page_num in range(len(doc)):
            page = doc.load_page(page_num)
            # Extract text with PyMuPDF's default method
            text = page.get_text()
            # Clean up any non-printable garbage characters
            cleaned_text = ''.join([c for c in text if c.isprintable() or c in '\n\t'])
            full_text.append(cleaned_text)
        return '\n'.join(full_text)

    # Test it with your PDF
    pdf_path = "GNC 2013 Final Program.pdf"
    extracted_text = extract_pdf_text(pdf_path)
    print(extracted_text[:500])  # Verify the first few hundred characters
In many cases, this will pull clean text even from the pages that pdftotext garbled. If some pages still come out messy, they may be scanned images rather than selectable text; in that case, move to the next approach.
Approach 2: Combine PyMuPDF with OCR for Image Pages
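If parts of the PDF are scanned images (even if other pages are text), use OCR to recover content from those pages. Note that pytesseract is only a wrapper: the Tesseract OCR engine itself must also be installed and available on your PATH. First install the required libraries: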
pip install pymupdf pytesseract pillow
Then use this hybrid extraction script:
    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    def extract_text_with_ocr(pdf_path):
        doc = fitz.open(pdf_path)
        full_text = []
        for page_num in range(len(doc)):
            page = doc.load_page(page_num)
            text = page.get_text()
            # If the page has almost no selectable text, it's likely an image
            if len(text.strip()) < 100:
                # Render the page as an RGB image. The default 72 dpi is usually
                # too coarse for OCR; the dpi argument needs a recent PyMuPDF.
                pix = page.get_pixmap(dpi=300, alpha=False)
                img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
                # Run OCR to extract text from the image
                ocr_text = pytesseract.image_to_string(img)
                full_text.append(ocr_text)
            else:
                # Clean up and add the selectable text
                cleaned_text = ''.join([c for c in text if c.isprintable() or c in '\n\t'])
                full_text.append(cleaned_text)
        return '\n'.join(full_text)
To avoid wasting time on PDFs that will fail extraction, implement these quick checks to flag problematic files upfront:
Check 1: Scan for Garbled Text in Sample Pages
Extract text from a few key pages (start, middle, end) and check for excessive non-printable characters:
    import fitz  # PyMuPDF

    def is_pdf_parseable(pdf_path):
        doc = fitz.open(pdf_path)
        if len(doc) == 0:
            return False
        # Check first, middle, and last pages (deduplicated for short documents)
        sample_pages = sorted({0, len(doc) // 2, len(doc) - 1})
        for page_num in sample_pages:
            page = doc.load_page(page_num)
            text = page.get_text()
            if not text:
                continue
            # Calculate the fraction of unprintable characters. str.isprintable()
            # is Unicode-aware, so accented or CJK text is not miscounted the way
            # it would be with the ASCII-only string.printable.
            unprintable_ratio = sum(
                1 for c in text if not (c.isprintable() or c in '\n\t')
            ) / len(text)
            # If more than 10% of characters are unprintable, flag as problematic
            if unprintable_ratio > 0.1:
                return False
        return True
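A printable-character ratio can miss one common failure mode: broken encoding mappings may surface as characters that are technically printable, such as the U+FFFD replacement character or private-use code points. The sketch below is one possible stricter score, using only the standard library and working on plain strings. The names `suspect_ratio` and `find_messy_pages`, the choice of Unicode categories, and the 0.1 threshold are all assumptions made for illustration, not part of PyMuPDF.

```python
import unicodedata

# Unicode general categories treated as suspicious in this sketch (an assumption):
# Cc = control, Co = private use, Cn = unassigned, Cs = surrogate.
SUSPECT_CATEGORIES = {"Cc", "Co", "Cn", "Cs"}
ALLOWED_CONTROLS = {"\n", "\t", "\r"}
REPLACEMENT_CHAR = "\ufffd"


def suspect_ratio(text):
    """Return the fraction of characters that look like extraction debris."""
    if not text:
        return 0.0
    suspect = 0
    for ch in text:
        if ch in ALLOWED_CONTROLS:
            continue  # ordinary layout whitespace is never counted
        if ch == REPLACEMENT_CHAR or unicodedata.category(ch) in SUSPECT_CATEGORIES:
            suspect += 1
    return suspect / len(text)


def find_messy_pages(page_texts, threshold=0.1):
    """Return 0-based indices of pages whose suspect ratio exceeds the threshold."""
    return [i for i, text in enumerate(page_texts) if suspect_ratio(text) > threshold]
```

To use it with PyMuPDF, build the input with something like `page_texts = [page.get_text() for page in doc]`. Unlike the three-page sample, this scores every page, so the returned indices can be used to send only the damaged pages to OCR.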
Check 2: Inspect for Non-Embedded Fonts
Problematic PDFs often use non-embedded or custom fonts that break extraction. Check for this:
    import fitz  # PyMuPDF

    def has_suspect_fonts(pdf_path):
        doc = fitz.open(pdf_path)
        for page_num in range(len(doc)):
            page = doc.load_page(page_num)
            # Each entry is a tuple: (xref, ext, type, basefont, name, encoding)
            for font in page.get_fonts():
                ext, basefont = font[1], font[3]
                # Per the PyMuPDF docs, ext is "n/a" when no font file can be
                # extracted, which covers non-embedded fonts (and Type 3 fonts)
                if ext == "n/a":
                    return True
                # Flag custom/unknown font names
                if "Custom" in basefont or "Unknown" in basefont:
                    return True
        return False
Use these checks together: if is_pdf_parseable returns False or has_suspect_fonts returns True, you can skip attempting full extraction or flag the PDF for manual review.
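The two checks have opposite polarity (is_pdf_parseable returns True for a good file, has_suspect_fonts returns True for a risky one), so it is easy to combine them the wrong way round. Here is a minimal sketch of one way to wire them together. The function name `triage_pdf` and the labels "skip", "review", and "extract" are hypothetical, and the check functions are passed in as arguments so the logic can be tried out without a real PDF.

```python
from typing import Callable


def triage_pdf(
    pdf_path: str,
    is_parseable: Callable[[str], bool],
    has_suspect_fonts: Callable[[str], bool],
) -> str:
    """Return 'skip', 'review', or 'extract' for one PDF."""
    if not is_parseable(pdf_path):
        # Sampled pages already look garbled: do not attempt full extraction
        return "skip"
    if has_suspect_fonts(pdf_path):
        # Samples looked fine, but the fonts are risky: flag for a manual look
        return "review"
    return "extract"
```

With the functions from the two checks, the call would be `triage_pdf(pdf_path, is_pdf_parseable, has_suspect_fonts)`. Treating suspect fonts as "review" rather than "skip" is a design choice: such files often still extract cleanly, so dropping them outright would lose usable documents.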
This question originally comes from Stack Exchange; asked by blackfireize.




