Technical advice: improving keyword-recognition accuracy when scanning PDF images with Python
Hey there! Those early 2000s scanned PDFs are notoriously tough for OCR—low resolution, faded text, and occasional handwritten notes can throw even the best tools off. Let's walk through practical tweaks to get your accuracy up to that 75% goal (and maybe even higher!).
1. Fix the Image First: Preprocessing is Key
OCR tools rely on clear, high-contrast text. Old scans often have noise, blurriness, or uneven lighting—fixing these before OCR makes a huge difference. Try these steps with PIL/OpenCV:
- Grayscale Conversion: Strip color to reduce distractions:

  ```python
  image = image.convert('L')
  ```

- Auto-Contrast Enhancement: Make text pop against faded backgrounds:

  ```python
  from PIL import ImageOps
  image = ImageOps.autocontrast(image, cutoff=2)  # cutoff removes extreme outliers
  ```

- Binarization (Thresholding): Turn the image into black-and-white to eliminate gray areas:

  ```python
  image = image.point(lambda x: 0 if x < 140 else 255, '1')  # adjust threshold based on your PDFs
  ```

- Denoising: Smooth out speckles with a median filter (best applied to the grayscale image, before binarization):

  ```python
  from PIL import ImageFilter
  image = image.filter(ImageFilter.MedianFilter(size=3))
  ```

- Deskewing: If pages are tilted, use Tesseract's built-in orientation detection or OpenCV's Hough transform to straighten them; even a small tilt can wreck OCR.
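The deskewing step can also be sketched without OpenCV at all, using a projection-profile search with just PIL and NumPy: rotate the page over a small range of angles and keep the angle where the horizontal row sums are "peakiest", i.e. where the text lines sit level. A minimal sketch (the ±5° range and 0.5° step are assumptions to tune for your scans):

```python
import numpy as np
from PIL import Image

def estimate_skew(img, max_angle=5.0, step=0.5):
    """Brute-force skew estimation: try small rotations and score each by the
    variance of the horizontal projection profile. Level text lines produce
    sharp peaks (inky rows) and valleys (white gaps), so the variance is
    maximal when the page is straight."""
    gray = img.convert('L')
    best_angle, best_score = 0.0, -1.0
    for angle in np.arange(-max_angle, max_angle + step, step):
        rotated = gray.rotate(angle, expand=False, fillcolor=255)
        ink = 255 - np.asarray(rotated, dtype=np.float64)  # text pixels -> high values
        score = np.var(ink.sum(axis=1))
        if score > best_score:
            best_angle, best_score = float(angle), score
    return best_angle

def deskew(img):
    """Rotate the image by the estimated correction angle."""
    return img.rotate(estimate_skew(img), expand=False, fillcolor=255)
```

This is slower than a Hough-based approach but has no extra dependencies and is easy to bound (it never rotates more than `max_angle` degrees).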
2. Tune Tesseract for Your Use Case
Pytesseract uses default settings that don't always work for old documents. Customize these parameters to target your keyword format (letters + hyphen + 3 digits):
- Character Whitelist: Tell Tesseract to only recognize relevant characters; this cuts down on random errors:

  ```python
  custom_config = r'-c tessedit_char_whitelist=abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-'
  ```

- Page Segmentation Mode (PSM): For single-column text (common in old forms), use `--psm 6` (assumes a single uniform text block). If there's scattered text or handwriting, try `--psm 11` (sparse text):

  ```python
  custom_config += r' --psm 6'
  ```

- Enable the LSTM Model: Make sure you're on Tesseract 4+ (which uses LSTM by default) with `--oem 3` (the default) for better text recognition, especially for slightly distorted print.
- Update Tesseract: Old versions have noticeably worse accuracy; grab the latest release for free from the official repo.
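Because the targets all share one shape (letters, a hyphen, three digits), it can also help to run a quick regex pass over the OCR output and collect candidate codes before any matching. A stdlib sketch (the letters-hyphen-three-digits pattern is an assumption based on examples like `a-001`):

```python
import re

# Assumed code shape: one or more letters, a hyphen, exactly three digits
CODE_PATTERN = re.compile(r'\b[a-zA-Z]+-\d{3}\b')

def extract_codes(text):
    """Pull normalized candidate codes out of raw OCR text."""
    return [match.lower() for match in CODE_PATTERN.findall(text)]
```

Filtering to this shape first means any downstream comparison only runs against plausible candidates instead of whole pages of noisy text.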
3. Improve PDF-to-Image Conversion
PyMuPDF renders pages at 72 DPI by default, which is far too low for OCR. Crank up the DPI to match standard scan quality:

```python
pix = doc.load_page(page_num).get_pixmap(dpi=300)  # 300 DPI is standard for scanned docs
```

(If your PyMuPDF version predates the `dpi` parameter, pass `matrix=fitz.Matrix(300 / 72, 300 / 72)` instead.) Sharper text edges make Tesseract's job much easier.
4. Use Fuzzy Matching Instead of Exact Matches
OCR might misread a character (e.g., b-002 becomes b-003 or d-002). Instead of strict equality, use partial fuzzy matching to catch near-misses:

- Install `fuzzywuzzy` (and `python-Levenshtein` for speed; note the project has since been renamed `thefuzz`, and `rapidfuzz` is a faster alternative):

  ```shell
  pip install fuzzywuzzy python-Levenshtein
  ```

- Check for partial matches against a threshold (adjust based on your needs):

  ```python
  from fuzzywuzzy import fuzz

  FUZZ_THRESHOLD = 80
  for keyword in KEYWORDS:
      if fuzz.partial_ratio(keyword.lower(), text.lower()) >= FUZZ_THRESHOLD:
          return True
  ```

This will catch cases where OCR makes small typos but the core keyword structure is intact.
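If adding a dependency is a problem, a rough stand-in for `partial_ratio` can be built from the standard library's `difflib` by sliding the keyword across the text and keeping the best window score. A sketch (`fuzzy_contains` and the 0.8 threshold are illustrative choices, not fuzzywuzzy's API, and the fixed-size window is only an approximation of the real alignment):

```python
from difflib import SequenceMatcher

def fuzzy_contains(keyword, text, threshold=0.8):
    """Approximate fuzz.partial_ratio with the stdlib: compare the keyword
    against every same-length window of the text and keep the best score."""
    keyword, text = keyword.lower(), text.lower()
    n = len(keyword)
    if n == 0 or len(text) < n:
        return SequenceMatcher(None, keyword, text).ratio() >= threshold
    best = max(
        SequenceMatcher(None, keyword, text[i:i + n]).ratio()
        for i in range(len(text) - n + 1)
    )
    return best >= threshold
```

For short codes like these the quadratic cost of the sliding window is negligible per page.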
5. Try Alternative OCR Tools (If Tesseract Still Falls Short)
If you're still struggling with handwritten content or extremely faded text, give these a shot:
- EasyOCR: A standalone deep-learning OCR library built on PyTorch (not a Tesseract wrapper), with its own text-detection and recognition models and support for many languages out of the box. It often copes better with noisy scans and irregular text.
- Google Cloud Vision OCR: Requires an API key; excellent for handwritten text and low-quality scans, but it's cloud-based, so you need internet access.
Example Optimized Code
Here’s how to put all these pieces together:
```python
import fitz  # PyMuPDF
from PIL import Image, ImageOps, ImageFilter
import pytesseract
from fuzzywuzzy import fuzz

KEYWORDS = ["a-001", "b-002", "c-003"]
FUZZ_THRESHOLD = 80

def preprocess_image(image):
    gray = image.convert('L')
    contrast = ImageOps.autocontrast(gray, cutoff=2)
    # Denoise while still grayscale, then binarize
    denoised = contrast.filter(ImageFilter.MedianFilter(size=3))
    binarized = denoised.point(lambda x: 0 if x < 140 else 255, '1')
    return binarized

def pdf_contains_keyword(pdf_path):
    doc = fitz.open(pdf_path)
    custom_config = r'--oem 3 --psm 6 -c tessedit_char_whitelist=abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-'
    try:
        for page_num in range(len(doc)):
            # Extract a high-res image of the page
            pix = doc.load_page(page_num).get_pixmap(dpi=300)
            img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
            # Preprocess the image
            processed_img = preprocess_image(img)
            # Run OCR with custom settings
            text = pytesseract.image_to_string(processed_img, config=custom_config)
            # Check for fuzzy matches
            for keyword in KEYWORDS:
                if fuzz.partial_ratio(keyword.lower(), text.lower()) >= FUZZ_THRESHOLD:
                    return True
        return False
    finally:
        doc.close()

# Test with your PDF
print(pdf_contains_keyword("b.pdf"))
```
Start with these changes—tweak the threshold values, DPI, and PSM mode based on your specific PDFs. You’ll likely see a big jump in accuracy without needing perfect OCR for handwritten content.
The question was originally asked on Stack Exchange by ysga.