Help request: split() breaks words apart abnormally when extracting PDF text with PyPDF2

阿华AIGC实验室

2026-5-15

Hey there! Let's break down what's going wrong with your PDF processing code and fix it step by step.

Core Issues Identified

1. Variable Overwriting Causes Data Loss

Looking at your PDF code, you're reusing the textfile variable for two completely different purposes:

textfile = open(f, 'rb')  # First, it's a PDF file object
text = PyPDF2.PdfFileReader(textfile)
for pageNum in range(0, text.numPages):
    texts = text.getPage(pageNum)
    textfile = texts.extractText().split()  # Now it's a list of tokens from the current page!

This leads to two big problems:

All text from previous pages gets discarded, so only the last page's content is processed
You lose the reference to the original file object (a bad practice even if it doesn't break things here)
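The overwriting problem can be reproduced without a PDF at all. The following is a minimal sketch that uses a hypothetical list of strings (pages) as a stand-in for the text of each page; it is not PyPDF2 code, just an illustration of why reassigning one name inside the loop keeps only the last page.

```python
# Hypothetical stand-in for the extracted text of each PDF page
# (plain strings, not real PyPDF2 page objects).
pages = ["alpha beta", "gamma delta", "epsilon"]

# Buggy pattern: the same name is reassigned on every iteration,
# so after the loop it holds only the last page's tokens.
for page_text in pages:
    tokens = page_text.split()
last_page_only = tokens

# Fixed pattern: accumulate tokens from every page in a separate list.
all_tokens = []
for page_text in pages:
    all_tokens.extend(page_text.split())

print(last_page_only)  # ['epsilon']
print(all_tokens)      # ['alpha', 'beta', 'gamma', 'delta', 'epsilon']
```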

2. PyPDF2's Text Extraction Limitations

PyPDF2's extractText() is notoriously unreliable for complex PDF layouts. It often mangles text by splitting words across lines, inserting spurious spaces inside words, or leaving hidden control characters. split() itself only splits on whitespace, so when it runs on this messy output, the stray whitespace turns a single word into fragments like 'w', 'o', 'rld'.
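Here is a small sketch of that effect, using a made-up string (raw) that imitates extraction output with stray whitespace inside a word. It also shows that, for this input, collapsing whitespace with a regex first gives exactly the same tokens as a bare split(), so the fragments come from the extractor rather than from split().

```python
import re

# Hypothetical extraction output: "world" has been broken up by stray whitespace.
raw = "hello  w o\nrld"

# A bare split() already splits on any run of whitespace.
plain = raw.split()

# Collapsing whitespace first does not change the resulting tokens.
normalized = re.sub(r"\s+", " ", raw).strip().split()

print(plain)                # ['hello', 'w', 'o', 'rld']
print(plain == normalized)  # True
```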

Fixes to Try

Fix 1: Correct Variable Logic & Optimize PyPDF2 Text Handling

First, fix the variable overwriting so that tokens from every page are kept, and normalize whitespace before tokenizing:

import os
import re
import PyPDF2

# Read category vocabulary list
dictfile = open('list.txt')
lines = dictfile.readlines()
dictfile.close()

dic = {}
scores = {}

# Initialize score dictionary
current_category = "Default"
scores[current_category] = 0
for line in lines:
    if line[0:2] == '>>':
        current_category = line[2:].strip()
        scores[current_category] = 0
    else:
        line = line.strip()
        if len(line) > 0:
            pattern = re.compile(line, re.IGNORECASE)
            dic[pattern] = current_category

# Process PDF files
i = 2011
while i < 2012:
    f = 'annual_report_' + str(i) +'.pdf'
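    # Use a dedicated variable for the file object, and a with statement for auto-closing.
    # Note: PdfFileReader, numPages, getPage and extractText were removed in PyPDF2 3.0;
    # on newer versions use PdfReader, .pages and extract_text() instead.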
    with open(f, 'rb') as pdf_file:
        pdf_reader = PyPDF2.PdfFileReader(pdf_file)
        all_tokens = []  # Collect tokens from ALL pages
        for page_num in range(pdf_reader.numPages):
            page = pdf_reader.getPage(page_num)
            page_text = page.extractText()
            # Normalize whitespace: replace newlines, tabs, and repeated spaces with single spaces
            # (split() would handle these anyway; this keeps cleaned_text readable for debugging)
            cleaned_text = re.sub(r'\s+', ' ', page_text).strip()
            page_tokens = cleaned_text.split()
            all_tokens.extend(page_tokens)
    
    # Count vocabulary matches
    for token in all_tokens:
        for pattern, categ in dic.items():
            # match() only succeeds if the pattern matches at the start of the token
            if pattern.match(token):
                scores[categ] += 1
    
    print(os.path.basename(f))
    for key in scores.keys():
        print(f"{key}: {scores[key]}")
    i += 1

Key improvements here:

Uses a with statement to safely manage the PDF file (auto-closes when done)
Collects tokens from every page into a single all_tokens list
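Normalizes whitespace before splitting. Note that split() with no arguments already splits on any run of whitespace, so this step tidies the text but cannot rejoin words that the extractor has already broken apart; for that, see Fix 2.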
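To check the counting logic independently of PDF extraction, it can help to run it on in-memory data. The sketch below assumes the same '>>Category' convention for list.txt as the code above; the function names (build_patterns, count_matches) and the sample vocabulary and tokens are hypothetical, chosen only for illustration. Unlike the code above, it strips each line before checking for the '>>' prefix.

```python
import re

def build_patterns(lines):
    """Map each compiled pattern to its category, using '>>Category' header lines."""
    patterns = {}
    scores = {"Default": 0}
    current = "Default"
    for line in lines:
        line = line.strip()
        if line.startswith(">>"):
            # A header line starts a new category.
            current = line[2:].strip()
            scores[current] = 0
        elif line:
            # Any other non-empty line is a regex belonging to the current category.
            patterns[re.compile(line, re.IGNORECASE)] = current
    return patterns, scores

def count_matches(tokens, patterns, scores):
    """Add one to a category's score for every token its pattern matches at the start."""
    for token in tokens:
        for pattern, category in patterns.items():
            if pattern.match(token):
                scores[category] += 1
    return scores

# Hypothetical vocabulary lines and tokens, standing in for list.txt and the PDF text.
vocab_lines = [">>Risk\n", "risk\n", "uncertain\n", ">>Growth\n", "grow\n"]
tokens = ["Risk", "growth", "risky", "stable", "Uncertainty"]

patterns, scores = build_patterns(vocab_lines)
result = count_matches(tokens, patterns, scores)
print(result)  # {'Default': 0, 'Risk': 3, 'Growth': 1}
```

If the scores look right here but wrong on a real report, the problem is most likely in the extracted text rather than in the matching loop.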

Fix 2: Switch to a More Reliable PDF Extraction Library (Recommended)

If PyPDF2 still doesn't extract text correctly, try pdfplumber. It is built on pdfminer.six and tends to handle complex layouts and word boundaries more accurately.

First install it:

pip install pdfplumber

Then modify the PDF processing section:

import pdfplumber

# ... Keep the vocabulary reading and score initialization code the same ...

i = 2011
while i < 2012:
    f = 'annual_report_' + str(i) +'.pdf'
    all_tokens = []
    with pdfplumber.open(f) as pdf:
        for page in pdf.pages:
            page_text = page.extract_text()
            if page_text:  # Skip empty pages to avoid errors
                cleaned_text = re.sub(r'\s+', ' ', page_text).strip()
                page_tokens = cleaned_text.split()
                all_tokens.extend(page_tokens)
    
    # ... Keep the vocabulary counting and printing code the same,
    # including the final i += 1 (without it this loop never ends) ...

pdfplumber generally does a better job of preserving word boundaries, which should reduce the random splitting considerably, though results still depend on how the PDF was produced.

Extra Tip: Refine Your Matching Logic

Your current pattern.match(token) only matches from the start of the word. If you want to match words that contain your target term (not just start with it), use pattern.search(token) instead, or adjust your regex to include wildcards (e.g., re.compile(r'.*' + line + '.*', re.IGNORECASE)).
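The difference is easy to see on a few sample tokens. In this sketch the vocabulary term and the tokens are made up for illustration; fullmatch() is included as a third option (not mentioned above) for cases where the whole token must equal the term.

```python
import re

term = "port"  # hypothetical vocabulary entry
pattern = re.compile(term, re.IGNORECASE)

tokens = ["Portfolio", "report", "port"]

# match(): the pattern must match at the start of the token.
starts_with = [t for t in tokens if pattern.match(t)]

# search(): the pattern may match anywhere in the token.
contains = [t for t in tokens if pattern.search(t)]

# fullmatch(): the pattern must match the entire token.
exact = [t for t in tokens if pattern.fullmatch(t)]

print(starts_with)  # ['Portfolio', 'port']
print(contains)     # ['Portfolio', 'report', 'port']
print(exact)        # ['port']
```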

This question originally comes from Stack Exchange; it was asked by AnEma.