Help request: split() breaks words apart abnormally when extracting PDF text with PyPDF2
Hey there! Let's break down what's going wrong with your PDF processing code and fix it step by step.
Core Issues Identified
1. Variable Overwriting Causes Data Loss
Looking at your PDF code, you're reusing the textfile variable for two completely different purposes:
    textfile = open(f, 'rb')  # First, it's a PDF file object
    text = PyPDF2.PdfFileReader(textfile)
    for pageNum in range(0, text.numPages):
        texts = text.getPage(pageNum)
        textfile = texts.extractText().split()  # Now it's a list of tokens from the current page!
This leads to two big problems:
- All text from previous pages gets discarded, so only the last page's content is processed
- You lose the reference to the original file object (a bad practice even if it doesn't break things here)
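The overwriting problem can be reproduced without any PDF. Here is a minimal sketch, where the pages list is a hypothetical stand-in for the text PyPDF2 would return for each page:

```python
# Hypothetical stand-in for per-page text; not real PyPDF2 output.
pages = ["first page text", "second page text", "last page"]

# Buggy pattern: the same variable is reassigned on every iteration.
for page_text in pages:
    tokens = page_text.split()
overwritten = tokens  # holds only the final page's tokens

# Fixed pattern: accumulate into one list that outlives the loop.
all_tokens = []
for page_text in pages:
    all_tokens.extend(page_text.split())

print(overwritten)
print(all_tokens)
```

After the first loop, only the last page's tokens survive. The second loop keeps every page.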
2. PyPDF2's Text Extraction Limitations
PyPDF2's extractText() is notoriously unreliable for complex PDF layouts. It often mangles text by splitting words across lines, inserting spurious or non-standard spaces, or leaving hidden control characters. Because split() breaks on every run of whitespace, calling it directly on this messy output turns a single word into fragments like 'w', 'o', 'rld'.
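To see how stray whitespace produces those fragments, here is a small sketch. The raw string is a made-up example of extractor output, not something taken from your PDF:

```python
import re

# Hypothetical extractor output with stray spaces inside "world".
raw = "Hello w o rld\nannual\t report"

# No-argument split: any run of whitespace counts as one separator.
tokens = raw.split()

# Collapsing whitespace first gives the same tokens, since the
# spaces inside the word are still there.
normalized = re.sub(r"\s+", " ", raw).strip()

print(tokens)
print(normalized.split() == tokens)
```

Whitespace handling alone cannot rejoin 'w', 'o', 'rld'. The spaces have to be absent from the extracted text in the first place, which is why the choice of extractor matters.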
Fixes to Try
Fix 1: Correct Variable Logic & Optimize PyPDF2 Text Handling
First, fix the variable overwriting and normalize whitespace before tokenizing. The code keeps the legacy PyPDF2 names from your script. PyPDF2 3.0 and later renamed them (PdfReader, reader.pages, page.extract_text()):

    import string, re, os
    import PyPDF2

    # Read category vocabulary list
    dictfile = open('list.txt')
    lines = dictfile.readlines()
    dictfile.close()

    dic = {}
    scores = {}  # Initialize score dictionary
    current_category = "Default"
    scores[current_category] = 0
    for line in lines:
        if line[0:2] == '>>':
            current_category = line[2:].strip()
            scores[current_category] = 0
        else:
            line = line.strip()
            if len(line) > 0:
                pattern = re.compile(line, re.IGNORECASE)
                dic[pattern] = current_category

    # Process PDF files
    i = 2011
    while i < 2012:
        f = 'annual_report_' + str(i) + '.pdf'
        # Use a dedicated variable for the file object, and with statement for auto-closing
        with open(f, 'rb') as pdf_file:
            pdf_reader = PyPDF2.PdfFileReader(pdf_file)
            all_tokens = []  # Collect tokens from ALL pages
            for page_num in range(pdf_reader.numPages):
                page = pdf_reader.getPage(page_num)
                page_text = page.extractText()
                # Clean messy text: replace all whitespace (newlines, tabs, multiple spaces) with single spaces
                cleaned_text = re.sub(r'\s+', ' ', page_text).strip()
                page_tokens = cleaned_text.split()
                all_tokens.extend(page_tokens)

        # Count vocabulary matches
        for token in all_tokens:
            for pattern in dic.keys():
                if pattern.match(token):
                    categ = dic[pattern]
                    scores[categ] += 1

        print(os.path.basename(f))
        for key in scores.keys():
            print(f"{key}: {scores[key]}")
        i += 1
Key improvements here:
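- Uses a with statement to safely manage the PDF file (auto-closes when done)
- Collects tokens from every page into a single all_tokens list
- Normalizes whitespace before splitting. Note that split() with no arguments already treats any run of whitespace as one separator, so this step tidies the text but cannot rejoin words that the extractor itself broke apart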
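If you want to check the counting logic without a PDF, the vocabulary parsing and scoring can be pulled into two small functions. This is only a sketch: parse_vocabulary, score_tokens, and the sample data are hypothetical names and values chosen for illustration, and the '>>' category format is taken from your list.txt handling.

```python
import re

def parse_vocabulary(lines):
    """Map compiled patterns to categories; '>>' lines start a new category."""
    patterns = {}
    categories = ["Default"]
    current = "Default"
    for line in lines:
        if line.startswith(">>"):
            current = line[2:].strip()
            categories.append(current)
        elif line.strip():
            patterns[re.compile(line.strip(), re.IGNORECASE)] = current
    return patterns, categories

def score_tokens(tokens, patterns, categories):
    """Count matching tokens per category, starting from zeroed scores."""
    scores = {category: 0 for category in categories}
    for token in tokens:
        for pattern, category in patterns.items():
            if pattern.match(token):
                scores[category] += 1
    return scores

# Hypothetical vocabulary lines and tokens, for illustration only.
sample_lines = [">>Risk\n", "risk\n", "loss\n", ">>Growth\n", "growth\n"]
patterns, categories = parse_vocabulary(sample_lines)
scores = score_tokens(
    ["Risk", "of", "loss", "and", "growth", "risks"], patterns, categories
)
print(scores)
```

Building a fresh scores dictionary on each call also keeps counts from carrying over between files if you later widen the loop to several years.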
Fix 2: Switch to a More Reliable PDF Extraction Library (Recommended)
If PyPDF2 still doesn't extract text correctly, use pdfplumber. It's designed to handle complex layouts and extract text more accurately.
First install it:
    pip install pdfplumber
Then modify the PDF processing section:
    import pdfplumber

    # ... Keep the vocabulary reading and score initialization code the same ...

    i = 2011
    while i < 2012:
        f = 'annual_report_' + str(i) + '.pdf'
        all_tokens = []
        with pdfplumber.open(f) as pdf:
            for page in pdf.pages:
                page_text = page.extract_text()
                if page_text:  # Skip empty pages to avoid errors
                    cleaned_text = re.sub(r'\s+', ' ', page_text).strip()
                    page_tokens = cleaned_text.split()
                    all_tokens.extend(page_tokens)

        # ... Keep the vocabulary counting code the same ...

pdfplumber generally does a much better job of preserving word boundaries, which should remove most of the random splitting, though results still depend on how the PDF was produced.
Extra Tip: Refine Your Matching Logic
Your current pattern.match(token) only matches from the start of the word. If you want to match words that contain your target term (not just start with it), use pattern.search(token) instead, or adjust your regex to include wildcards (e.g., re.compile(r'.*' + line + '.*', re.IGNORECASE)).
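A quick comparison of the matching modes, as a sketch. The term "risk" and the token list are made-up examples:

```python
import re

pattern = re.compile("risk", re.IGNORECASE)  # hypothetical vocabulary term
tokens = ["risk", "Risks", "derisking", "brisk"]

# match(): anchored at the start of the token.
starts_with = [t for t in tokens if pattern.match(t)]

# search(): the term may appear anywhere in the token.
contains = [t for t in tokens if pattern.search(t)]

# fullmatch(): the whole token must equal the term.
exact = [t for t in tokens if pattern.fullmatch(t)]

print(starts_with)
print(contains)
print(exact)
```

Which one fits depends on whether 'Risks' or 'brisk' should count toward a category. fullmatch() is the strictest option if only exact vocabulary words should score.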
The question comes from Stack Exchange; asked by AnEma.




