How do I parse large DOCX files in Python and extract keywords/strings that appear N times?
Hey there! Parsing large, structurally variable DOCX files to build a word/string frequency database is totally doable with the right Python tools and strategies. Let’s walk through everything you need to know.
- python-docx: The go-to library for working with DOCX files. It lets you access not just raw text, but also document components like paragraphs, tables, and headers/footers (reachable through `doc.sections`). Perfect if you need to handle structured content (e.g., extracting text from table cells that might hold critical terms). Note that it has no high-level API for text boxes; their text only lives in the underlying XML.
  - Pro tip: Use `doc.paragraphs` to iterate through regular text, and loop through `doc.tables` to extract content from table cells with `table.cell(row_idx, col_idx).text`.
- python-docx2txt (installed from PyPI as `docx2txt`): A lighter alternative that skips structural details and directly extracts all plain text from a DOCX file with `docx2txt.process(path)`. Great if you don’t care about where the text comes from and just need a quick dump of content.
- textract: A more heavy-duty option that can handle not just DOCX, but other formats too (PDF, Excel, etc.). It’s useful if your files have embedded objects or complex formatting, but note that it requires extra dependencies (like Poppler for PDF handling) to install.
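If none of these libraries reaches a piece of text (text boxes are the usual culprit), one fallback is to read the XML inside the DOCX yourself, since a DOCX file is a ZIP archive of XML parts. Here is a rough standard-library sketch of that idea, not a replacement for the libraries above. The function names are made up for illustration, and it assumes the usual part names (`word/document.xml`, `word/header1.xml`, and so on), which most files use but which are not guaranteed.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
W_P, W_T = W_NS + "p", W_NS + "t"  # paragraph and text-run elements
# Word stores a legacy copy of each text box in mc:Fallback; skip it to avoid double counting.
MC_FALLBACK = "{http://schemas.openxmlformats.org/markup-compatibility/2006}Fallback"
# Assumption: the body, headers, and footers live under these conventional part names.
TEXT_PARTS = re.compile(r"word/(document|header\d*|footer\d*)\.xml$")


def _own_text(node, nested):
    """Join the w:t text under node; paragraphs nested inside it are queued in `nested`."""
    parts = []
    for child in node:
        if child.tag == MC_FALLBACK:
            continue
        if child.tag == W_P:
            nested.append(child)  # e.g. a paragraph inside a text box
        elif child.tag == W_T:
            parts.append(child.text or "")
        else:
            parts.append(_own_text(child, nested))
    return "".join(parts)


def _paragraph(para):
    """Yield the paragraph's own text, then the text of any paragraphs nested in it."""
    nested = []
    text = _own_text(para, nested)
    if text:
        yield text
    for inner in nested:
        yield from _paragraph(inner)


def iter_paragraph_text(node):
    """Walk the tree and yield the text of every paragraph, including table cells."""
    for child in node:
        if child.tag == MC_FALLBACK:
            continue
        if child.tag == W_P:
            yield from _paragraph(child)
        else:
            yield from iter_paragraph_text(child)


def extract_raw_text(docx_file):
    """Return one string per paragraph; docx_file is a path or a binary file object."""
    paragraphs = []
    with zipfile.ZipFile(docx_file) as archive:
        for name in archive.namelist():
            if TEXT_PARTS.match(name):
                root = ET.fromstring(archive.read(name))
                paragraphs.extend(iter_paragraph_text(root))
    return paragraphs
```

Calling `extract_raw_text("report.docx")` (a placeholder file name) returns a list of paragraph strings you can feed into the cleaning step. Tabs and manual line breaks are ignored in this sketch, so words separated only by those can end up fused.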
Since your files don’t have a fixed structure, you’ll need a robust way to capture all possible text sources:
- Traverse all content containers: Don’t just stop at paragraphs. Make sure to check tables, headers, and footers (python-docx exposes the last two per section via `section.header` and `section.footer`). Text boxes are the awkward case: `doc.inline_shapes` only covers inline pictures, so text-box content has to be read from the underlying XML.
- Clean your text consistently:
  - Normalize case: Convert all text to lowercase (or uppercase) to avoid counting "Word" and "word" as separate entries.
  - Remove noise: Strip out special characters, extra whitespace, and non-printable characters using regex (e.g., `re.sub(r'[^\w\s]', '', text)` to keep words and spaces).
  - Handle hyphenation: If your documents have hyphenated words split across lines (e.g., "soft-\nware"), use regex to join them into a single word. Do this before stripping punctuation, while the hyphen is still there to match on.
- Filter irrelevant terms: Use a stopword list (from libraries like `nltk` or `spaCy`) to exclude common words like "the", "and", or "is" that don’t add value to your frequency analysis. For custom terms you want to exclude, add them to your own stopword set.
- Flexible string matching: If you’re looking for specific multi-word strings or patterns (not just single words), use regex to identify and extract them. For example, `re.findall(r'\b[A-Z]{2,}-\d{3}\b', text)` could capture codes like "ABC-123". Run case-sensitive patterns like this on the raw text, since lowercasing and punctuation removal would destroy the match.
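To make the cleaning steps concrete, here is a small sketch using only `re`. The stopword set is a tiny stand-in (in practice you would plug in the list from `nltk` or `spaCy`), and the function names and the code pattern are illustrative assumptions, not part of any library.

```python
import re

# Tiny illustrative stopword set; swap in a real list for actual use.
STOPWORDS = {"the", "and", "is", "a", "of", "to", "in"}
# Example pattern for codes such as "ABC-123"; adjust it to your own identifiers.
CODE_PATTERN = re.compile(r"\b[A-Z]{2,}-\d{3}\b")


def extract_codes(text):
    """Find code-like strings on the raw text, before normalization destroys them."""
    return CODE_PATTERN.findall(text)


def tokenize(text, stopwords=STOPWORDS):
    """Re-join hyphenated line breaks, lowercase, drop punctuation and stopwords."""
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)  # "soft-\nware" -> "software"
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)  # same punctuation rule as mentioned above
    return [word for word in text.split() if word not in stopwords]


sample = "The soft-\nware module ABC-123 is stable, and the Module passes."
codes = extract_codes(sample)   # ['ABC-123']
tokens = tokenize(sample)       # ['software', 'module', 'abc123', 'stable', 'module', 'passes']
```

Notice that the code "ABC-123" collapses to "abc123" in the token list, which is why pattern extraction runs on the untouched text first.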
Once you have clean text, here’s how to build your database:
- Count frequencies: Use Python’s built-in `collections.Counter` to tally word/string occurrences. For multi-file analysis, track counts per document as well as total counts.
- Choose a database:
  - SQLite: Ideal for small to medium-sized datasets. It’s file-based, no server required, and integrates seamlessly with Python’s `sqlite3` module. You can create a table with columns like `term`, `total_frequency`, and `source_files` (to track which documents the term appears in).
  - Pandas + CSV/Parquet: If you want to prototype quickly, use Pandas to store frequency data in a DataFrame, then export it to CSV or Parquet. Later, you can easily import this into a relational database if needed.
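Since the question is about terms that appear N times, here is a quick sketch of the counting step with `collections.Counter`, keeping per-document and total counts side by side. The token lists and the helper name `terms_with_count` are made up for illustration.

```python
from collections import Counter

# Hypothetical, already-cleaned token lists keyed by file name.
tokens_by_doc = {
    "doc1.docx": ["invoice", "total", "invoice", "tax"],
    "doc2.docx": ["invoice", "tax", "refund"],
}

per_doc = {name: Counter(tokens) for name, tokens in tokens_by_doc.items()}
total = Counter()
for counts in per_doc.values():
    total.update(counts)  # update() adds to existing counts instead of replacing them


def terms_with_count(counter, n, at_least=False):
    """Return the terms seen exactly n times (or n or more with at_least=True), sorted."""
    if at_least:
        return sorted(term for term, count in counter.items() if count >= n)
    return sorted(term for term, count in counter.items() if count == n)


exactly_two = terms_with_count(total, 2)                 # ['tax']
two_or_more = terms_with_count(total, 2, at_least=True)  # ['invoice', 'tax']
```

The same helper works on any single entry of `per_doc` if you want the N-times terms for one file only.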
- Example database setup:

      import sqlite3

      conn = sqlite3.connect('term_frequencies.db')
      cursor = conn.cursor()
      cursor.execute('''
          CREATE TABLE IF NOT EXISTS term_counts (
              term TEXT PRIMARY KEY,
              total_count INTEGER NOT NULL,
              source_docs TEXT
          )
      ''')
      conn.commit()

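One possible variation, in case you need per-document numbers later: the `source_docs` column stores file names as a single text value, which is hard to filter on. The sketch below uses one row per (term, document) pair instead and lets SQL find the terms whose total is exactly N. The table name, column names, and sample rows are assumptions for illustration, and it runs against an in-memory database so it doesn’t touch your real file.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
conn.execute("""
    CREATE TABLE term_doc_counts (
        term TEXT NOT NULL,
        doc  TEXT NOT NULL,
        freq INTEGER NOT NULL,
        PRIMARY KEY (term, doc)
    )
""")
# Hypothetical sample data: one row per term per document.
rows = [
    ("invoice", "doc1.docx", 2),
    ("invoice", "doc2.docx", 1),
    ("tax", "doc1.docx", 1),
    ("tax", "doc2.docx", 1),
    ("refund", "doc2.docx", 1),
]
conn.executemany("INSERT INTO term_doc_counts VALUES (?, ?, ?)", rows)


def terms_appearing(connection, n):
    """Return (term, total, number_of_docs) for terms whose total count is exactly n."""
    query = """
        SELECT term, SUM(freq) AS total, COUNT(*) AS docs
        FROM term_doc_counts
        GROUP BY term
        HAVING SUM(freq) = ?
        ORDER BY term
    """
    return connection.execute(query, (n,)).fetchall()


result = terms_appearing(conn, 2)  # [('tax', 2, 2)]
conn.close()
```

Changing `= ?` to `>= ?` in the `HAVING` clause gives you "at least N" instead of "exactly N".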
Here’s a snippet that ties it all together using python-docx and collections.Counter:

    import re
    import sqlite3
    from collections import Counter

    import docx  # provided by the python-docx package


    def clean_text(text):
        """Lowercase the text, remove special characters, and split it into words."""
        text = text.lower().strip()
        text = re.sub(r'[^\w\s]', '', text)
        # Split into words (adjust if you need multi-word strings)
        return text.split()


    def process_docx(file_path):
        """Return (Counter of words, file_path) for one DOCX file."""
        doc = docx.Document(file_path)
        all_words = []
        # Process paragraphs
        for para in doc.paragraphs:
            all_words.extend(clean_text(para.text))
        # Process tables (a merged cell shows up once per grid position it spans)
        for table in doc.tables:
            for row in table.rows:
                for cell in row.cells:
                    all_words.extend(clean_text(cell.text))
        # Return Counter for this file
        return Counter(all_words), file_path


    # Process multiple files
    file_list = ['doc1.docx', 'doc2.docx', 'doc3.docx']
    total_counter = Counter()
    doc_term_map = {}
    for file in file_list:
        file_counter, fname = process_docx(file)
        total_counter.update(file_counter)
        # Track which docs each term appears in
        for term in file_counter:
            doc_term_map.setdefault(term, []).append(fname)

    # Insert into SQLite database
    conn = sqlite3.connect('term_frequencies.db')
    cursor = conn.cursor()
    # Create the table in case the setup step has not been run yet
    cursor.execute('''
        CREATE TABLE IF NOT EXISTS term_counts (
            term TEXT PRIMARY KEY,
            total_count INTEGER NOT NULL,
            source_docs TEXT
        )
    ''')
    for term, count in total_counter.items():
        source_docs = ', '.join(doc_term_map[term])
        cursor.execute('''
            INSERT OR REPLACE INTO term_counts (term, total_count, source_docs)
            VALUES (?, ?, ?)
        ''', (term, count, source_docs))
    conn.commit()
    conn.close()

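Final notes: If you’re dealing with extremely large DOCX files (100k+ words), consider processing them in chunks to avoid memory issues, for example by updating the Counter one paragraph at a time instead of collecting every word in a single list first. Also, for multi-word string frequency, adjust the clean_text function to use regex matching (or count word n-grams) instead of splitting on whitespace.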
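Here is one way those two notes could look in code: a single Counter updated chunk by chunk, with an optional n-gram size for multi-word strings. The function names are hypothetical, and the list of strings stands in for paragraph text (with python-docx that could be a generator such as `(para.text for para in doc.paragraphs)`). Keep in mind that python-docx still loads the whole document, so this only avoids holding a second full copy as a word list.

```python
import re
from collections import Counter


def ngrams(words, n):
    """Yield every run of n consecutive words as one space-joined string."""
    return (" ".join(gram) for gram in zip(*(words[i:] for i in range(n))))


def count_streaming(chunks, n=1):
    """Update one Counter chunk by chunk instead of building a full word list first."""
    counts = Counter()
    for chunk in chunks:
        words = re.findall(r"\w+", chunk.lower())
        counts.update(ngrams(words, n))
    return counts


# Stand-in for paragraph texts pulled from a document.
paragraphs = ["Net income rose.", "Net income fell; gross income rose."]
unigrams = count_streaming(paragraphs)        # single words
bigrams = count_streaming(paragraphs, n=2)    # two-word strings
```

In this sketch n-grams never span two chunks, so a phrase split across a paragraph boundary is not counted; that is usually what you want for paragraphs, less so if you chunk by fixed size.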
This question comes from Stack Exchange; it was asked by micshapicsha.




