How can I parse large DOCX files with Python and extract keywords/strings that occur N times?

Hey there! Parsing large, structurally variable DOCX files to build a word/string frequency database is totally doable with the right Python tools and strategies. Let’s walk through everything you need to know.

1. Essential Python Libraries for DOCX Processing
  • python-docx: The go-to library for working with DOCX files. It gives you access not just to raw text but also to document components such as paragraphs, tables, and headers/footers (reached through doc.sections). It has no dedicated API for text boxes, so their content has to be read from the underlying XML. A good fit if you need to handle structured content (e.g., extracting text from table cells that might hold critical terms).
    • Pro tip: Iterate over doc.paragraphs for regular body text, and loop through doc.tables to extract content from table cells, either cell by cell via row.cells or directly with table.cell(row_idx, col_idx).text.
  • docx2txt (GitHub project name python-docx2txt; installed as docx2txt): A lighter alternative that skips structural details and directly extracts all plain text from DOCX files. Great if you don’t care about where the text comes from and just need a quick dump of content.
  • textract: A more heavy-duty option that can handle not just DOCX, but other formats too (PDF, Excel, etc.). It’s useful if your files have embedded objects or complex formatting, but note that it requires extra dependencies (like Poppler for PDF handling) to install.
2. Strategies for Variable Document Structures

Since your files don’t have a fixed structure, you’ll need a robust way to capture all possible text sources:

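  • Traverse all content containers: Don’t stop at paragraphs. Also check tables, headers, and footers (available on each section as section.header and section.footer). Text boxes are not exposed by python-docx (doc.inline_shapes only covers inline pictures), so reaching their text means walking the document XML for w:txbxContent elements.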
  • Clean your text consistently:
    • Normalize case: Convert all text to lowercase (or uppercase) to avoid counting "Word" and "word" as separate entries.
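    • Remove noise: Strip out special characters, extra whitespace, and non-printable characters using regex (e.g., re.sub(r'[^\w\s]', '', text) keeps word characters and whitespace). Note that this also deletes hyphens, so "ABC-123" becomes "ABC123"; substitute a space instead of an empty string if you would rather split such tokens.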
    • Handle hyphenation: If your documents have hyphenated words split across lines (e.g., "soft-\nware"), use regex to join them into a single word.
  • Filter irrelevant terms: Use a stopword list (from libraries like nltk or spaCy) to exclude common words like "the", "and", or "is" that don’t add value to your frequency analysis. For custom terms you want to exclude, add them to your own stopword set.
  • Flexible string matching: If you’re looking for specific multi-word strings or patterns (not just single words), use regex to identify and extract them. For example, re.findall(r'\b[A-Z]{2,}-\d{3}\b', text) could capture codes like "ABC-123". Run patterns like this on the raw text, before lowercasing and punctuation removal, or they will no longer match.
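The cleaning steps above could be sketched roughly as follows. The stopword set is a tiny placeholder (a real project might load one from nltk or spaCy), the names STOPWORDS, CODE_PATTERN, extract_codes, and tokenize are invented for this example, and punctuation is replaced with a space rather than deleted so that neighbouring words do not fuse.

```python
import re

# Tiny illustrative stopword set; swap in a fuller list for real use.
STOPWORDS = {"the", "and", "is", "a", "of", "to", "in"}

# Example pattern for codes such as "ABC-123".
CODE_PATTERN = re.compile(r"\b[A-Z]{2,}-\d{3}\b")


def extract_codes(raw_text):
    """Find code-like strings before any case folding or punctuation removal."""
    return CODE_PATTERN.findall(raw_text)


def tokenize(raw_text, stopwords=STOPWORDS):
    """Return lowercase word tokens with hyphenated line breaks rejoined."""
    # Join words split across a line break: "soft-\nware" -> "software".
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", raw_text)
    text = text.lower()
    # Replace punctuation with a space so "ready.See" does not become one token.
    text = re.sub(r"[^\w\s]", " ", text)
    return [tok for tok in text.split() if tok not in stopwords]
```

Calling extract_codes on the raw text and tokenize on the same text keeps pattern matches and plain word counts separate, so one step cannot destroy the input the other needs.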
3. Building the Frequency Database

Once you have clean text, here’s how to build your database:

  • Count frequencies: Use Python’s built-in collections.Counter to tally word/string occurrences. For multi-file analysis, track counts per document as well as total counts.
  • Choose a database:
    • SQLite: Ideal for small to medium-sized datasets. It’s file-based, needs no server, and integrates seamlessly with Python’s sqlite3 module. You can create a table with columns like term, total_count, and source_docs (to track which documents the term appears in).
    • Pandas + CSV/Parquet: If you want to prototype quickly, use Pandas to store frequency data in a DataFrame, then export it to CSV or Parquet. Later, you can easily import this into a relational database if needed.
  • Example database setup:
    import sqlite3
    
    conn = sqlite3.connect('term_frequencies.db')
    cursor = conn.cursor()
    cursor.execute('''
        CREATE TABLE IF NOT EXISTS term_counts (
            term TEXT PRIMARY KEY,
            total_count INTEGER NOT NULL,
            source_docs TEXT
        )
    ''')
    conn.commit()
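
If you also want counts per document, as suggested above, one option is a second table keyed by term and document. This is only a sketch: the table term_doc_counts, its columns, and both helper functions are hypothetical names chosen for the example.

```python
import sqlite3
from collections import Counter


def store_doc_counts(conn, doc_name, counter):
    """Store one row per (term, document) pair."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS term_doc_counts (
            term TEXT NOT NULL,
            doc TEXT NOT NULL,
            occurrences INTEGER NOT NULL,
            PRIMARY KEY (term, doc)
        )
        """
    )
    # Re-processing the same document replaces its rows instead of duplicating them.
    conn.executemany(
        "INSERT OR REPLACE INTO term_doc_counts (term, doc, occurrences) "
        "VALUES (?, ?, ?)",
        [(term, doc_name, n) for term, n in counter.items()],
    )
    conn.commit()


def total_counts(conn):
    """Derive corpus-wide totals from the per-document rows."""
    rows = conn.execute(
        "SELECT term, SUM(occurrences) FROM term_doc_counts GROUP BY term"
    )
    return dict(rows)
```

With this layout the totals are computed on demand, so there is no comma-separated source_docs string to keep in sync.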
    
4. Quick Example Workflow

Here’s a snippet that ties it all together using python-docx and collections.Counter:

import docx
from collections import Counter
import re
import sqlite3

def clean_text(text):
    # Lowercase, remove special chars, strip whitespace
    text = text.lower().strip()
    text = re.sub(r'[^\w\s]', '', text)
    # Split into words (adjust if you need multi-word strings)
    return text.split()

def process_docx(file_path):
    doc = docx.Document(file_path)
    all_words = []
    
    # Process paragraphs
    for para in doc.paragraphs:
        all_words.extend(clean_text(para.text))
    
    # Process tables
    for table in doc.tables:
        for row in table.rows:
            for cell in row.cells:
                all_words.extend(clean_text(cell.text))
    
    # Return Counter for this file
    return Counter(all_words), file_path

# Process multiple files
file_list = ['doc1.docx', 'doc2.docx', 'doc3.docx']
total_counter = Counter()
doc_term_map = {}

for file in file_list:
    file_counter, fname = process_docx(file)
    total_counter.update(file_counter)
    # Track which docs each term appears in
    for term in file_counter:
        if term not in doc_term_map:
            doc_term_map[term] = []
        doc_term_map[term].append(fname)

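# Insert into SQLite database
conn = sqlite3.connect('term_frequencies.db')
cursor = conn.cursor()

# Create the table here too, so the script also works on a fresh database
cursor.execute('''
    CREATE TABLE IF NOT EXISTS term_counts (
        term TEXT PRIMARY KEY,
        total_count INTEGER NOT NULL,
        source_docs TEXT
    )
''')

# INSERT OR REPLACE overwrites any row for the same term left by an earlier run
for term, count in total_counter.items():
    source_docs = ', '.join(doc_term_map[term])
    cursor.execute('''
        INSERT OR REPLACE INTO term_counts (term, total_count, source_docs)
        VALUES (?, ?, ?)
    ''', (term, count, source_docs))

conn.commit()
conn.close()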
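
To get back to the original question (terms that occur N times), a small filter over the Counter or a query against the term_counts table would do. The function names here are invented for the example, and whether "N times" means exactly N or at least N is left as a parameter.

```python
import sqlite3
from collections import Counter


def terms_with_count(counter, n, at_least=False):
    """Return terms whose count equals n (or is >= n), sorted alphabetically."""
    if at_least:
        return sorted(term for term, c in counter.items() if c >= n)
    return sorted(term for term, c in counter.items() if c == n)


def terms_with_count_sql(conn, n):
    """Same lookup for exactly n, against the term_counts table."""
    rows = conn.execute(
        "SELECT term FROM term_counts WHERE total_count = ? ORDER BY term",
        (n,),
    )
    return [term for (term,) in rows]
```

For example, terms_with_count(total_counter, 3) would list every term seen exactly three times across all processed files.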

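Final notes: python-docx loads the whole document into memory, so if you’re dealing with extremely large DOCX files (100k+ words), consider processing the text in chunks, for example by updating the Counter one paragraph at a time instead of collecting every word in a list first. Also, for multi-word string frequency, adjust the clean_text function to use regex matching instead of splitting on whitespace.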
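
One way to process a very large file in chunks, sketched here with the standard library only rather than python-docx, is to stream word/document.xml out of the DOCX archive and handle one paragraph at a time. The function names are made up for this example. It ignores tabs and line breaks inside a paragraph, skips headers and footers (they live in separate XML parts), and may count text box content twice when Word stores a fallback copy.

```python
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W_P = f"{{{W_NS}}}p"  # paragraph element
W_T = f"{{{W_NS}}}t"  # text run content


def iter_paragraph_text(docx_source):
    """Yield the text of each paragraph while parsing incrementally."""
    with zipfile.ZipFile(docx_source) as archive:
        with archive.open("word/document.xml") as xml_file:
            for _event, elem in ET.iterparse(xml_file, events=("end",)):
                if elem.tag == W_P:
                    text = "".join(t.text or "" for t in elem.iter(W_T))
                    if text:
                        yield text
                    # Drop the paragraph's children to keep memory use low.
                    elem.clear()


def count_words_streaming(docx_source):
    """Count lowercase whitespace-separated words paragraph by paragraph."""
    counts = Counter()
    for text in iter_paragraph_text(docx_source):
        counts.update(text.lower().split())
    return counts
```

Because the Counter is updated per paragraph, peak memory depends mostly on the number of distinct terms rather than on the length of the document.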

The question comes from Stack Exchange; asked by micshapicsha.