How do I split a large .docx file into multiple smaller .docx files based on a repeated keyword?

阿华AIGC实验室

2026-5-25

Split Large .docx File into Smaller Files Using Repeated Keyword

I’ll help you complete the solution to split your .docx file at each occurrence of the repeating ABCD: keyword. We’ll use the python-docx library, which can both read and write Word documents, so no other tool is needed for this task.

Step 1: Install Required Library

First, make sure you have python-docx installed. If not, run this command in your terminal:

pip install python-docx

Step 2: Complete the Split Function

Here’s the full, working implementation with comments to walk you through each step:

from docx import Document

def split_doc_by_keyword(input_doc_path, keyword="ABCD:"):
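    """
    Splits a large .docx file into smaller files, each starting at a paragraph that begins with the keyword.
    Files are named after the keyword (ABCD1.docx, ABCD2.docx, etc.) and saved in the current working directory.
    Only body paragraphs are processed; anything before the first keyword is skipped.
    """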
    # Load the original document
    original_doc = Document(input_doc_path)
    
    sections = []
    current_section = []
    
    # Iterate through every paragraph in the original file
    for para in original_doc.paragraphs:
        # Check if this paragraph starts with our keyword (ignoring leading/trailing spaces)
        if para.text.strip().startswith(keyword):
            # If we're already building a section, save it to the list first
            if current_section:
                sections.append(current_section)
                current_section = []
            # Start a new section with this paragraph
            current_section.append(para)
        elif current_section:
            # Inside a section: add this paragraph to it.
            # Paragraphs that appear before the first keyword are skipped.
            current_section.append(para)
    
    # Add the final section after the loop ends
    if current_section:
        sections.append(current_section)
    
    # Write each section to a new .docx file
    base_name = keyword.strip(':').strip()
    for idx, section in enumerate(sections, start=1):
        new_doc = Document()
        # Copy each paragraph's text and paragraph style to the new file
        for para in section:
            new_para = new_doc.add_paragraph()
            new_para.text = para.text
            # Reuse the paragraph style. This does not copy run-level formatting
            # (bold, italic, or font size applied to individual words).
            new_para.style = para.style
        
        # Save the new document
        file_name = f"{base_name}{idx}.docx"
        new_doc.save(file_name)
        print(f"Successfully created: {file_name}")

# Example usage (replace with your actual file path)
if __name__ == "__main__":
    split_doc_by_keyword("your_large_file.docx")

How It Works:

Reading the Original File: We load the input .docx and loop through every paragraph.
Grouping Sections: Every time we hit a paragraph starting with ABCD:, we wrap up the current section (if one exists) and start a new one. All subsequent paragraphs are added to this section until the next ABCD: is found.
Writing New Files: Each section is saved to a separate .docx file with a sequential name (ABCD1, ABCD2, etc.). Each paragraph's text and paragraph style are copied; formatting applied to individual words, such as bold or italic, is not carried over by this code.
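
Character-level formatting such as bold or italic is stored on the runs inside each paragraph, so copying the text and paragraph style alone does not reproduce it. Here is a minimal sketch of one way to carry it over. The helper name copy_paragraph is hypothetical (it is not part of python-docx), and it relies only on attributes python-docx exposes on paragraphs and runs: runs, add_run, bold, italic, underline, font.size, and font.name.

```python
def copy_paragraph(src_para, dst_doc):
    """Append a run-by-run copy of src_para to dst_doc and return the new paragraph.

    Hypothetical helper, not a python-docx function. It copies only a small
    set of character properties; extend the list if your files need more.
    """
    dst_para = dst_doc.add_paragraph()
    # Same style assignment as in the main function
    dst_para.style = src_para.style
    # alignment is None when the paragraph inherits it from its style
    dst_para.alignment = src_para.alignment

    for run in src_para.runs:
        new_run = dst_para.add_run(run.text)
        # bold/italic/underline are tri-state: True, False, or None (inherit)
        new_run.bold = run.bold
        new_run.italic = run.italic
        new_run.underline = run.underline
        # size and name are None when the run inherits them
        new_run.font.size = run.font.size
        new_run.font.name = run.font.name

    return dst_para
```

Inside split_doc_by_keyword, the body of the inner loop could then be replaced with copy_paragraph(para, new_doc). One caveat, as far as I know: text inside hyperlinks is not part of paragraph.runs, so a paragraph containing links may lose that text with this approach.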

Quick Adjustments You Might Need:

Case Insensitivity: If your keyword might appear in different cases (e.g., abcd: or AbCd:), change the check to para.text.strip().upper().startswith(keyword.upper()).
Handling Complex Content: Tables are not included in original_doc.paragraphs, and images are lost because only paragraph text is copied. If your document relies on them, you’ll need to extend the code to copy those elements too. Let me know if you need help with that!
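
The case-insensitive check can be tried on plain strings before running it against a real document. The sketch below uses two hypothetical helpers, starts_with_keyword and group_by_keyword, that mirror the matching and grouping logic of the main function. It uses str.casefold() instead of upper(), which is a swap on my part: casefold() is the standard-library method intended for caseless comparison.

```python
def starts_with_keyword(text, keyword="ABCD:", ignore_case=False):
    """Return True if text, ignoring surrounding whitespace, begins with keyword."""
    text = text.strip()
    if ignore_case:
        # casefold() is a more aggressive lower() meant for caseless matching
        return text.casefold().startswith(keyword.casefold())
    return text.startswith(keyword)


def group_by_keyword(lines, keyword="ABCD:", ignore_case=False):
    """Group lines into sections, each beginning at a line that starts with keyword.

    Lines before the first keyword are dropped, as in the main function.
    """
    sections = []
    current = []
    for line in lines:
        if starts_with_keyword(line, keyword, ignore_case):
            # A new keyword closes the section in progress, if any
            if current:
                sections.append(current)
            current = [line]
        elif current:
            current.append(line)
    # Keep the last section once the input is exhausted
    if current:
        sections.append(current)
    return sections


sample = ["preface", "ABCD: first", "detail", "  abcd: second", "more"]
print(group_by_keyword(sample))                    # one section: "abcd:" is not matched
print(group_by_keyword(sample, ignore_case=True))  # two sections
```

With ignore_case left at False, the lowercase "abcd:" line is treated as ordinary content of the first section, which is the behavior of the original check.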

The question comes from Stack Exchange; it was asked by Vaibhav Tyagi.