
How do I split a large .docx file into multiple smaller .docx files based on a repeated keyword?

Split Large .docx File into Smaller Files Using Repeated Keyword

Here is a complete solution for splitting your .docx file at every paragraph that starts with the repeating ABCD: keyword. We'll use the python-docx library, which reads and writes Word documents directly from Python.

Step 1: Install Required Library

First, make sure you have python-docx installed. If not, run this command in your terminal:

pip install python-docx

Step 2: Complete the Split Function

Here’s the full, working implementation with comments to walk you through each step:

from docx import Document

def split_doc_by_keyword(input_doc_path, keyword="ABCD:"):
    """
    Splits a large .docx file into smaller files where each new file starts with the target keyword.
    Files are named ABCD1.docx, ABCD2.docx, etc., preserving original paragraph formatting.
    """
    # Load the original document
    original_doc = Document(input_doc_path)
    
    sections = []
    current_section = []
    
    # Iterate through every paragraph in the original file
    for para in original_doc.paragraphs:
        # Check if this paragraph starts with our keyword (ignoring leading/trailing spaces)
        if para.text.strip().startswith(keyword):
            # If we're already building a section, save it to the list first
            if current_section:
                sections.append(current_section)
                current_section = []
            # Start a new section with this paragraph
            current_section.append(para)
        else:
            # If we're in the middle of a section, add this paragraph to it
            if current_section:
                current_section.append(para)
    
    # Add the final section after the loop ends
    if current_section:
        sections.append(current_section)
    
    # Write each section to a new .docx file
    base_name = keyword.strip(':').strip()
    for idx, section in enumerate(sections, start=1):
        new_doc = Document()
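        # Copy each paragraph run by run so basic inline formatting survives
        for para in section:
            new_para = new_doc.add_paragraph()
            # Reuse the paragraph style; a style that is not defined in the
            # new document's default template may fall back to the default
            new_para.style = para.style
            for run in para.runs:
                new_run = new_para.add_run(run.text)
                # Carry over common character formatting (None means "inherit")
                new_run.bold = run.bold
                new_run.italic = run.italic
                new_run.underline = run.underline
                new_run.font.size = run.font.size
                new_run.font.name = run.font.name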
        
        # Save the new document
        file_name = f"{base_name}{idx}.docx"
        new_doc.save(file_name)
        print(f"Successfully created: {file_name}")

# Example usage (replace with your actual file path)
if __name__ == "__main__":
    split_doc_by_keyword("your_large_file.docx")

How It Works:

  1. Reading the Original File: We load the input .docx and loop through every paragraph.
  2. Grouping Sections: Every time we hit a paragraph starting with ABCD:, we wrap up the current section (if one exists) and start a new one. All subsequent paragraphs are added to this section until the next ABCD: is found. Paragraphs that appear before the first ABCD: belong to no section and are skipped.
  3. Writing New Files: Each section is saved to a separate .docx file in the current working directory with a sequential name (ABCD1.docx, ABCD2.docx, etc.). Each paragraph's style is copied over so the output keeps the look of your original content.
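
The grouping step described above can be tried out on plain strings before touching a real document. This is a minimal sketch, not part of the original answer: the name group_by_keyword and the sample lines are hypothetical, and it assumes each string stands in for one paragraph's text.

```python
from typing import Iterable, List


def group_by_keyword(lines: Iterable[str], keyword: str = "ABCD:") -> List[List[str]]:
    """Group lines into sections, each starting at a line that begins with keyword."""
    sections: List[List[str]] = []
    current: List[str] = []
    for line in lines:
        if line.strip().startswith(keyword):
            # A keyword line closes the section in progress and opens a new one
            if current:
                sections.append(current)
            current = [line]
        elif current:
            # Ordinary lines join the section that is currently open
            current.append(line)
        # Lines seen before the first keyword are skipped
    if current:
        sections.append(current)
    return sections


# Hypothetical sample data standing in for paragraph text
sample = ["preamble", "ABCD: first", "body a", "ABCD: second", "body b", "body c"]
for number, section in enumerate(group_by_keyword(sample), start=1):
    print(f"ABCD{number}.docx <- {section}")
```

With this sample, the "preamble" line is dropped and two sections are produced, one per ABCD: line.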

Quick Adjustments You Might Need:

  • Case Insensitivity: If your keyword might appear in different cases (e.g., abcd: or AbCd:), change the check to para.text.strip().upper().startswith(keyword.upper()).
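  • Handling Complex Content: original_doc.paragraphs returns only body-level paragraphs, so text inside tables is skipped, and copying paragraph text does not carry over images. Bulleted or numbered lists may also lose their list formatting. If your document includes any of these, you'll need to extend the code to copy those elements too.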
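
For the case-insensitive adjustment, one option is to pull the check into a small helper so the matching rule lives in one place. This is only a sketch: starts_with_keyword and its ignore_case parameter are hypothetical names, and it uses str.casefold() as a variant of the upper() comparison suggested above.

```python
def starts_with_keyword(text: str, keyword: str = "ABCD:", ignore_case: bool = True) -> bool:
    """Return True if text, ignoring surrounding whitespace, begins with keyword."""
    candidate = text.strip()
    if ignore_case:
        # casefold() is a more aggressive lower() intended for caseless comparison
        return candidate.casefold().startswith(keyword.casefold())
    return candidate.startswith(keyword)


# Hypothetical paragraph texts; only the last one should be rejected
for text in ["ABCD: intro", "  abcd: indented", "AbCd:mixed", "See ABCD: later"]:
    print(repr(text), starts_with_keyword(text))
```

Inside split_doc_by_keyword, the condition would then become starts_with_keyword(para.text, keyword).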

The question comes from Stack Exchange; the original asker is Vaibhav Tyagi.
