How do I split a large .docx file into multiple smaller .docx files based on a repeated keyword?
Split Large .docx File into Smaller Files Using Repeated Keyword
I’ll help you complete the solution to split your .docx file at each occurrence of the repeating ABCD: keyword. We’ll use the python-docx library to read and write Word documents, which is a straightforward tool for this task.
Step 1: Install Required Library
First, make sure you have python-docx installed. If not, run this command in your terminal:
pip install python-docx
Step 2: Complete the Split Function
Here’s the full, working implementation with comments to walk you through each step:
    from docx import Document


    def split_doc_by_keyword(input_doc_path, keyword="ABCD:"):
        """
        Splits a large .docx file into smaller files where each new file
        starts with the target keyword. Files are named ABCD1.docx,
        ABCD2.docx, etc., keeping paragraph styles and basic run formatting.
        """
        # Load the original document
        original_doc = Document(input_doc_path)
        sections = []
        current_section = []

        # Iterate through every paragraph in the original file
        for para in original_doc.paragraphs:
            # Check if this paragraph starts with our keyword (ignoring leading/trailing spaces)
            if para.text.strip().startswith(keyword):
                # If we're already building a section, save it to the list first
                if current_section:
                    sections.append(current_section)
                    current_section = []
                # Start a new section with this paragraph
                current_section.append(para)
            else:
                # If we're in the middle of a section, add this paragraph to it
                # (paragraphs before the first keyword are skipped)
                if current_section:
                    current_section.append(para)

        # Add the final section after the loop ends
        if current_section:
            sections.append(current_section)

        # Write each section to a new .docx file
        base_name = keyword.strip(':').strip()
        for idx, section in enumerate(sections, start=1):
            new_doc = Document()
            for para in section:
                # Reuse the paragraph style (e.g. Normal, Heading 1)
                new_para = new_doc.add_paragraph()
                new_para.style = para.style
                # Copy run by run so bold, italic, underline, and font settings
                # carry over; setting new_para.text alone would drop them
                for run in para.runs:
                    new_run = new_para.add_run(run.text)
                    new_run.bold = run.bold
                    new_run.italic = run.italic
                    new_run.underline = run.underline
                    new_run.font.size = run.font.size
                    new_run.font.name = run.font.name
            # Save the new document in the current working directory
            file_name = f"{base_name}{idx}.docx"
            new_doc.save(file_name)
            print(f"Successfully created: {file_name}")


    # Example usage (replace with your actual file path)
    if __name__ == "__main__":
        split_doc_by_keyword("your_large_file.docx")
How It Works:
- Reading the Original File: We load the input .docx and loop through every paragraph.
- Grouping Sections: Every time we hit a paragraph starting with ABCD:, we wrap up the current section (if one exists) and start a new one. All subsequent paragraphs are added to this section until the next ABCD: is found.
- Writing New Files: Each section is saved to a separate .docx file with a sequential name (ABCD1, ABCD2, etc.). We also copy over paragraph formatting to keep the look of your original content intact.
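The grouping step can be tried on its own, without any Word file. The sketch below is only an illustration: it assumes a hypothetical helper named group_by_keyword and uses a plain list of strings to stand in for paragraph text. It is not part of python-docx.

```python
def group_by_keyword(lines, keyword="ABCD:"):
    """Group strings into sections that each start at a line beginning with keyword.

    Hypothetical helper for illustration only. Lines that appear before the
    first keyword are skipped, as in the split function above.
    """
    sections = []
    current = []
    for line in lines:
        if line.strip().startswith(keyword):
            # A new keyword closes the section being built, if any
            if current:
                sections.append(current)
            current = [line]
        elif current:
            # Only collect lines once the first keyword has been seen
            current.append(line)
    # Keep the last section after the loop ends
    if current:
        sections.append(current)
    return sections


# Made-up sample input standing in for paragraph text
sample = ["preface", "ABCD: first", "body 1", "ABCD: second", "body 2", "body 3"]
for number, section in enumerate(group_by_keyword(sample), start=1):
    print(number, section)
```

Testing the grouping on strings first makes it easier to confirm where sections begin and end before any files are written.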
Quick Adjustments You Might Need:
- Case Insensitivity: If your keyword might appear in different cases (e.g., abcd: or AbCd:), change the check to para.text.strip().upper().startswith(keyword.upper()).
- Handling Complex Content: If your document includes tables, images, or bullet points, you’ll need to extend the code to copy those elements too. Let me know if you need help with that!
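As a rough sketch of the case-insensitivity adjustment, the check can be pulled into a small helper. The name starts_with_keyword and the ignore_case parameter are hypothetical, and str.casefold() is swapped in for upper() here because it tends to behave more predictably with non-ASCII text. For plain ASCII keywords like ABCD: the two give the same result.

```python
def starts_with_keyword(text, keyword="ABCD:", ignore_case=True):
    """Return True if text, ignoring surrounding whitespace, starts with keyword.

    Hypothetical helper for illustration; pass para.text as the first argument.
    """
    stripped = text.strip()
    if ignore_case:
        # casefold() normalizes case on both sides before comparing
        return stripped.casefold().startswith(keyword.casefold())
    return stripped.startswith(keyword)


# Made-up sample strings standing in for paragraph text
for candidate in ["ABCD: one", "  abcd: two", "AbCd: three", "XABCD: four"]:
    print(repr(candidate), starts_with_keyword(candidate))
```

Inside the split function, the condition would then read starts_with_keyword(para.text, keyword).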
This question originally comes from Stack Exchange; it was asked by Vaibhav Tyagi.




