Help Request: Python Script for Batch-Converting Docx to Txt Produces Only 19 Files

阿华AIGC实验室

2026-5-20

Troubleshooting Your Docx-to-Txt Batch Script Stuck at 19 Files

Hey there! Let's figure out why your Python script is bailing after only 19 files, especially since it gets stuck on 1.docx right after printing "Reading 1...". Here are the most likely culprits and fixes to try:

1. The Problem File: `1.docx` Is Probably Corrupted or Has Hidden Issues

Even though you said there are no images/tables, OCR-generated docs often have weird hidden formatting, broken paragraph markers, or invisible characters that trip up the python-docx library. Let's test this file in isolation first:

Create a tiny script that only processes 1.docx:

from docx import Document

# Test just the problematic file
doc = Document("1.docx")
with open("test_1.txt", "w", encoding="utf-8") as f:
    for idx, para in enumerate(doc.paragraphs):
        print(f"Processing paragraph {idx}: {para.text[:50]}...")  # Print a snippet of each paragraph
        f.write(para.text + "\n")

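Run this. If it freezes on a specific paragraph, you've found the issue. If nothing is printed at all, the hang happens while opening the file, before any paragraph is read. Either way, to fix it: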

Open 1.docx in Microsoft Word, go to File > Save As and save it as a new docx file (this often repairs minor corruption).
If that doesn't work, copy all the text manually into a new blank docx, then try again.
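Before re-saving, it can also help to check whether the file is structurally intact. A .docx file is a zip archive, so the standard-library zipfile module can look for obvious damage. The sketch below is only a rough check: the helper name check_docx_archive is made up for illustration, and it assumes the document body is stored at word/document.xml, which is where Word normally puts it.

```python
import zipfile


def check_docx_archive(path):
    """Return a short description of the first problem found, or None.

    Hypothetical helper for illustration; it only inspects the zip
    container and does not validate the Word XML itself.
    """
    # A .docx must be a readable zip archive.
    if not zipfile.is_zipfile(path):
        return "not a valid zip archive"
    with zipfile.ZipFile(path) as archive:
        # testzip() returns the name of the first member whose CRC check
        # fails, or None if every member reads back cleanly.
        bad_member = archive.testzip()
        if bad_member is not None:
            return f"corrupt member: {bad_member}"
        # Assumption: the main document part lives at word/document.xml.
        if "word/document.xml" not in archive.namelist():
            return "missing word/document.xml"
    return None
```

Calling check_docx_archive("1.docx") and getting None back means the container looks fine, so the problem is more likely in the document content itself.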

2. Your Batch Loop Might Be Getting Stuck Indefinitely (Not Crashing)

If python-docx hits a problematic file, it might not throw an error; it could just hang. A thread-based timeout does not help here, because Python cannot kill a stuck thread: the hung worker would block every file queued behind it and keep the script from exiting. Run each file in its own process instead, so a stuck conversion can be terminated:

import os
from multiprocessing import Process
from docx import Document

def process_single_docx(filename):
    print(f"Starting {filename}...")
    try:
        doc = Document(filename)
        txt_filename = os.path.splitext(filename)[0] + ".txt"
        with open(txt_filename, "w", encoding="utf-8", errors="replace") as txt_file:
            for para in doc.paragraphs:
                txt_file.write(para.text + "\n")
        print(f"Successfully converted {filename}")
    except Exception as e:
        print(f"Failed to convert {filename}: {e}")

# The __main__ guard is required on Windows, where each worker re-imports this script
if __name__ == "__main__":
    # Process files one at a time, each in its own process, with a timeout
    for filename in sorted(os.listdir(".")):
        if not filename.lower().endswith(".docx"):
            continue
        worker = Process(target=process_single_docx, args=(filename,))
        worker.start()
        # Wait 30 seconds for each file; adjust if needed
        worker.join(timeout=30)
        if worker.is_alive():
            # Still running after the timeout: kill the worker and move on
            worker.terminate()
            worker.join()
            print(f"Timeout processing {filename} - this file is likely corrupted. Skipping...")

This way, if a file hangs, the script terminates that worker after 30 seconds and keeps processing the rest. A partially written txt file may be left behind for the skipped file.

3. Check for Loop Logic Mistakes

Double-check your original batch script to make sure:

You're not accidentally using a break or return statement that exits the loop early.
You're correctly iterating over all docx files in the directory (e.g., os.listdir() doesn't miss any, or you're not filtering files incorrectly).
You're handling filenames with spaces or special characters (though you said they're numbered, so this is less likely).
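To see exactly which files the loop skipped, compare the .docx files in the folder against the .txt files that were actually written. This is a minimal sketch, assuming each output sits next to its source with the same base name; find_unconverted is a hypothetical helper name. It also ignores names starting with ~$, which are the lock files Word creates while a document is open; they end in .docx but are not real documents.

```python
from pathlib import Path


def find_unconverted(folder="."):
    """Return the sorted names of .docx files with no matching .txt file."""
    missing = []
    for path in sorted(Path(folder).iterdir()):
        # Match the extension case-insensitively (.docx, .DOCX, ...).
        if path.suffix.lower() != ".docx":
            continue
        # Skip Word's "~$name.docx" lock files.
        if path.name.startswith("~$"):
            continue
        # Assumption: 1.docx is converted to 1.txt in the same folder.
        if not path.with_suffix(".txt").exists():
            missing.append(path.name)
    return missing
```

If the list starts at the first file after the 19 that worked, the loop most likely stopped there rather than filtering files out.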

4. Fix Potential Encoding Hiccups

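OCR might insert odd characters that cause issues when writing to the txt file. Always pass encoding="utf-8" to open(): without it, Python uses the platform's default encoding, which on many Windows setups is a legacy code page that raises UnicodeEncodeError on characters it cannot represent. Adding errors="replace" to your open() call (like in the script above) is a further safeguard: any character that still cannot be encoded is written as ? instead of crashing the script. Note that an encoding problem raises an error rather than hanging, so on its own it would not explain a freeze.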
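Here is a small sketch of what errors="replace" does on write. The helper name write_text_safely is made up for illustration. A lone surrogate such as "\ud800" is one of the few things UTF-8 cannot encode, so it stands in for whatever debris the OCR step might have left behind.

```python
def write_text_safely(path, text):
    """Write text as UTF-8, substituting ? for anything that cannot be encoded."""
    # errors="replace" applies to encoding on write: an unencodable
    # character becomes "?" instead of raising UnicodeEncodeError.
    with open(path, "w", encoding="utf-8", errors="replace") as handle:
        handle.write(text)
```

With the default errors="strict", the same write would raise UnicodeEncodeError, which the try/except in a batch script would report as a failed file.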

Start by testing the problematic 1.docx on its own, since that is the most likely root cause. Once you fix or skip that file, your batch should get through the rest of your docs.

The question comes from Stack Exchange; asked by k.b