Help needed: Python script for batch-converting Docx to Txt generates only 19 files
Hey there! Let's figure out why your Python script is bailing after only 19 files, especially since it gets stuck on 1.docx (the output stops at "Reading 1..."). Here are the most likely culprits and fixes to try:
1. The Problem File: 1.docx Is Probably Corrupted or Has Hidden Issues
Even though you said there are no images/tables, OCR-generated docs often have weird hidden formatting, broken paragraph markers, or invisible characters that trip up the python-docx library. Let's test this file in isolation first:
Create a tiny script that only processes 1.docx:
    from docx import Document

    # Test just the problematic file
    doc = Document("1.docx")
    with open("test_1.txt", "w", encoding="utf-8") as f:
        for idx, para in enumerate(doc.paragraphs):
            # Print a snippet of each paragraph
            print(f"Processing paragraph {idx}: {para.text[:50]}...")
            f.write(para.text + "\n")
Run this. If it freezes on a specific paragraph, you've found the issue. To fix it:
- Open 1.docx in Microsoft Word, go to File > Save As and save it as a new docx file (this often repairs minor corruption).
- If that doesn't work, copy all the text manually into a new blank docx, then try again.
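Before re-saving the file, it can help to find out whether it is damaged at the container level. A .docx file is a ZIP archive, so the standard library can run a basic integrity check without involving python-docx. The sketch below is only a rough diagnostic: the helper name check_docx_archive is made up for illustration, the word/document.xml check assumes the usual layout that Word produces, and a file that passes can still contain content python-docx struggles with.

```python
import zipfile


def check_docx_archive(path):
    """Return (ok, message) for a basic ZIP-level sanity check of a .docx file.

    Hypothetical helper for illustration: it only inspects the container,
    not the validity of the XML stored inside it.
    """
    if not zipfile.is_zipfile(path):
        return False, "not a ZIP archive (file is damaged or not a real .docx)"
    with zipfile.ZipFile(path) as archive:
        # testzip() reads every member and returns the first one with a bad CRC
        bad_member = archive.testzip()
        if bad_member is not None:
            return False, f"corrupted member: {bad_member}"
        # Assumption: the document body is stored as word/document.xml,
        # which is the usual layout for files saved by Word
        if "word/document.xml" not in archive.namelist():
            return False, "word/document.xml is missing"
    return True, "archive looks intact"
```

Calling check_docx_archive("1.docx") and printing the result shows whether the archive itself is broken. If it looks intact, the trouble is more likely in the document content than in the file container.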
2. Your Batch Loop Might Be Getting Stuck Indefinitely (Not Crashing)
If python-docx hits a problematic file, it might not throw an error; it could just hang. To keep one file from stopping the entire batch, give each file a time limit. A thread cannot be forcibly stopped in Python, so a hung thread would keep blocking the files queued behind it. Running each file in its own process lets you terminate it on timeout:

    import os
    from multiprocessing import Process

    from docx import Document


    def process_single_docx(filename):
        print(f"Starting {filename}...")
        try:
            doc = Document(filename)
            txt_filename = os.path.splitext(filename)[0] + ".txt"
            with open(txt_filename, "w", encoding="utf-8", errors="replace") as txt_file:
                for para in doc.paragraphs:
                    txt_file.write(para.text + "\n")
            print(f"Successfully converted {filename}")
        except Exception as e:
            print(f"Failed to convert {filename}: {e}")


    # The __main__ guard is required for multiprocessing on Windows
    if __name__ == "__main__":
        for filename in os.listdir("."):
            if not filename.lower().endswith(".docx"):
                continue
            # One process per file, one file at a time
            worker = Process(target=process_single_docx, args=(filename,))
            worker.start()
            # Wait 30 seconds for each file; adjust if needed
            worker.join(timeout=30)
            if worker.is_alive():
                # Still running after the timeout: kill it and move on
                worker.terminate()
                worker.join()
                print(f"Timeout processing {filename} - this file is likely corrupted. Skipping...")

This way, if a file hangs, the script terminates that worker after 30 seconds, skips the file, and keeps processing the rest.
3. Check for Loop Logic Mistakes
Double-check your original batch script to make sure:
- You're not accidentally using a break or return statement that exits the loop early.
- You're correctly iterating over all docx files in the directory (e.g., os.listdir() doesn't miss any, or you're not filtering files incorrectly).
- You're handling filenames with spaces or special characters (though you said they're numbered, so this is less likely).
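One way to see exactly which files the loop skipped is to compare the .docx files in the folder against the .txt files that were produced. The sketch below assumes each N.docx is written to N.txt in the same folder. The function name find_unconverted is hypothetical, not something from your script.

```python
import os


def find_unconverted(directory="."):
    """Return the .docx files in `directory` that have no matching .txt file."""
    names = os.listdir(directory)
    # Compare stems case-insensitively so "1.DOCX" and "1.txt" count as a pair
    txt_stems = {
        os.path.splitext(name)[0].lower()
        for name in names
        if name.lower().endswith(".txt")
    }
    missing = [
        name
        for name in names
        if name.lower().endswith(".docx")
        and os.path.splitext(name)[0].lower() not in txt_stems
    ]
    return sorted(missing)
```

Printing find_unconverted(".") after a run lists the files that never got converted. If every missing file comes after 1.docx in processing order, the loop most likely stopped at that file instead of filtering the others out.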
4. Fix Potential Encoding Hiccups
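OCR might insert odd characters that cause issues when writing to the txt file. Adding errors="replace" to your open() call (like in the script above) ensures that characters that cannot be encoded are replaced with ? instead of raising a UnicodeEncodeError that stops the script. With encoding="utf-8" this rarely triggers, because UTF-8 can encode every character except unpaired surrogates, so treat it as a cheap safeguard rather than the likely fix.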
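To make the effect of errors="replace" concrete, here is a minimal sketch. The helper name write_text_safely is made up for illustration, and the unpaired surrogate in the usage note is just one example of a character UTF-8 cannot encode, not something known to be in your files.

```python
def write_text_safely(path, text):
    """Write text as UTF-8, replacing characters that cannot be encoded with '?'."""
    with open(path, "w", encoding="utf-8", errors="replace") as handle:
        handle.write(text)
```

For example, write_text_safely("out.txt", "a\ud800b") writes a?b to the file, whereas the same write without errors="replace" raises UnicodeEncodeError.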
Start by testing the problematic 1.docx on its own, since that file is the most likely root cause. Once you fix or skip it, your batch should be able to process the rest of your docs.
The question comes from Stack Exchange; asked by k.b




