How can I convert a PDF to .docx with Python? My current code returns Output[1] and runs into a problem
Hey there, let's tackle your PDF-to-DOCX conversion problem. I notice you're using LibreOffice's lowriter via subprocess, but running into exceptions. Let's go through potential fixes for your current approach, plus a simpler pure-Python alternative that might save you headaches.
Issues with Your Current Code
Your code has a few common pitfalls that could be causing the errors:
- Path escaping & shell risks: Using shell=True with formatted strings can break if your file paths contain spaces or special characters. It also introduces security risks if the folder contains untrusted filenames.
- Lack of error handling: You aren't catching exceptions or checking the exit code of the lowriter process, so you can't see exactly why it's failing (e.g., lowriter not found, corrupted PDF, permission issues).
- Incomplete conversion command: You didn't specify an output directory, which might lead to files being saved in unexpected places.
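To make the first point concrete, here is a small sketch of what happens to a path with spaces. The file name is hypothetical, and shlex.split is only used to approximate how a shell would break a formatted command string into arguments:

```python
import shlex

# Hypothetical file name containing spaces and parentheses.
pdf_path = "/my/pdf/folder/annual report (final).pdf"

# Naive string formatting: once tokenized on whitespace, the path falls apart.
# (A real shell would additionally treat the unquoted parentheses as syntax.)
naive_command = "lowriter --convert-to docx {}".format(pdf_path)
naive_tokens = shlex.split(naive_command)

# Argument list: the path is passed as exactly one argument, no quoting needed.
safe_args = ["lowriter", "--convert-to", "docx", pdf_path]

print("naive:", naive_tokens[3:])
print("safe: ", safe_args[3:])
```

With the list form, subprocess hands each element to the program untouched, so no escaping logic is needed on your side.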
Fixed Version of Your LibreOffice Approach
Here's a revised version of your code that addresses these issues:
```python
import os
import subprocess
from subprocess import CalledProcessError

pdf_folder = '/my/pdf/folder'

for top, dirs, files in os.walk(pdf_folder):
    for filename in files:
        # Match case-insensitively for PDFs
        if filename.lower().endswith('.pdf'):
            abspath = os.path.join(top, filename)
            output_dir = os.path.dirname(abspath)
            try:
                # Use a list of arguments instead of shell=True to avoid escaping issues
                subprocess.check_call([
                    'lowriter', '--invisible', '--convert-to', 'docx',
                    abspath, '--outdir', output_dir
                ])
                print(f"✅ Successfully converted: {abspath}")
            except CalledProcessError as e:
                print(f"❌ Failed to convert {abspath}: Process returned error code {e.returncode}")
            except FileNotFoundError:
                print("❌ Error: 'lowriter' not found. Ensure LibreOffice is installed and added to your system PATH.")
```
Key improvements here:
- Uses a list of arguments instead of shell=True to handle spaces/special characters safely.
- Adds error handling to catch common issues like a missing lowriter executable or failed conversions.
- Explicitly sets the output directory to match the original PDF's location.
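If the exit code alone doesn't tell you enough, a variant that captures stderr can help. The following is a minimal sketch, not a drop-in replacement: run_and_report and libreoffice_args are hypothetical helper names. The --headless and --infilter=writer_pdf_import options are an assumption on my part that LibreOffice needs to be told to open the PDF with its Writer import filter, so check them against your installed version:

```python
import subprocess
import sys


def run_and_report(args):
    """Run a command without a shell and return (ok, diagnostic_text)."""
    try:
        result = subprocess.run(args, capture_output=True, text=True)
    except FileNotFoundError:
        return False, f"executable not found: {args[0]}"
    if result.returncode != 0:
        # Prefer the program's own error message; fall back to the exit code.
        return False, result.stderr.strip() or f"exit code {result.returncode}"
    return True, result.stdout.strip()


def libreoffice_args(pdf_path, output_dir):
    """Build the argument list (hypothetical helper; flags are assumptions to verify)."""
    return [
        "lowriter", "--headless",
        "--infilter=writer_pdf_import",  # assumption: forces the Writer PDF import filter
        "--convert-to", "docx",
        "--outdir", output_dir,
        pdf_path,
    ]


if __name__ == "__main__":
    # Self-contained demo: a child Python process that fails with a message on stderr.
    ok, detail = run_and_report([sys.executable, "-c", "import sys; sys.exit('boom')"])
    print(ok, detail)
```

In the loop above you would call run_and_report(libreoffice_args(abspath, output_dir)) and print the returned text whenever the first element is False.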
Alternative: Pure-Python Conversion with pdf2docx
If you don't want to rely on LibreOffice, the pdf2docx library is a great pure-Python alternative. It handles most standard PDFs well and integrates seamlessly into your Python script.
Step 1: Install the library
pip install pdf2docx
Step 2: Conversion Code
```python
import os
from pdf2docx import Converter

pdf_folder = '/my/pdf/folder'

for top, dirs, files in os.walk(pdf_folder):
    for filename in files:
        if filename.lower().endswith('.pdf'):
            abspath = os.path.join(top, filename)
            output_path = os.path.splitext(abspath)[0] + '.docx'
            try:
                # Initialize converter and run conversion
                converter = Converter(abspath)
                try:
                    converter.convert(output_path, start=0, end=None)  # Convert all pages
                finally:
                    # Release the PDF even if the conversion fails
                    converter.close()
                print(f"✅ Converted: {abspath} → {output_path}")
            except Exception as e:
                print(f"❌ Error converting {abspath}: {e}")
```
Pros & Cons of pdf2docx
- Pros: No external application such as LibreOffice to install (its Python dependencies come in through pip), full control within Python, works cross-platform.
- Cons: May struggle with highly complex PDFs (e.g., multi-column layouts, embedded charts). Scanned PDFs are just images, so you'd need to run OCR first.
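Because layout quality varies from PDF to PDF, it can be worth converting just the first couple of pages before committing to a whole folder. Here is a rough sketch: preview_path and convert_preview are hypothetical helper names, and it assumes the pages argument of Converter.convert takes zero-based page indices and that the PDF has at least page_count pages:

```python
from pathlib import Path


def preview_path(pdf_path):
    """Return the preview output path, e.g. report.pdf -> report.preview.docx."""
    pdf_path = Path(pdf_path)
    return pdf_path.with_name(pdf_path.stem + ".preview.docx")


def convert_preview(pdf_path, page_count=2):
    """Convert only the first page_count pages so the result can be inspected."""
    # Imported lazily so the path helper works even without pdf2docx installed.
    from pdf2docx import Converter

    output = preview_path(pdf_path)
    converter = Converter(str(pdf_path))
    try:
        # Assumption: pages are zero-based indices and all of them exist in the PDF.
        converter.convert(str(output), pages=list(range(page_count)))
    finally:
        converter.close()
    return output


if __name__ == "__main__":
    print(preview_path("/my/pdf/folder/report.pdf"))
```

If the preview looks wrong for a given file, that PDF is probably a candidate for the LibreOffice route or for OCR instead.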
Give these approaches a try, and you should be able to resolve your conversion issues. If you still hit specific errors, sharing the exact exception message would help narrow things down further!
The question comes from Stack Exchange; asked by Also




