
How do I convert a PDF to .docx with Python? My current code returns Output[1] and runs into problems

Fixing PDF to DOCX Conversion Issues in Python

Hey there, let's tackle your PDF-to-DOCX conversion problem. I notice you're using LibreOffice's lowriter via subprocess, but running into exceptions. Let's go through potential fixes for your current approach, plus a simpler pure-Python alternative that might save you headaches.

Issues with Your Current Code

Your code has a few common pitfalls that could be causing the errors:

  • Path escaping & shell risks: Using shell=True with formatted strings can break if your file paths have spaces or special characters. It also introduces security risks if the folder contains untrusted filenames.
  • Lack of error handling: You aren't catching exceptions or checking the exit code of the lowriter process, so you can't see exactly why it's failing (e.g., lowriter not found, corrupted PDF, permission issues).
  • Incomplete conversion command: You didn't specify an output directory, so LibreOffice writes the converted files to the current working directory rather than next to each PDF.
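
To make the first pitfall concrete, here is a small sketch of what happens to a path containing spaces. It uses shlex.split as a stand-in for the shell's word splitting (an approximation, not the shell itself), and the helper names are made up for illustration.

```python
import shlex


def naive_command(pdf_path):
    # Plain string formatting with no quoting, as is common with shell=True
    return 'lowriter --convert-to docx {}'.format(pdf_path)


def quoted_command(pdf_path):
    # Same string, but with the path quoted for a POSIX shell
    return 'lowriter --convert-to docx {}'.format(shlex.quote(pdf_path))


def argument_list(pdf_path):
    # No shell involved: each list item reaches the program as one argument
    return ['lowriter', '--convert-to', 'docx', pdf_path]


def shell_words(command):
    # Approximate how a POSIX shell would split the command into words
    return shlex.split(command)


if __name__ == '__main__':
    path = '/my/pdf folder/report 1.pdf'
    print(shell_words(naive_command(path)))   # the path is broken into pieces
    print(shell_words(quoted_command(path)))  # the path survives as one word
    print(argument_list(path))                # the path is one argument by construction
```

Quoting fixes the splitting, but passing a list avoids the shell entirely, which is why the revised code takes that route.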

Fixed Version of Your LibreOffice Approach

Here's a revised version of your code that addresses these issues:

import os
import subprocess
from subprocess import CalledProcessError

pdf_folder = '/my/pdf/folder'

for top, dirs, files in os.walk(pdf_folder):
    for filename in files:
        # Match case-insensitively for PDFs
        if filename.lower().endswith('.pdf'):
            abspath = os.path.join(top, filename)
            output_dir = os.path.dirname(abspath)
            
            try:
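                # Use a list of arguments instead of shell=True to avoid escaping issues
                subprocess.check_call([
                    'lowriter',
                    '--invisible',
                    # LibreOffice normally opens PDFs in Draw, which has no docx export;
                    # this import filter asks it to load the PDF into Writer instead
                    '--infilter=writer_pdf_import',
                    '--convert-to', 'docx',
                    '--outdir', output_dir,
                    abspath
                ])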
                print(f"✅ Successfully converted: {abspath}")
            except CalledProcessError as e:
                print(f"❌ Failed to convert {abspath}: Process returned error code {e.returncode}")
            except FileNotFoundError:
                print("❌ Error: 'lowriter' not found. Ensure LibreOffice is installed and added to your system PATH.")

Key improvements here:

  • Uses a list of arguments instead of shell=True to handle spaces/special characters safely.
  • Adds error handling to catch common issues like missing lowriter or failed conversions.
  • Explicitly sets the output directory to match the original PDF's location.

Alternative: Pure-Python Conversion with pdf2docx

If you don't want to rely on LibreOffice, the pdf2docx library is a solid alternative that runs entirely inside your Python script. It handles most text-based PDFs well.

Step 1: Install the library

pip install pdf2docx

Step 2: Conversion Code

import os
from pdf2docx import Converter

pdf_folder = '/my/pdf/folder'

for top, dirs, files in os.walk(pdf_folder):
    for filename in files:
        if filename.lower().endswith('.pdf'):
            abspath = os.path.join(top, filename)
            output_path = os.path.splitext(abspath)[0] + '.docx'
            
            converter = None
            try:
                # Initialize converter and run conversion
                converter = Converter(abspath)
                converter.convert(output_path, start=0, end=None)  # Convert all pages
                print(f"✅ Converted: {abspath} → {output_path}")
            except Exception as e:
                print(f"❌ Error converting {abspath}: {e}")
            finally:
                # Release the PDF file handle even if the conversion failed
                if converter is not None:
                    converter.close()

Pros & Cons of pdf2docx

  • Pros: No separate application to install (its Python dependencies, such as PyMuPDF, come in through pip), full control within Python, works cross-platform.
  • Cons: May struggle with complex PDFs such as multi-column layouts or embedded charts. Scanned PDFs are images with no text layer, so you'd need to run OCR first.

Give these approaches a try, and you should be able to resolve your conversion issues. If you still hit specific errors, sharing the exact exception message would help narrow things down further!

The question in this post comes from Stack Exchange; it was asked by Also.
