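Django technical question: converting Doc/Docx to PDF in Python without Word
Hey, I hit this exact problem when deploying a Django project that relied on Word for document conversion: installing Office on a server is a hassle and tends to cause permission problems. Here are a few workable options that don't depend on Word, all of which I've used myself:
Option 1: Lightweight conversion for Docx (python-docx + ReportLab)
If your documents are mostly Docx, a pure-Python approach is the least trouble: read the content with python-docx, then generate the PDF with ReportLab. The downside is that complex formatting is lost (tables and images are dropped, because only the paragraph text is read), so it suits plain-text documents: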
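    from io import BytesIO

    from docx import Document
    from reportlab.pdfgen import canvas


    def docx_to_pdf(docx_file):
        """Render the paragraph text of a .docx file into a plain PDF."""
        doc = Document(docx_file)
        pdf_buffer = BytesIO()
        c = canvas.Canvas(pdf_buffer)
        # Base font and start position. Helvetica only covers Latin text;
        # CJK content needs a registered TrueType font.
        c.setFont("Helvetica", 10)
        y_pos = 750  # start near the top; y decreases as we move down
        for para in doc.paragraphs:
            if para.text.strip():
                c.drawString(50, y_pos, para.text)
                y_pos -= 15
                # Start a new page when the bottom margin is reached
                if y_pos < 50:
                    c.showPage()
                    c.setFont("Helvetica", 10)
                    y_pos = 750
        c.save()
        pdf_buffer.seek(0)
        return pdf_buffer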
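One gap in docx_to_pdf is that drawString never wraps, so a long paragraph runs off the right edge of the page. Below is a minimal sketch of the wrapping and paging logic using only the standard library. The `paginate` name and the 90-character width are assumptions for illustration (a rough guess for 10 pt Helvetica, not a measured value); the 47 lines per page follow from stepping y from 750 down to 50 in steps of 15.

```python
import textwrap


def paginate(paragraphs, max_chars=90, lines_per_page=47):
    """Split paragraph strings into pages, each a list of wrapped lines."""
    pages, current = [], []
    for text in paragraphs:
        if not text.strip():
            continue  # skip empty paragraphs, as docx_to_pdf does
        for line in textwrap.wrap(text, width=max_chars):
            current.append(line)
            if len(current) == lines_per_page:
                pages.append(current)
                current = []
    if current:
        pages.append(current)
    return pages
```

Each inner list could then be drawn line by line with c.drawString, calling c.showPage() between pages. Character counts are only a stand-in for real text width, so treat the limit as something to tune.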
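Option 2: Full-format support from the command line (LibreOffice)
To handle both Doc and Docx, LibreOffice is the best choice: it is cross-platform, free, and converts almost every Office format to PDF.
First install LibreOffice on the server (Ubuntu: sudo apt install libreoffice; CentOS: sudo yum install libreoffice), then call its command line through subprocess: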
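    import os
    import subprocess
    from io import BytesIO


    def doc_to_pdf(input_file_path):
        """Convert a .doc/.docx file to PDF with headless LibreOffice."""
        # splitext handles both extensions; chaining
        # .replace('.doc', '.pdf') would turn "a.docx" into "a.pdfx".
        output_path = os.path.splitext(input_file_path)[0] + ".pdf"
        out_dir = os.path.dirname(output_path) or "."
        # Run LibreOffice without a GUI
        cmd = [
            "libreoffice", "--headless",
            "--convert-to", "pdf",
            "--outdir", out_dir,
            input_file_path,
        ]
        subprocess.run(cmd, check=True, capture_output=True)
        # Read the converted PDF into an in-memory buffer
        with open(output_path, "rb") as f:
            pdf_buffer = BytesIO(f.read())
        # Remove the generated file (keep it if you need it)
        os.remove(output_path)
        return pdf_buffer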
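When several requests convert files at the same time, writing the PDF next to the input file can collide with other conversions or leave files behind after an error. The sketch below points --outdir at a per-call temporary directory instead. `convert_in_tempdir` and its `runner` parameter are hypothetical names; `runner` defaults to subprocess.run and exists only so the logic can be exercised without LibreOffice installed.

```python
import os
import subprocess
import tempfile
from io import BytesIO


def convert_in_tempdir(input_path, runner=subprocess.run):
    """Convert input_path to PDF inside a throwaway output directory.

    The directory and the PDF in it are deleted when the function returns.
    """
    with tempfile.TemporaryDirectory() as workdir:
        cmd = [
            "libreoffice", "--headless",
            "--convert-to", "pdf",
            "--outdir", workdir,
            input_path,
        ]
        runner(cmd, check=True, capture_output=True)
        # The output file takes the input's base name with a .pdf extension
        base = os.path.splitext(os.path.basename(input_path))[0]
        with open(os.path.join(workdir, base + ".pdf"), "rb") as f:
            return BytesIO(f.read())
```

In a view this could stand in for doc_to_pdf, for example convert_in_tempdir(doc.temporary_file_path()).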
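Option 3: unoconv, a simpler way to call LibreOffice
unoconv is a wrapper around LibreOffice (which still has to be installed). The call is shorter, and it can convert straight from a file stream without writing a temporary file to disk:
First install unoconv (Ubuntu: sudo apt install unoconv), then use code like this: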
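    import subprocess
    from io import BytesIO


    def doc_to_pdf_stream(input_file):
        """Pipe a document through unoconv and return the PDF as BytesIO."""
        # Feed the file on stdin and read the PDF from stdout.
        # Assumption: your unoconv release supports --stdin;
        # check `unoconv --help` before relying on it.
        process = subprocess.Popen(
            ["unoconv", "-f", "pdf", "--stdin", "--stdout"],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
        )
        stdout, stderr = process.communicate(input=input_file.read())
        if process.returncode != 0:
            raise RuntimeError(
                f"Conversion failed: {stderr.decode(errors='replace')}"
            )
        return BytesIO(stdout)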
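Because the PDF arrives over a pipe here, an empty or non-PDF result is easy to miss until the merge step fails. A small sanity check on the bytes can catch that earlier. This is a sketch: `ensure_pdf` is a hypothetical helper, and it applies a strict rule (the data must begin with the `%PDF-` header), which is what freshly generated files normally look like.

```python
from io import BytesIO

PDF_MAGIC = b"%PDF-"  # every PDF file begins with this header


def ensure_pdf(data, source_name="<stream>"):
    """Return data wrapped in BytesIO, or raise if it is not a PDF."""
    if not data.startswith(PDF_MAGIC):
        raise ValueError(
            f"{source_name}: converter output is not a PDF "
            f"({len(data)} bytes received)"
        )
    return BytesIO(data)
```

For example, the last line of doc_to_pdf_stream could become return ensure_pdf(stdout).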
Final step: merging the PDFs
Whichever conversion option you pick, finish by merging all the PDFs with PyPDF2:
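    from io import BytesIO

    # PdfMerger is the name in PyPDF2 2.x/3.x
    # (older releases call it PdfFileMerger)
    from PyPDF2 import PdfMerger


    def merge_pdfs(pdf_buffers):
        """Concatenate PDF buffers, in order, into a single BytesIO."""
        merger = PdfMerger()
        for buffer in pdf_buffers:
            merger.append(buffer)
        merged_buffer = BytesIO()
        merger.write(merged_buffer)
        merger.close()
        merged_buffer.seek(0)
        return merged_buffer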
Example of tying the steps together in a Django view:
    from io import BytesIO

    from django.http import HttpResponse


    def deliver_merged_pdf(request):
        # 1. Fetch the document list from the DMS (assumed to be temporary
        #    file objects; get_dms_documents() is defined elsewhere)
        dms_docs = get_dms_documents()
        # 2. Convert each document to PDF
        pdf_buffers = []
        for doc in dms_docs:
            name = doc.name.lower()  # match .DOCX as well as .docx
            if name.endswith('.docx'):
                pdf_buf = docx_to_pdf(doc)
            elif name.endswith('.doc'):
                pdf_buf = doc_to_pdf(doc.temporary_file_path())
            elif name.endswith('.pdf'):
                pdf_buf = BytesIO(doc.read())
            else:
                # Skip unsupported formats (or raise an error)
                continue
            pdf_buffers.append(pdf_buf)
        # 3. Merge and return the result to the user
        merged_pdf = merge_pdfs(pdf_buffers)
        response = HttpResponse(merged_pdf.getvalue(), content_type='application/pdf')
        response['Content-Disposition'] = 'attachment; filename="merged_docs.pdf"'
        return response
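The extension checks in the view can also be written as a lookup table, which keeps the routing in one place and makes case handling explicit. This is only a sketch: `ROUTES` and `pick_route` are hypothetical names, and the string labels stand in for the converter functions from the options above.

```python
import os

# Map a lowercase extension to the conversion route it should take
ROUTES = {
    ".docx": "python-docx",
    ".doc": "libreoffice",
    ".pdf": "passthrough",
}


def pick_route(filename):
    """Return the route label for filename, or None if unsupported."""
    ext = os.path.splitext(filename)[1].lower()
    return ROUTES.get(ext)
```

Because only the final extension is inspected, a name like "archive.doc.pdf" is treated as a PDF, and a None result marks a file to skip or report.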
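Things to watch out for
- Make sure the account that runs LibreOffice on the server has sufficient permissions (it needs a writable home or profile directory), otherwise the process can hang
- For large files, prefer temporary files over in-memory buffers to reduce memory use
- Serverless and cloud environments (AWS Lambda, for example) need a specially configured LibreOffice runtime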
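The first point can be sketched in code: give LibreOffice a profile directory it is allowed to write to, and bound the run time so a stuck process cannot block the request forever. `libreoffice_cmd` and `run_bounded` are hypothetical helper names and the 120-second default is an arbitrary assumption; `-env:UserInstallation` is the LibreOffice switch for choosing the profile location.

```python
import subprocess
import sys
from pathlib import Path


def libreoffice_cmd(input_path, outdir, profile_dir):
    """Build a headless conversion command with its own profile directory."""
    profile_uri = Path(profile_dir).resolve().as_uri()  # file:///...
    return [
        "libreoffice",
        f"-env:UserInstallation={profile_uri}",
        "--headless",
        "--convert-to", "pdf",
        "--outdir", str(outdir),
        str(input_path),
    ]


def run_bounded(cmd, timeout=120):
    """Run cmd; raise RuntimeError if it exceeds timeout seconds."""
    try:
        # subprocess.run kills the child process when the timeout expires
        return subprocess.run(
            cmd, check=True, capture_output=True, timeout=timeout
        )
    except subprocess.TimeoutExpired as exc:
        raise RuntimeError(
            f"conversion timed out after {timeout}s"
        ) from exc


if __name__ == "__main__":
    # Harmless demo that does not need LibreOffice installed
    result = run_bounded([sys.executable, "-c", "print('ok')"])
    print(result.stdout.decode().strip())
```

A failed conversion still raises subprocess.CalledProcessError because of check=True, so the caller can tell a timeout apart from an ordinary error.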
The question comes from Stack Exchange; it was asked by pseudoku.




