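Help needed: decryption problem when reading and auto-filling the Deutsche Bahn delay-compensation PDF form with Python
Solving the reading and encryption problems with the Deutsche Bahn compensation PDF form
Why can the browser edit the form while the Python libraries raise errors?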
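Many PDFs are protected with permission encryption (an owner password) rather than a document-open password. This restricts operations such as editing and printing, but the file can still be opened without entering anything. Browser PDF viewers unlock such a file silently and usually do not enforce the permission flags, so interactive editing just works. Tools such as PyPDF2 and pdftk are stricter: the file still counts as encrypted, so it has to be decrypted explicitly before the form fields can be accessed, even though no open password was ever set.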
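Before picking a fix, it can help to confirm that the file is encrypted at all. The sketch below uses only the standard library and relies on the fact that an encrypted PDF references an /Encrypt dictionary from its trailer (or cross-reference stream dictionary). The function name looks_encrypted is made up for illustration, and the byte search is only a heuristic: a false positive is possible if the same bytes happen to appear elsewhere in the file.

```python
def looks_encrypted(pdf_path):
    """Heuristically report whether a PDF declares an /Encrypt dictionary.

    This is a raw byte search, not a real PDF parse, so treat the
    result as a hint rather than a definitive answer.
    """
    with open(pdf_path, "rb") as fh:
        data = fh.read()
    return b"/Encrypt" in data


if __name__ == "__main__":
    import sys

    # Usage: python check_encrypt.py fahrgastrechteformular.pdf
    for path in sys.argv[1:]:
        print(path, "->", "encrypted" if looks_encrypted(path) else "not encrypted")
```

If this reports an encrypted file that nevertheless opens without a password prompt, permission encryption is the likely cause.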
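Solution 1: fix the PyPDF2 code by adding a decryption step
Your PyPDF2 code is missing the decryption step. Even when the PDF has no open password, you still need to call decrypt() with an empty string to unlock it. The code below keeps the legacy PyPDF2 API (PdfFileReader, isEncrypted) that your original uses. The modified code: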
    # -*- coding: utf-8 -*-
    from collections import OrderedDict

    from PyPDF2 import PdfFileReader  # legacy PyPDF2 API


    def _getFields(obj, tree=None, retval=None, fileobj=None):
        """
        Extracts field data if this PDF contains interactive form fields.
        The *tree* and *retval* parameters are for recursive use.

        :param fileobj: A file object (usually a text file) to write
            a report to on all interactive form fields found.
        :return: A dictionary where each key is a field name, and each
            value is a :class:`Field<PyPDF2.generic.Field>` object. By
            default, the mapping name is used for keys.
        :rtype: dict, or ``None`` if form data could not be located.
        """
        fieldAttributes = {'/FT': 'Field Type', '/Parent': 'Parent',
                           '/T': 'Field Name', '/TU': 'Alternate Field Name',
                           '/TM': 'Mapping Name', '/Ff': 'Field Flags',
                           '/V': 'Value', '/DV': 'Default Value'}
        if retval is None:
            retval = OrderedDict()
            catalog = obj.trailer["/Root"]
            # get the AcroForm tree
            if "/AcroForm" in catalog:
                tree = catalog["/AcroForm"]
            else:
                return None
        if tree is None:
            return retval

        obj._checkKids(tree, retval, fileobj)
        for attr in fieldAttributes:
            if attr in tree:
                # Tree is a field
                obj._buildField(tree, retval, fileobj, fieldAttributes)
                break

        if "/Fields" in tree:
            fields = tree["/Fields"]
            for f in fields:
                field = f.getObject()
                obj._buildField(field, retval, fileobj, fieldAttributes)

        return retval


    def get_form_fields(infile):
        # Keep the file open while PyPDF2 reads from it, then close it
        with open(infile, 'rb') as fh:
            reader = PdfFileReader(fh)
            # Decryption step: try to unlock with an empty password
            if reader.isEncrypted:
                reader.decrypt('')
            fields = _getFields(reader)
            if fields is None:
                # No AcroForm dictionary found, so there are no fields
                return OrderedDict()
            return OrderedDict((k, v.get('/V', '')) for k, v in fields.items())


    if __name__ == '__main__':
        from pprint import pprint

        pdf_file_name = 'fahrgastrechteformular.pdf'
        pprint(get_form_fields(pdf_file_name))
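The code above does not check whether decrypt('') actually succeeded. In the legacy PyPDF2 API, decrypt() returns 0 on failure, 1 if the password matched the user password, and 2 if it matched the owner password. A minimal sketch of a check built on that return value follows. The helper name unlock and its passwords parameter are made up for illustration, and it only assumes an object with isEncrypted and decrypt(), such as a legacy PdfFileReader.

```python
def unlock(reader, passwords=("",)):
    """Try each candidate password on a legacy PdfFileReader-like object.

    Returns True if the reader is usable afterwards (not encrypted, or
    one of the passwords was accepted), and False otherwise.
    """
    if not reader.isEncrypted:
        return True  # nothing to unlock
    for password in passwords:
        # Legacy PyPDF2: 0 = failed, 1 = user password, 2 = owner password
        if reader.decrypt(password) != 0:
            return True
    return False
```

In get_form_fields this could replace the bare decrypt('') call, for example by raising an error when unlock(reader) returns False instead of continuing with an unreadable document.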
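Solution 2: use a friendlier PDF library, PyMuPDF (fitz)
PyMuPDF generally copes better with this kind of permission-only encryption: no explicit decryption call is needed, and the form fields can be read directly. Install the library first:
    pip install pymupdf
Example code for reading the form fields: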
    import fitz  # PyMuPDF


    def get_form_fields_pymupdf(pdf_path):
        doc = fitz.open(pdf_path)
        form_fields = {}
        # Walk the form fields (widgets) on every page
        for page in doc:
            for field in page.widgets():
                # Record the field name and its current value
                form_fields[field.field_name] = field.field_value
        doc.close()
        return form_fields


    if __name__ == '__main__':
        from pprint import pprint

        pdf_file_name = 'fahrgastrechteformular.pdf'
        pprint(get_form_fields_pymupdf(pdf_file_name))
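This approach does more than read field values. Filling the form later is also straightforward: set field.field_value, call field.update() to apply the change, and then save the document with doc.save().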
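About the pdftk problem
Your pdftk command contains a typo: dum_data_fields should be dump_data_fields. Fixing the typo alone is not enough for a permission-encrypted PDF, though. The allow option does not help here, because it only sets the permissions of an encrypted output file and has no effect on how the input is read. For a protected input, pdftk expects the owner password via input_pw (the password below is a placeholder):
    pdftk.exe fahrgastrechteformular.pdf input_pw <owner_password> dump_data_fields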
By comparison, though, the Python library approach is better suited to an automated workflow.
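If pdftk is used anyway, its dump_data_fields output can be post-processed in Python. The sketch below parses that output, assuming the usual layout of "Key: value" lines with a "---" line before each field. The function name parse_dump_data_fields and the field names in the sample text are illustrative, not taken from the actual form, and multi-line values are not handled.

```python
def parse_dump_data_fields(text):
    """Parse pdftk dump_data_fields output into a list of dicts."""
    fields = []
    current = {}
    for line in text.splitlines():
        if line.strip() == "---":
            # A separator starts a new field record
            if current:
                fields.append(current)
            current = {}
            continue
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip anything that is not a "Key: value" line
        key, value = key.strip(), value.strip()
        if key == "FieldStateOption":
            # This key can repeat (e.g. checkbox states), so collect a list
            current.setdefault(key, []).append(value)
        else:
            current[key] = value
    if current:
        fields.append(current)
    return fields


# Illustrative sample; the field names are hypothetical
sample = """---
FieldType: Text
FieldName: example_name
FieldFlags: 0
FieldValue: Berlin
FieldJustification: Left
---
FieldType: Button
FieldName: example_checkbox
FieldFlags: 0
FieldValue: Off
FieldStateOption: Off
FieldStateOption: Yes
"""

for field in parse_dump_data_fields(sample):
    print(field["FieldName"], "=", field.get("FieldValue", ""))
```

The real output could be captured with subprocess.run(..., capture_output=True, text=True) and passed to the same function.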
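Suggestions for filling the form
If you need to fill the form afterwards, PyMuPDF is the more efficient choice. Example code that fills the fields and saves the result: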
    import fitz  # PyMuPDF


    def fill_pdf_form(pdf_path, output_path, data_dict):
        doc = fitz.open(pdf_path)
        for page in doc:
            for field in page.widgets():
                if field.field_name in data_dict:
                    field.field_value = data_dict[field.field_name]
                    # Update the field appearance so the value is displayed
                    field.update()
        doc.save(output_path)
        doc.close()


    # Usage example
    fill_data = {
        # Replace with the actual field names and their values
        "field_name_1": "Berlin Hauptbahnhof",
    }
    fill_pdf_form("fahrgastrechteformular.pdf", "filled_form.pdf", fill_data)
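One thing to watch: fill_pdf_form silently skips any key in data_dict that does not match a field name, so a misspelled name goes unnoticed. A small, library-independent sketch of a check that could run first is shown below. The helper name split_fill_data and the sample names are hypothetical. The list of available names would come from reading the form first, for example the keys of the dictionary returned by a field-reading function.

```python
def split_fill_data(available_names, fill_data):
    """Separate fill_data into entries that match a form field and keys that do not."""
    available = set(available_names)
    matched = {k: v for k, v in fill_data.items() if k in available}
    unmatched = sorted(k for k in fill_data if k not in available)
    return matched, unmatched


# Hypothetical field names; "field_nmae_2" is a deliberate typo
names = ["field_name_1", "field_name_2"]
matched, unmatched = split_fill_data(
    names, {"field_name_1": "Berlin Hauptbahnhof", "field_nmae_2": "x"}
)
print("will be filled:", matched)
print("no such field:", unmatched)
```

Reporting the unmatched keys before saving makes it easier to spot typos in the cryptic field names that generated forms tend to have.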
The question comes from Stack Exchange; it was asked by jay_pee_.




