Help needed: decryption problem when reading and auto-filling the Deutsche Bahn delay-compensation PDF form with Python

Solving the reading and encryption problems with the Deutsche Bahn compensation PDF form

Why can the browser edit the form while Python libraries raise errors?

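Many PDFs are protected with permission encryption (an owner password) rather than a document-open password. It restricts operations such as editing and printing, but the file content is still encrypted, just with an empty user password. The PDF viewers built into browsers usually decrypt such files silently and often do not enforce the permission restrictions, so interactive editing simply works. Tools such as PyPDF2 and pdftk are stricter: they must decrypt the document before they can access the form fields, even though no open password was ever set.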
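To make the idea of permission restrictions concrete, here is a minimal sketch that decodes the `/P` permission integer stored in a PDF's encryption dictionary. The bit positions follow the standard security handler table in the PDF 1.7 specification; the names in `PERMISSION_BITS`, the function `decode_permissions`, and the sample value `-3904` are illustrative assumptions, not values read from the Deutsche Bahn form.

```python
# Permission bits of the /P entry (PDF 1.7, standard security handler).
# The specification numbers bits from 1, so "bit 3" is 1 << 2, and so on.
PERMISSION_BITS = {
    "print": 1 << 2,                       # bit 3
    "modify": 1 << 3,                      # bit 4
    "copy": 1 << 4,                        # bit 5
    "annotate_or_fill_forms": 1 << 5,      # bit 6
    "fill_forms": 1 << 8,                  # bit 9
    "extract_for_accessibility": 1 << 9,   # bit 10
    "assemble": 1 << 10,                   # bit 11
    "print_high_quality": 1 << 11,         # bit 12
}


def decode_permissions(p):
    """Return {permission name: allowed?} for a signed 32-bit /P value."""
    flags = p & 0xFFFFFFFF  # reinterpret the signed integer as raw bits
    return {name: bool(flags & bit) for name, bit in PERMISSION_BITS.items()}


if __name__ == "__main__":
    # -3904 is a hypothetical /P value used only for illustration.
    for name, allowed in decode_permissions(-3904).items():
        print(f"{name:28} {'allowed' if allowed else 'denied'}")
```

A viewer that ignores these flags will let you type into the form regardless of what they say, which would explain why the browser and a stricter library behave differently on the same file.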

Solution 1: Fix the PyPDF2 code by adding a decryption step

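Your PyPDF2 code is missing a decryption step. Even though the PDF has no open password, you still need to call decrypt() with an empty string to unlock it. The modified code is below. It is written against the legacy PyPDF2 1.x API; later releases renamed PdfFileReader to PdfReader and isEncrypted to is_encrypted.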

# -*- coding: utf-8 -*-
from collections import OrderedDict
from PyPDF2 import PdfFileWriter, PdfFileReader

def _getFields(obj, tree=None, retval=None, fileobj=None):
    """
    Extracts field data if this PDF contains interactive form fields.
    The *tree* and *retval* parameters are for recursive use.
    :param fileobj: A file object (usually a text file) to write a report to on all interactive form fields found.
    :return: A dictionary where each key is a field name, and each value is a :class:`Field<PyPDF2.generic.Field>` object. By default, the mapping name is used for keys.
    :rtype: dict, or ``None`` if form data could not be located.
    """
    fieldAttributes = {'/FT': 'Field Type', '/Parent': 'Parent', '/T': 'Field Name', '/TU': 'Alternate Field Name', '/TM': 'Mapping Name', '/Ff': 'Field Flags', '/V': 'Value', '/DV': 'Default Value'}
    if retval is None:
        retval = OrderedDict()
    catalog = obj.trailer["/Root"]
    # get the AcroForm tree
    if "/AcroForm" in catalog:
        tree = catalog["/AcroForm"]
    else:
        return None
    if tree is None:
        return retval
    obj._checkKids(tree, retval, fileobj)
    for attr in fieldAttributes:
        if attr in tree:
            # Tree is a field
            obj._buildField(tree, retval, fileobj, fieldAttributes)
            break
    if "/Fields" in tree:
        fields = tree["/Fields"]
        for f in fields:
            field = f.getObject()
            obj._buildField(field, retval, fileobj, fieldAttributes)
    return retval

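def get_form_fields(infile):
    # Keep the file open while PyPDF2 reads from it, then close it
    with open(infile, 'rb') as fh:
        reader = PdfFileReader(fh)
        # Decryption step: try the empty password to unlock a permissions-only PDF
        if reader.isEncrypted:
            reader.decrypt('')
        fields = _getFields(reader)
        # _getFields returns None when the PDF has no AcroForm
        if fields is None:
            return OrderedDict()
        return OrderedDict((k, v.get('/V', '')) for k, v in fields.items())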

if __name__ == '__main__':
    from pprint import pprint
    pdf_file_name = 'fahrgastrechteformular.pdf'
    pprint(get_form_fields(pdf_file_name))
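For readers on a current installation, the same decrypt-then-read idea can be sketched with pypdf, the maintained successor of PyPDF2. This assumes pypdf is installed (`pip install pypdf`); AES-encrypted files may additionally need a crypto backend such as the cryptography package. The helper names `values_from_fields` and `get_form_fields_pypdf` are made up for this sketch.

```python
from collections import OrderedDict


def values_from_fields(fields):
    """Map each field name to its current value ('/V'), or '' if unset.

    Accepts None (no form found) and returns an empty mapping in that case.
    """
    if not fields:
        return OrderedDict()
    return OrderedDict((name, field.get('/V', '')) for name, field in fields.items())


def get_form_fields_pypdf(pdf_path):
    # Imported lazily so values_from_fields() can be used without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    # A permissions-only PDF is encrypted with an empty user password.
    if reader.is_encrypted:
        reader.decrypt('')
    # get_fields() returns None when the document has no interactive form.
    return values_from_fields(reader.get_fields())


# Usage (assumes the form has been downloaded next to this script):
#     from pprint import pprint
#     pprint(get_form_fields_pypdf('fahrgastrechteformular.pdf'))
```

Because pypdf exposes `get_fields()` directly, the private-method workaround in `_getFields` should not be needed in this variant.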

Solution 2: Use a friendlier PDF library, PyMuPDF (fitz)

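PyMuPDF tends to handle permission-encrypted PDFs more smoothly: a file with no open password can normally be opened and its form fields read without an explicit decryption call. Install the library first: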

pip install pymupdf

Example code for reading the form fields:

import fitz

def get_form_fields_pymupdf(pdf_path):
    doc = fitz.open(pdf_path)
    form_fields = {}
    # Iterate over the form fields (widgets) on every page
    for page in doc:
        for field in page.widgets():
            # Record the field name and its current value
            form_fields[field.field_name] = field.field_value
    doc.close()
    return form_fields

if __name__ == '__main__':
    from pprint import pprint
    pdf_file_name = 'fahrgastrechteformular.pdf'
    pprint(get_form_fields_pymupdf(pdf_file_name))

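This approach not only reads the field values but also makes filling the form straightforward later: set field.field_value, call field.update(), and then call doc.save().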

About the pdftk problem

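You mistyped the operation name in your pdftk command: dum_data_fields should be dump_data_fields. Even with the typo fixed, pdftk may still refuse a permission-encrypted PDF. The suggestion here is to append allow All. Note that pdftk documents allow as an option controlling permissions on encrypted output, so it may not be sufficient on its own, and pdftk may still ask for the owner password via input_pw. The suggested command is: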

pdftk.exe fahrgastrechteformular.pdf dump_data_fields allow All

That said, a Python library is the better fit for an automated workflow.
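If pdftk does manage to read the file, its output can still be fed into a Python workflow. The sketch below parses the text that `dump_data_fields` prints, where records are separated by `---` lines and each line has the form `Key: value`. The function names and the sample field names in the usage note are hypothetical, and multi-line field values are not handled.

```python
import subprocess


def parse_dump_data_fields(text):
    """Parse `pdftk ... dump_data_fields` output into a list of dicts."""
    fields, current = [], {}
    for line in text.splitlines():
        if line.strip() == "---":
            # A separator starts a new record; store the previous one if any.
            if current:
                fields.append(current)
            current = {}
        elif ":" in line:
            key, _, value = line.partition(":")
            key, value = key.strip(), value.strip()
            if key == "FieldStateOption":
                # Checkboxes and radio buttons list one line per option.
                current.setdefault(key, []).append(value)
            else:
                current[key] = value
    if current:
        fields.append(current)
    return fields


def dump_fields(pdf_path, pdftk="pdftk"):
    """Run pdftk (assumed to be on PATH) and return the parsed field records."""
    result = subprocess.run(
        [pdftk, pdf_path, "dump_data_fields"],
        capture_output=True, text=True, check=True,
    )
    return parse_dump_data_fields(result.stdout)
```

For example, `parse_dump_data_fields("---\nFieldType: Text\nFieldName: S1F1\nFieldValue: Berlin\n")` yields one record whose `FieldName` is `S1F1`.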

Suggestions for filling the form later

If you need to fill the form later, PyMuPDF is the more efficient choice. Example code (fill the fields and save):

import fitz

def fill_pdf_form(pdf_path, output_path, data_dict):
    doc = fitz.open(pdf_path)
    for page in doc:
        for field in page.widgets():
            if field.field_name in data_dict:
                field.field_value = data_dict[field.field_name]
                # Update the field appearance so the content renders correctly
                field.update()
    doc.save(output_path)
    doc.close()

# Usage example
fill_data = {
    "field_name_1": "Berlin Hauptbahnhof",
    # Replace with the actual field names and their values
}
fill_pdf_form("fahrgastrechteformular.pdf", "filled_form.pdf", fill_data)
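Since the keys in the fill dictionary have to match the form's real field names exactly, a small sanity check before filling can catch typos. The following is a minimal sketch; `check_fill_keys` is a hypothetical helper, and the field names would come from whichever reader function you use to list the form fields.

```python
def check_fill_keys(data_dict, field_names):
    """Compare the keys to be filled against the field names found in the PDF.

    Returns (unknown, unfilled):
      unknown  - keys in data_dict that match no form field (likely typos)
      unfilled - form fields for which data_dict provides no value
    """
    known = set(field_names)
    unknown = sorted(key for key in data_dict if key not in known)
    unfilled = sorted(known - set(data_dict))
    return unknown, unfilled


if __name__ == "__main__":
    # Hypothetical field names, for illustration only.
    names = ["S1F1", "S1F2", "S1F3"]
    unknown, unfilled = check_fill_keys({"S1F1": "Berlin Hauptbahnhof", "S1F9": "x"}, names)
    print("unknown keys:", unknown)
    print("fields left empty:", unfilled)
```

Running this check first means a misspelled key shows up as a warning instead of silently leaving a field blank in the saved PDF.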

The question in this post comes from Stack Exchange; asked by jay_pee_.
