
Looking for a local open-source tool to convert scanned PDF forms containing handwriting and checkboxes into JSON

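A local open-source solution for converting scanned PDF forms to structured JSON

Choosing the core technology stack

For scanned PDFs with no fixed layout that mix handwritten text and checkboxes, a pipeline of OCR plus layout-aware parsing is a reasonable first choice, since it balances accuracy against the requirement to run locally:

  • PDF to image: Poppler (open-source, runs locally)
  • OCR and element recognition: LayoutLMv3 (open-source multimodal model, runs on CPU or an 8 GB GPU; it labels words supplied by an OCR engine rather than reading the pixels itself) or Tesseract 5 + OpenCV (lightweight CPU-only option)
  • Structured conversion: custom Python code for field mapping and JSON generation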

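Step 1: Convert the scanned PDF to images

Use Poppler's pdftoppm tool to render each PDF page as a PNG image that the OCR stage can read:

# Run after installing Poppler; writes one PNG per page of input.pdf: output-1.png, output-2.png, ...
# (page numbers are zero-padded when the document has ten or more pages, e.g. output-01.png)
pdftoppm -png -r 300 input.pdf output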
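
When the conversion has to run from a script, the same command can be driven from Python. The following is a minimal sketch, assuming pdftoppm is on the PATH; the helper names build_pdftoppm_cmd, page_number and convert_pdf are illustrative and not part of Poppler. Sorting by the parsed page number avoids depending on how the filenames are padded.

```python
import re
import subprocess
from pathlib import Path


def build_pdftoppm_cmd(pdf_path, prefix, dpi=300):
    """Return the argv list for: pdftoppm -png -r <dpi> <pdf> <prefix>."""
    return ["pdftoppm", "-png", "-r", str(dpi), str(pdf_path), str(prefix)]


def page_number(path):
    """Extract the trailing page number from names like output-3.png or output-03.png."""
    match = re.search(r"-(\d+)\.png$", Path(path).name)
    if match is None:
        raise ValueError(f"not a pdftoppm page image: {path}")
    return int(match.group(1))


def convert_pdf(pdf_path, prefix="output", dpi=300):
    """Run pdftoppm and return the generated PNG paths in page order."""
    subprocess.run(build_pdftoppm_cmd(pdf_path, prefix, dpi), check=True)
    prefix_path = Path(prefix)
    pages = prefix_path.parent.glob(f"{prefix_path.name}-*.png")
    return sorted(pages, key=page_number)
```

Calling convert_pdf("input.pdf") would then give the list of page images to feed into step 2, one page at a time.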

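Step 2: OCR and form element recognition

Option A: LayoutLMv3 (recommended; suits forms without a fixed layout)

A multimodal model labels field names, text values and checkbox states without hand-written layout rules, provided it has been fine-tuned with labels for those classes. The base version runs within 8 GB of GPU memory: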

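from transformers import LayoutLMv3ForTokenClassification, LayoutLMv3Processor
import torch
from PIL import Image

# Load a form-recognition checkpoint. The name below is kept from the original answer
# and may not exist on the Hugging Face Hub; in practice, point it at a checkpoint you
# fine-tuned on your own forms (for example starting from "microsoft/layoutlmv3-base").
MODEL_NAME = "microsoft/layoutlmv3-base-finetuned-form"
model = LayoutLMv3ForTokenClassification.from_pretrained(MODEL_NAME)
# With the default apply_ocr=True the processor runs Tesseract (via pytesseract)
# to obtain words and bounding boxes, so Tesseract must be installed.
processor = LayoutLMv3Processor.from_pretrained(MODEL_NAME)

# Load the page image
image = Image.open("output-1.png").convert("RGB")
encoding = processor(image, return_tensors="pt", truncation=True)

# Use the GPU when an NVIDIA card is available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()
encoding = {k: v.to(device) for k, v in encoding.items()}

# Inference (no gradients needed)
with torch.no_grad():
    outputs = model(**encoding)
predictions = outputs.logits.argmax(-1)[0].tolist()

# Group consecutive tokens that share a label, so that a multi-token field
# is decoded as one string instead of one sub-word piece at a time
special_ids = set(processor.tokenizer.all_special_ids)
spans = []  # each entry: [label, [token ids]]
for token_id, pred in zip(encoding["input_ids"][0].tolist(), predictions):
    if token_id in special_ids:
        continue
    label = model.config.id2label[pred]
    if spans and spans[-1][0] == label:
        spans[-1][1].append(token_id)
    else:
        spans.append([label, [token_id]])

# Map the labeled spans to field name, text value and checkbox state.
# The label names below assume the label set the checkpoint was fine-tuned with.
results = {}
current_key = ""
for label, token_ids in spans:
    text = processor.tokenizer.decode(token_ids).strip()
    if label == "QUESTION" and text:
        current_key = text
    elif label == "ANSWER_TEXT" and current_key and text:
        results[current_key] = text
    elif label == "CHECKBOX_SELECTED" and current_key:
        results[current_key] = True
    elif label == "CHECKBOX_UNSELECTED" and current_key:
        results[current_key] = False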
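
Checkpoints fine-tuned on FUNSD-style data usually emit BIO tags such as B-QUESTION / I-QUESTION / B-ANSWER rather than the plain labels used above. The sketch below shows one way to fold such tags into question/answer pairs. It works on plain (text, label) pairs, so it can be tried without loading a model; the function names and the ANSWER label are assumptions for illustration.

```python
def strip_bio(label):
    """Map 'B-QUESTION' / 'I-QUESTION' to 'QUESTION'; leave other labels unchanged."""
    if label[:2] in ("B-", "I-"):
        return label[2:]
    return label


def group_words(words):
    """Merge consecutive (text, label) pairs into (text, label) spans.

    A new span starts at a 'B-' tag or whenever the stripped label changes.
    """
    spans = []
    for text, label in words:
        base = strip_bio(label)
        starts_new = label.startswith("B-") or not spans or spans[-1][1] != base
        if starts_new:
            spans.append([text, base])
        else:
            spans[-1][0] += " " + text
    return [tuple(span) for span in spans]


def pair_questions_with_answers(spans):
    """Attach each ANSWER span to the most recent QUESTION span."""
    results, current_key = {}, None
    for text, label in spans:
        if label == "QUESTION":
            current_key = text
        elif label == "ANSWER" and current_key is not None:
            results[current_key] = text
    return results
```

Spans labeled O (outside any field) pass through group_words but are ignored when pairing, which keeps stray header or footer text out of the result.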

Option B: Tesseract 5 + OpenCV (lightweight CPU option)

Suited to resource-constrained setups: Tesseract extracts the text and OpenCV detects whether each checkbox is filled:

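import cv2
import pytesseract
import json
from collections import defaultdict

# Preprocess: grayscale, then inverted Otsu binarization (ink becomes white) for contour detection
image = cv2.imread("output-1.png")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

# Extract all words with their positions. Tesseract 4+ expects dark text on a light
# background, so pass the grayscale image rather than the inverted one.
# For non-English forms, pass the matching language pack, e.g. lang="chi_sim".
ocr_data = pytesseract.image_to_data(gray, output_type=pytesseract.Output.DICT)

# Detect checkbox outlines (adjust the size limits to the boxes on your scans).
# RETR_EXTERNAL only returns outermost contours, so boxes inside a drawn frame are missed.
FILL_THRESHOLD = 0.3  # tune on sample scans; lower it if ticks are thin
contours, _ = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
checkboxes = []
for cnt in contours:
    x, y, w, h = cv2.boundingRect(cnt)
    if 20 < w < 50 and 20 < h < 50:
        # Measure only the interior so the printed border is not counted as a mark
        m = max(2, w // 6)
        roi = thresh[y + m:y + h - m, x + m:x + w - m]
        is_selected = cv2.countNonZero(roi) > roi.size * FILL_THRESHOLD
        checkboxes.append({"x": x, "y": y, "w": w, "h": h, "selected": is_selected})

# image_to_data returns one entry per word; regroup the words into text lines
lines = defaultdict(list)
for i, word in enumerate(ocr_data["text"]):
    if word.strip():
        key = (ocr_data["block_num"][i], ocr_data["par_num"][i], ocr_data["line_num"][i])
        lines[key].append(i)

# Associate text with checkboxes: match each question to the box at the nearest height
results = {}
previous_question = ""
for idxs in lines.values():
    text = " ".join(ocr_data["text"][i].strip() for i in idxs)
    if "?" in text or "？" in text:
        previous_question = text
        question_y = ocr_data["top"][idxs[0]]
        if checkboxes:  # min() raises ValueError on an empty list
            closest_cb = min(checkboxes, key=lambda cb: abs(cb["y"] - question_y))
            results[text] = closest_cb["selected"]
    elif previous_question:
        # Treat the first line after a question as its handwritten answer
        results[previous_question] = text
        previous_question = ""

# Save as JSON
with open("output.json", "w", encoding="utf-8") as f:
    json.dump(results, f, ensure_ascii=False, indent=2)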
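
Matching each question to the nearest box by height alone pairs it with a checkbox even when that box is far away on the page. A small helper with a distance cutoff is one way to guard against that. This is a sketch only: it assumes each box dict also records w and h, and the name nearest_checkbox and the 25-pixel default are illustrative values to tune against the scan resolution.

```python
def center_y(box):
    """Vertical center of a box given as a dict with pixel keys y and h."""
    return box["y"] + box["h"] / 2


def nearest_checkbox(line_box, checkboxes, max_dy=25):
    """Return the checkbox on the same row as the text line, or None.

    line_box and each checkbox are dicts with pixel keys x, y, w, h.
    A checkbox qualifies only if its vertical center lies within max_dy
    pixels of the line's center; ties are broken by horizontal distance.
    """
    line_cy = center_y(line_box)
    candidates = [cb for cb in checkboxes if abs(center_y(cb) - line_cy) <= max_dy]
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda cb: (abs(center_y(cb) - line_cy), abs(cb["x"] - line_box["x"])),
    )
```

Returning None for an unmatched question lets the caller leave the field unset instead of recording the state of an unrelated checkbox.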

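Step 3: Unified field mapping and JSON normalization

Define a fixed field template up front so that every output JSON uses exactly the same key names: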

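import json
import re

# Fixed field template required by the business logic
FIELD_TEMPLATE = {
    "name": "",
    "age": "",
    "agrees_to_terms_of_service": False,
    "has_allergy_history": False
}

# Map the recognized results (the dict built in step 2) onto the template.
# A fuzzy-matching library such as fuzzywuzzy (now published as thefuzz) can make this more tolerant.
standardized_results = FIELD_TEMPLATE.copy()
for key, value in results.items():
    key_lower = key.lower()
    if "name" in key_lower:
        standardized_results["name"] = value
    elif "terms" in key_lower:
        standardized_results["agrees_to_terms_of_service"] = value
    elif "allerg" in key_lower:
        standardized_results["has_allergy_history"] = value
    elif re.search(r"\bage\b", key_lower):
        # Whole-word match, so that "page" or "language" does not count as "age"
        standardized_results["age"] = value

# Save the normalized JSON
with open("standardized_output.json", "w", encoding="utf-8") as f:
    json.dump(standardized_results, f, ensure_ascii=False, indent=2)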
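
Substring checks break as soon as OCR misreads a character in a label. As a sketch of the fuzzy-matching idea, the standard-library difflib module can stand in for a dedicated library. The alias table, the 0.6 cutoff and the function names below are assumptions for illustration and would need tuning on real forms.

```python
import difflib

# Hypothetical aliases: canonical field name -> label variants seen on forms
FIELD_ALIASES = {
    "name": ["name", "full name", "applicant name"],
    "age": ["age", "age in years"],
    "agrees_to_terms": ["do you agree to the terms of service?"],
    "has_allergy_history": ["do you have a history of allergies?"],
}


def match_field(ocr_key, aliases=FIELD_ALIASES, cutoff=0.6):
    """Return the canonical field whose alias is most similar to ocr_key, or None."""
    lookup = {alias: field for field, names in aliases.items() for alias in names}
    hits = difflib.get_close_matches(
        ocr_key.strip().lower(), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[hits[0]] if hits else None


def standardize(results, template):
    """Copy recognized values into a fresh copy of the template, skipping unknown keys."""
    out = dict(template)
    for key, value in results.items():
        field = match_field(key)
        if field in out:
            out[field] = value
    return out
```

Keys that match no alias are dropped rather than guessed, so the output always has exactly the template's key set.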

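Performance tips

  • CPU mode: prefer Tesseract + OpenCV. LayoutLMv3 also runs on CPU; use batch_size=1 to keep memory use low and expect inference to be slow.
  • GPU mode: with 8 GB of GPU memory the LayoutLMv3 base model can start at batch_size=2; PyTorch CUDA acceleration makes processing much faster than on CPU.
  • Handwriting: if handwritten text is recognized poorly, pair the pipeline with an open-source handwriting OCR model (for example a CRNN-based one), or include handwritten form data when fine-tuning LayoutLMv3.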
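
If a separate handwriting OCR model supplies the words, its pixel boxes have to be rescaled before they reach LayoutLMv3, because LayoutLM-family models expect box coordinates normalized to the 0-1000 range. A minimal sketch of that conversion follows; the function names and the (text, box) input shape are assumptions about what the external OCR step returns.

```python
def normalize_box(box, page_width, page_height):
    """Scale a pixel box (x0, y0, x1, y1) to the 0-1000 range LayoutLM-style models expect."""
    x0, y0, x1, y1 = box
    scaled = [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]
    # Clamp in case the OCR engine reports coordinates slightly outside the page
    return [min(1000, max(0, v)) for v in scaled]


def to_model_inputs(ocr_words, page_width, page_height):
    """Split [(text, (x0, y0, x1, y1)), ...] into parallel word and box lists."""
    words = [text for text, _ in ocr_words]
    boxes = [normalize_box(box, page_width, page_height) for _, box in ocr_words]
    return words, boxes
```

The resulting lists would then be passed to a processor created with apply_ocr=False, along the lines of processor(image, words, boxes=boxes, return_tensors="pt"), so that the built-in Tesseract pass is skipped.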

The question originates from Stack Exchange; it was asked by tobiasBora.
