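Looking for a local open-source tool: converting scanned PDF forms with handwriting and checkboxes to JSON
A local, open-source approach for turning scanned form PDFs into structured JSON
Choosing the core stack
For scanned PDFs that have no fixed layout and contain both handwritten text and checkboxes, a multimodal OCR plus structured-parsing pipeline is the preferred approach, since it balances accuracy against the requirement to run everything locally:
- PDF to image: Poppler (open-source, runs locally)
- OCR and element recognition: LayoutLMv3 (an openly released multimodal document model that runs on CPU or an 8 GB GPU; it labels words supplied by an OCR engine, and its Hugging Face processor calls Tesseract by default; check its license terms before commercial use) or Tesseract 5 + OpenCV (lightweight CPU-only option)
- Structuring: custom Python code for field mapping and JSON generation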
Step 1: Convert the scanned PDF to images
Use Poppler's pdftoppm tool to render each PDF page as a PNG image that the OCR stage can read:
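```shell
# After installing Poppler: render each page of input.pdf as output-1.png, output-2.png, ...
# (page numbers are zero-padded when the PDF has 10 or more pages, e.g. output-01.png)
pdftoppm -png -r 300 input.pdf output
```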
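When the conversion has to be scripted, the same command can be driven from Python. The following is a minimal sketch that assumes pdftoppm is on the PATH; the helper names `build_pdftoppm_cmd`, `page_number`, and `convert_pdf` are illustrative and not part of Poppler.

```python
import re
import subprocess
from pathlib import Path


def build_pdftoppm_cmd(pdf_path, prefix, dpi=300):
    """Build the pdftoppm command shown above: PNG output at the given DPI."""
    return ["pdftoppm", "-png", "-r", str(dpi), str(pdf_path), str(prefix)]


def page_number(path):
    """Extract the trailing page number from names like output-1.png or output-01.png."""
    match = re.search(r"-(\d+)\.png$", Path(path).name)
    return int(match.group(1)) if match else -1


def convert_pdf(pdf_path, prefix, dpi=300):
    """Run pdftoppm and return the generated page images in page order."""
    subprocess.run(build_pdftoppm_cmd(pdf_path, prefix, dpi), check=True)
    prefix = Path(prefix)
    pages = prefix.parent.glob(prefix.name + "-*.png")
    # Sort numerically so that page 10 does not come before page 2
    return sorted(pages, key=page_number)
```

Calling `convert_pdf("input.pdf", "output")` would then return the list of page images to feed into step 2.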
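Step 2: OCR and form element recognition
Option A: LayoutLMv3 (recommended; handles forms without a fixed layout)
A multimodal model labels field names, text values, and checkbox states directly, without layout-specific rules. The base model fits in 8 GB of GPU memory. Checkbox states are only recognized if the checkpoint was fine-tuned with labels for them: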
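```python
import torch
from PIL import Image
from transformers import LayoutLMv3ForTokenClassification, LayoutLMv3Processor

# Checkpoint name as given in this answer; it has not been verified to exist on the Hub.
# Substitute a checkpoint fine-tuned on your own forms. The label names used below
# (QUESTION, ANSWER_TEXT, CHECKBOX_SELECTED, CHECKBOX_UNSELECTED) assume such a custom fine-tune.
MODEL_NAME = "microsoft/layoutlmv3-base-finetuned-form"
model = LayoutLMv3ForTokenClassification.from_pretrained(MODEL_NAME)
# With the default apply_ocr=True the processor runs Tesseract (via pytesseract) to get words and boxes
processor = LayoutLMv3Processor.from_pretrained(MODEL_NAME)

# Use the GPU when an NVIDIA card is available
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device).eval()

# Load the page image and encode it (truncated to the model's maximum sequence length)
image = Image.open("output-1.png").convert("RGB")
encoding = processor(image, return_tensors="pt", truncation=True).to(device)

# Inference
with torch.no_grad():
    outputs = model(**encoding)
predictions = outputs.logits.argmax(-1).squeeze(0).tolist()
input_ids = encoding["input_ids"][0].tolist()

# Parse the output: collect consecutive tokens per label into field name, text value, checkbox state
decode = processor.tokenizer.decode
special_ids = set(processor.tokenizer.all_special_ids)
results = {}
current_key, key_ids, answer_ids, prev_label = "", [], [], None
for token_id, pred in zip(input_ids, predictions):
    if token_id in special_ids:
        continue
    label = model.config.id2label[pred]
    if label == "QUESTION":
        if prev_label != "QUESTION":  # a new question starts: store the pending answer
            if answer_ids and current_key:
                results[current_key] = decode(answer_ids).strip()
            key_ids, answer_ids = [], []
        key_ids.append(token_id)
        current_key = decode(key_ids).strip()
    elif label == "ANSWER_TEXT" and current_key:
        answer_ids.append(token_id)
    elif label in ("CHECKBOX_SELECTED", "CHECKBOX_UNSELECTED") and current_key:
        results[current_key] = label == "CHECKBOX_SELECTED"
    prev_label = label
if answer_ids and current_key:
    results[current_key] = decode(answer_ids).strip()
```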
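By default `LayoutLMv3Processor` runs Tesseract internally to obtain words and boxes. If the words come from a different OCR engine instead (for example a handwriting model), the processor has to be created with `apply_ocr=False`, and each bounding box must be scaled to the 0-1000 coordinate range the model expects. A minimal sketch of that scaling follows; `normalize_box` is an illustrative helper and the sample words and boxes are made up.

```python
def normalize_box(box, page_width, page_height):
    """Scale a pixel box (x0, y0, x1, y1) to the 0-1000 range LayoutLMv3 expects."""
    x0, y0, x1, y1 = box
    scaled = [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]
    # Clamp so OCR boxes that spill slightly off the page stay in the valid range
    return [min(1000, max(0, value)) for value in scaled]


# Hypothetical OCR output for a 2480 x 3508 px page (A4 at 300 DPI)
words = ["Name", "Alice"]
pixel_boxes = [(248, 351, 496, 421), (620, 351, 992, 421)]
boxes = [normalize_box(box, 2480, 3508) for box in pixel_boxes]
print(boxes)
# These would then be passed as: processor(image, words, boxes=boxes, return_tensors="pt")
```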
Option B: Tesseract 5 + OpenCV (lightweight CPU option)
Suited to resource-constrained setups: Tesseract extracts the text, and OpenCV detects whether each checkbox is filled:
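```python
import json

import cv2
import pytesseract

# Preprocess: the inverted binary image (ink = white) is used for contour detection only;
# Tesseract gets the grayscale page, since it expects dark text on a light background
image = cv2.imread("output-1.png")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

# Word-level text and positions (pass lang=... for non-English forms)
ocr_data = pytesseract.image_to_data(gray, output_type=pytesseract.Output.DICT)

# Group words into text lines so that a whole question becomes one key
lines = {}
for i, word in enumerate(ocr_data["text"]):
    if word.strip():
        line_id = (ocr_data["block_num"][i], ocr_data["par_num"][i], ocr_data["line_num"][i])
        line = lines.setdefault(line_id, {"words": [], "top": ocr_data["top"][i]})
        line["words"].append(word.strip())

# Detect checkbox candidates by size (tune the 20-50 px range to your scan resolution;
# large glyphs can fall into the same range)
contours, _ = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
checkboxes = []
for cnt in contours:
    x, y, w, h = cv2.boundingRect(cnt)
    if 20 < w < 50 and 20 < h < 50:
        # Measure ink inside the box only, so the printed border does not count as a mark
        m = max(2, w // 6)
        roi = thresh[y + m:y + h - m, x + m:x + w - m]
        fill_ratio = cv2.countNonZero(roi) / roi.size
        # 0.3 is a starting value; tune it on your own scans
        checkboxes.append({"x": x, "y": y, "selected": fill_ratio > 0.3})

# Associate text with checkboxes by position
results = {}
previous_question = ""
for line_id in sorted(lines):
    line = lines[line_id]
    text = " ".join(line["words"])
    if "?" in text or "？" in text:
        previous_question = text
        if checkboxes:  # pick the checkbox closest to the question vertically
            closest_cb = min(checkboxes, key=lambda cb: abs(cb["y"] - line["top"]))
            results[text] = closest_cb["selected"]
    elif previous_question:
        # A non-question line is treated as the (possibly handwritten) answer text
        results[previous_question] = text

# Save as JSON
with open("output.json", "w", encoding="utf-8") as f:
    json.dump(results, f, ensure_ascii=False, indent=2)
```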
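Picking the vertically nearest checkbox for every question can pair a question with a box from another row when a checkbox was not detected. One possible refinement, sketched here in plain Python, accepts a match only within a maximum vertical distance and uses each checkbox at most once. The function name `match_checkboxes`, the `max_dy` tolerance of 40 px, and the sample data are assumptions for illustration.

```python
def match_checkboxes(questions, checkboxes, max_dy=40):
    """Pair each question with the nearest unused checkbox on roughly the same row.

    questions:  list of {"text": str, "y": int}
    checkboxes: list of {"x": int, "y": int, "selected": bool}
    Returns {question text: bool, or None when no checkbox is close enough}.
    """
    unused = list(checkboxes)
    matches = {}
    for question in sorted(questions, key=lambda q: q["y"]):
        nearest = min(unused, key=lambda cb: abs(cb["y"] - question["y"]), default=None)
        if nearest is not None and abs(nearest["y"] - question["y"]) <= max_dy:
            matches[question["text"]] = nearest["selected"]
            unused.remove(nearest)
        else:
            matches[question["text"]] = None  # leave undecided instead of guessing
    return matches


questions = [
    {"text": "Do you agree to the terms?", "y": 400},
    {"text": "Any known allergies?", "y": 520},
    {"text": "Do you smoke?", "y": 900},
]
checkboxes = [
    {"x": 1800, "y": 405, "selected": True},
    {"x": 1800, "y": 515, "selected": False},
]
matched = match_checkboxes(questions, checkboxes)
print(matched)
```

Returning `None` for unmatched questions keeps missed detections visible in the JSON, so they can be reviewed by hand rather than silently recorded as a wrong answer.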
Step 3: Field mapping and JSON normalization
Define a fixed field template up front so that the output JSON always uses the same key names:
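```python
import json

# Fixed field template required by the business logic
FIELD_TEMPLATE = {
    "Name": "",
    "Age": "",
    "Agrees to terms of service": False,
    "Has allergy history": False,
}

# Map the recognition results from step 2 (`results`) onto the template.
# A fuzzy-matching library such as fuzzywuzzy can make the matching more tolerant.
standardized_results = FIELD_TEMPLATE.copy()
for key, value in results.items():
    label = key.lower()
    if "name" in label:
        standardized_results["Name"] = value
    elif "terms" in label:
        standardized_results["Agrees to terms of service"] = value
    elif "allergy" in label:
        standardized_results["Has allergy history"] = value
    elif "age" in label:
        standardized_results["Age"] = value

# Save the normalized JSON
with open("standardized_output.json", "w", encoding="utf-8") as f:
    json.dump(standardized_results, f, ensure_ascii=False, indent=2)
```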
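Substring checks break as soon as OCR misreads a character in the label. As a dependency-free stand-in for fuzzywuzzy, the standard library's `difflib` can score string similarity. The sketch below assumes English labels and a similarity cutoff of 0.6, both chosen for illustration; `map_to_template` is a hypothetical helper.

```python
import difflib

FIELD_TEMPLATE = {
    "Name": "",
    "Age": "",
    "Agrees to terms of service": False,
    "Has allergy history": False,
}


def map_to_template(results, template, cutoff=0.6):
    """Copy recognized values onto the template key whose name is most similar."""
    standardized = dict(template)
    lowered = {name.lower(): name for name in template}
    for raw_key, value in results.items():
        candidates = difflib.get_close_matches(
            raw_key.lower().strip(" :?"), list(lowered), n=1, cutoff=cutoff
        )
        if candidates:  # keys below the cutoff are ignored rather than guessed
            standardized[lowered[candidates[0]]] = value
    return standardized


# Hypothetical OCR output with typical misreads ("m" read as "rn")
ocr_results = {"Narne:": "Alice", "Agrees to terrns of service?": True, "Signature": "x"}
mapped = map_to_template(ocr_results, FIELD_TEMPLATE)
print(mapped)
```

A higher cutoff reduces wrong assignments at the cost of dropping more misread labels, so it is worth tuning against a few real scans.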
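Performance tips
- CPU mode: prefer Tesseract + OpenCV. LayoutLMv3 inference on CPU works with batch_size=1, but it is slow.
- GPU mode: with 8 GB of GPU memory the LayoutLMv3 base model can run with batch_size=2, and PyTorch CUDA acceleration speeds up processing considerably.
- Handwriting: if handwriting recognition is poor, add an open-source handwriting OCR model such as a CRNN, or include handwritten form data when fine-tuning LayoutLMv3.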
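The batch sizes mentioned above can be applied with a small helper that groups page images before inference. This is only a sketch: `pick_batch_size` and `batched` are illustrative names, and the sizes 1 and 2 are the ones suggested in the list above, not measured limits.

```python
def pick_batch_size(has_gpu):
    """Batch sizes suggested above: 2 on an 8 GB GPU, 1 on CPU."""
    return 2 if has_gpu else 1


def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


pages = ["output-1.png", "output-2.png", "output-3.png"]
for batch in batched(pages, pick_batch_size(has_gpu=True)):
    # In the real pipeline each batch would be opened with PIL and passed to the processor and model
    print(batch)
```

In practice `has_gpu` would come from `torch.cuda.is_available()`.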
The question comes from Stack Exchange and was asked by tobiasBora.




