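How can I extract PDF tables with the Docling library and convert them to structured HTML that preserves the original formatting?
Question
I am using the Docling library to extract the contents of a PDF file and convert them to HTML so that the structure and formatting of the original document are preserved. This is the code I am using: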
    from docling.document_converter import DocumentConverter, PdfFormatOption
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.datamodel.base_models import InputFormat
    from docling_core.types.doc import ImageRefMode
    from pathlib import Path
    import logging

    # Set up logging
    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger(__name__)

    # Configure image settings
    IMAGE_RESOLUTION_SCALE = 2.0

    # Path to your PDF file
    source = Path(r"C:\Users\Downloads\Journal.pdf")
    output_path = Path(r"C:\Users\Desktop\output20.html")

    # Configure pipeline options for image handling
    pipeline_options = PdfPipelineOptions()
    pipeline_options.images_scale = IMAGE_RESOLUTION_SCALE
    pipeline_options.generate_page_images = True
    pipeline_options.generate_picture_images = True

    # Create converter instance with image options
    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )

    # Convert PDF to document
    result = converter.convert(source)

    # Save HTML with embedded images
    result.document.save_as_html(output_path, image_mode=ImageRefMode.EMBEDDED)
    log.info(f"HTML file with embedded images created at: {output_path}")
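The converted HTML keeps the same content order as the original PDF, but the table structure has the following problems:
- Tables are not recognized correctly: table data is rendered in <p> tags instead of the standard <table>, <tr>, <td> structure.
- Table content is misaligned: cell content is split incorrectly, and images inside cells are moved outside the table.
- Complex tables with embedded images are not preserved correctly.
What I expect: tables extracted from the PDF keep their row and column structure and are rendered with standard HTML table tags, while the original formatting, alignment, and cell content are preserved.
What I have tried: inspecting the HTML output, adjusting the PdfPipelineOptions parameters, and comparing against the Document Intelligence library (which extracts headers and footers better but still handles complex tables poorly).
How can I use the Docling library to extract PDF tables correctly and convert them to structured HTML that preserves the original layout and formatting?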
Solution
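1. Configure table structure recognition
Table extraction in Docling is controlled by do_table_structure on PdfPipelineOptions together with the nested table_structure_options. Structure recognition is enabled by default, so the settings that usually matter are the TableFormer mode and cell matching. The default mode may differ between releases, so it is safer to set it explicitly:

    from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode

    pipeline_options = PdfPipelineOptions()
    # Run the table structure model (on by default; set explicitly for clarity)
    pipeline_options.do_table_structure = True
    # Use the more accurate, slower TableFormer variant
    pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE
    # Match predicted cells to the PDF's own text cells;
    # try False if cell text ends up split or merged incorrectly
    pipeline_options.table_structure_options.do_cell_matching = True
    # Existing image settings
    pipeline_options.images_scale = IMAGE_RESOLUTION_SCALE
    pipeline_options.generate_page_images = True
    pipeline_options.generate_picture_images = True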
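2. Save the HTML
save_as_html does not need a separate flag for tables: every item the pipeline recognized as a table is written with <table>, <tr>, and <td> tags. If a table still shows up as <p> paragraphs, it was not detected as a table during conversion, so the fix belongs in the pipeline options rather than in the export call:

    result.document.save_as_html(output_path, image_mode=ImageRefMode.EMBEDDED)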
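To see what Docling actually recognized as a table, each table item can be exported on its own. The sketch below assumes the converted document exposes a `tables` list whose items provide `export_to_html(doc=...)`, which is how I understand the docling-core API. The helper name `dump_tables` and the `table-N.html` file naming are hypothetical, chosen for illustration only. Because the function only relies on those two members, it can be tried with stand-in objects as well.

```python
from pathlib import Path


def dump_tables(document, out_dir):
    """Write every recognized table to its own HTML file and return the paths.

    `document` is assumed to expose `.tables`, and each table is assumed to
    provide `export_to_html(doc=...)` (assumption about the docling-core API).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for index, table in enumerate(document.tables, start=1):
        # Passing the document lets the table resolve references to its content.
        html = table.export_to_html(doc=document)
        target = out_dir / f"table-{index}.html"  # hypothetical naming scheme
        target.write_text(html, encoding="utf-8")
        written.append(target)
    return written
```

With the converter from the question this would be called as `dump_tables(result.document, output_path.parent / "tables")`. An empty result means no table was detected at all, which points at the layout and table-structure stage rather than at the HTML export.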
3. Complete adjusted code

    from docling.document_converter import DocumentConverter, PdfFormatOption
    from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode
    from docling.datamodel.base_models import InputFormat
    from docling_core.types.doc import ImageRefMode
    from pathlib import Path
    import logging

    # Set up logging
    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger(__name__)

    # Configure image settings (higher-resolution page and picture images)
    IMAGE_RESOLUTION_SCALE = 2.5

    # Path to your PDF file
    source = Path(r"C:\Users\Downloads\Journal.pdf")
    output_path = Path(r"C:\Users\Desktop\output20.html")

    # Configure pipeline options for table and image handling
    pipeline_options = PdfPipelineOptions()

    # Table settings
    pipeline_options.do_table_structure = True
    pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE
    pipeline_options.table_structure_options.do_cell_matching = True

    # Image settings
    pipeline_options.images_scale = IMAGE_RESOLUTION_SCALE
    pipeline_options.generate_page_images = True
    pipeline_options.generate_picture_images = True

    # Create converter instance with updated options
    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )

    # Convert PDF to document
    result = converter.convert(source)

    # Save HTML with embedded images; recognized tables are written as <table> markup
    result.document.save_as_html(output_path, image_mode=ImageRefMode.EMBEDDED)
    log.info(f"HTML file with structured tables created at: {output_path}")
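After running the conversion, it helps to confirm mechanically whether the saved file contains table markup or only paragraphs. The following is a minimal sketch that uses only the Python standard library; the names `TagCounter` and `count_tags` are made up for this example and are not part of Docling.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count opening tags so table markup can be compared with paragraphs."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        self.counts[tag] += 1


def count_tags(html_text):
    """Return a Counter mapping tag name to number of opening tags."""
    parser = TagCounter()
    parser.feed(html_text)
    parser.close()
    return parser.counts


sample = "<p>intro</p><table><tr><td>a</td><td>b</td></tr></table>"
counts = count_tags(sample)
print(counts["table"], counts["tr"], counts["td"], counts["p"])  # prints: 1 1 2 1
```

For the real output, `count_tags(output_path.read_text(encoding="utf-8"))` gives the same breakdown. A `table` count of zero alongside many `p` tags matches the symptom described in the question.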
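Parameter notes
- do_table_structure: runs the table structure model on the regions that layout analysis classifies as tables, which is what produces row and column information for the <table> output.
- table_structure_options.mode = TableFormerMode.ACCURATE: selects the more accurate TableFormer variant; TableFormerMode.FAST is quicker but tends to be less precise on complex tables.
- table_structure_options.do_cell_matching: when True, predicted cells are matched to the text cells extracted from the PDF; when False, the text cells predicted by the structure model are used instead, which can help when columns are merged or cell text is split incorrectly.
- images_scale, generate_page_images, generate_picture_images: control the rendered page and picture images; image_mode=ImageRefMode.EMBEDDED inlines them in the HTML.
Table cells are primarily modeled as text, so pictures that sit inside cells may still be exported as separate picture items outside the table. I am not aware of a pipeline option that forces them back into the cell.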
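If the output does contain tables but cells look misaligned, a rough consistency check is to compare the effective column count of each row. The sketch below uses only the standard library; `RowWidths` and `ragged_tables` are hypothetical names. It counts `colspan` but deliberately ignores `rowspan`, so a table with vertically merged cells will be reported as ragged even when it is correct. Treat the result as a hint about where to look, not as proof of an extraction error.

```python
from html.parser import HTMLParser


class RowWidths(HTMLParser):
    """Collect the effective column count of every row, per table."""

    def __init__(self):
        super().__init__()
        self.tables = []  # one list of row widths per <table>, in document order
        self._stack = []  # currently open tables, innermost last (handles nesting)

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            rows = []
            self.tables.append(rows)
            self._stack.append(rows)
        elif tag == "tr" and self._stack:
            self._stack[-1].append(0)
        elif tag in ("td", "th") and self._stack and self._stack[-1]:
            # A missing or malformed colspan counts as a single column.
            span = dict(attrs).get("colspan") or "1"
            self._stack[-1][-1] += int(span) if span.isdigit() else 1

    def handle_endtag(self, tag):
        if tag == "table" and self._stack:
            self._stack.pop()


def ragged_tables(html_text):
    """Return the indexes of tables whose rows differ in width (rowspan ignored)."""
    parser = RowWidths()
    parser.feed(html_text)
    parser.close()
    return [i for i, rows in enumerate(parser.tables) if len(set(rows)) > 1]
```

Running `ragged_tables(output_path.read_text(encoding="utf-8"))` on the saved file returns the positions of tables worth inspecting by hand, for example before and after toggling `do_cell_matching`.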
This question originally comes from Stack Exchange and was asked by Akshata.




