How can I extract PDF tables with the Docling library and convert them to structured HTML that preserves the original formatting?

Question

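I am using the Docling library to extract the content of a PDF file and convert it to HTML while preserving the structure and formatting of the original document. This is the code I use: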

from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.datamodel.base_models import InputFormat
from docling_core.types.doc import ImageRefMode
from pathlib import Path
import logging

# Set up logging
logging.basicConfig(level=logging.INFO)
log = logging.getLogger(__name__)

# Configure image settings
IMAGE_RESOLUTION_SCALE = 2.0

# Path to your PDF file
source = Path(r"C:\Users\Downloads\Journal.pdf")
output_path = Path(r"C:\Users\Desktop\output20.html")

# Configure pipeline options for image handling
pipeline_options = PdfPipelineOptions()
pipeline_options.images_scale = IMAGE_RESOLUTION_SCALE
pipeline_options.generate_page_images = True
pipeline_options.generate_picture_images = True

# Create converter instance with image options
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)

# Convert PDF to document
result = converter.convert(source)

# Save HTML with embedded images
result.document.save_as_html(output_path, image_mode=ImageRefMode.EMBEDDED)

log.info(f"HTML file with embedded images created at: {output_path}")

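The converted HTML keeps the same content order as the original PDF, but the table structure is handled poorly:

  • Tables are not recognized correctly: table data is emitted in <p> tags instead of a standard <table><tr><td> structure;
  • Table content is misaligned: cell content is split incorrectly, and images inside cells are moved outside the table;
  • Complex tables with embedded images are not preserved correctly.

What I expect: when tables are extracted from the PDF, the row and column structure is kept and rendered with standard HTML table tags, and the original formatting, alignment, and cell content are preserved.

What I have tried: inspecting the HTML output, adjusting PdfPipelineOptions parameters, and comparing against the Document Intelligence library (which extracts headers and footers better but still handles complex tables poorly).

What I am looking for: how can I use the Docling library to extract PDF tables correctly and convert them to structured HTML that preserves the original layout and formatting?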

Answer

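1. Enable table structure recognition and tune it

In the Docling 2.x releases I am aware of, table handling is configured through do_table_structure and table_structure_options on PdfPipelineOptions. do_table_structure is already enabled by default, so the settings most worth adjusting are the TableFormer mode and cell matching:

from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode

pipeline_options = PdfPipelineOptions()
# Run the table structure model (already the default; set explicitly for clarity)
pipeline_options.do_table_structure = True
# Use the more accurate, slower TableFormer variant
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE
# Map the predicted structure back onto the PDF's own text cells (the default).
# If columns come out merged or cells are split wrongly, try False as well.
pipeline_options.table_structure_options.do_cell_matching = True
# Existing image settings
pipeline_options.images_scale = IMAGE_RESOLUTION_SCALE
pipeline_options.generate_page_images = True
pipeline_options.generate_picture_images = True

2. Save the HTML as before

save_as_html does not need a table-specific argument. Tables recognized during conversion are stored as table items in the document and are written as <table> markup. A table that still shows up as <p> elements was most likely never detected as a table, which points back to the pipeline options in step 1:

result.document.save_as_html(output_path, image_mode=ImageRefMode.EMBEDDED)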
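To check whether the saved file contains real table markup or only paragraphs, a quick tag count can help. The following is a minimal sketch using only the Python standard library; TagCounter and count_tags are hypothetical helper names, not part of Docling.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts every opening tag seen while parsing an HTML string."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def count_tags(html_text, tags=("table", "tr", "td", "th", "p")):
    """Return how many times each tag of interest opens in html_text."""
    parser = TagCounter()
    parser.feed(html_text)
    parser.close()
    return {tag: parser.counts[tag] for tag in tags}


# Small inline sample; for a real run, pass the text of the saved HTML file.
sample = "<p>intro</p><table><tr><th>A</th></tr><tr><td>1</td></tr></table>"
print(count_tags(sample))
```

For the converted document, the same function can be called as count_tags(output_path.read_text(encoding="utf-8")). A result with zero table tags and many p tags suggests the tables were not recognized during conversion.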

3. Complete adjusted code

from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode
from docling.datamodel.base_models import InputFormat
from docling_core.types.doc import ImageRefMode
from pathlib import Path
import logging

# Set up logging
logging.basicConfig(level=logging.INFO)
log = logging.getLogger(__name__)

# Configure image settings
IMAGE_RESOLUTION_SCALE = 2.0

# Path to your PDF file
source = Path(r"C:\Users\Downloads\Journal.pdf")
output_path = Path(r"C:\Users\Desktop\output20.html")

# Configure pipeline options for table and image handling
pipeline_options = PdfPipelineOptions()
# Table settings
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE
pipeline_options.table_structure_options.do_cell_matching = True
# Image settings
pipeline_options.images_scale = IMAGE_RESOLUTION_SCALE
pipeline_options.generate_page_images = True
pipeline_options.generate_picture_images = True

# Create converter instance with updated options
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)

# Convert PDF to document
result = converter.convert(source)

# Save HTML with embedded images; recognized tables are written as <table> markup
result.document.save_as_html(output_path, image_mode=ImageRefMode.EMBEDDED)

log.info(f"HTML file with structured tables created at: {output_path}")

Parameter notes

  • do_table_structure: runs the table structure model on the regions that layout analysis classified as tables; enabled by default in the versions I have checked;
  • table_structure_options.mode = TableFormerMode.ACCURATE: selects the more accurate TableFormer variant; TableFormerMode.FAST trades accuracy for speed;
  • table_structure_options.do_cell_matching: when True, the predicted structure is mapped back onto the PDF's own text cells; when False, the text cells predicted by the structure model are used instead, which the Docling documentation suggests trying when several columns are wrongly merged into one;
  • images_scale, generate_page_images, generate_picture_images: control the page and picture images that are generated for export;
  • image_mode=ImageRefMode.EMBEDDED: embeds the images directly in the HTML file instead of referencing external files.

As far as I know, PdfPipelineOptions has no setting that keeps pictures inside table cells or reproduces cell alignment and text styling; pictures are exported as separate picture items, so that part of the expected result may need post-processing of the HTML.
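To tell a detection failure from a rendering problem, it can help to look at the tables the converter actually found before any HTML is saved. The sketch below assumes the converted document behaves like docling-core's DoclingDocument, with a tables list whose items offer export_to_html(doc=...); table_html_fragments and summarize_tables are hypothetical helper names introduced here for illustration.

```python
def table_html_fragments(document):
    """Return one HTML string per table item in a converted document.

    Assumption: `document` exposes a `tables` list and each item has an
    `export_to_html(doc=...)` method, as docling-core's DoclingDocument does.
    """
    return [table.export_to_html(doc=document) for table in document.tables]


def summarize_tables(document):
    """Return (table_index, row_count) pairs, counting <tr> tags per table."""
    return [
        (index, fragment.count("<tr"))
        for index, fragment in enumerate(table_html_fragments(document))
    ]


# Typical use after conversion (requires Docling and a converted result):
#     for index, rows in summarize_tables(result.document):
#         print(f"table {index}: {rows} rows")
```

An empty summary means no table items were produced during conversion, so the pipeline options are the place to look; a non-empty summary with wrong row counts points at the structure model settings such as do_cell_matching.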

This question originates from Stack Exchange; it was asked by Akshata.
