
Parsing PDF table data with Python: help needed, only the second page of the target table is extracted

Fixing PDF Table Extraction (Only Getting Second Page of First Table)

Hey there! Let's sort out this PDF table extraction problem you're facing. PyPDF2 on its own is a poor fit for structured tables because it extracts raw text and does not preserve table formatting. That is most likely why you're only getting partial data from the second page of your target table. Here's a more reliable approach:

Use tabula-py for Structured Table Extraction

tabula-py is designed specifically to detect and extract tables from PDFs while preserving their row/column structure, which makes it a good fit for tables that span several pages.

Step 1: Install the Library

First, install tabula-py and ensure you have Java installed (it's a dependency for the underlying tabula-java engine):

pip install tabula-py
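
Before running any extraction, it can help to confirm that Java is actually reachable from Python, since a missing Java install is a common reason for tabula-py to fail. The snippet below is a minimal sketch using only the standard library; the helper name java_available is made up for illustration and is not part of tabula-py.

```python
import shutil
import subprocess


def java_available() -> bool:
    """Return True if a `java` executable is on PATH and runs successfully.

    Hypothetical helper for illustration; not part of tabula-py.
    """
    java = shutil.which("java")
    if java is None:
        return False
    try:
        # `java -version` exits with code 0 on a working installation.
        result = subprocess.run([java, "-version"], capture_output=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


print("Java found" if java_available() else "Java not found on PATH")
```

If this reports that Java is missing, install a Java runtime and make sure it is on your PATH before calling tabula.read_pdf.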

Step 2: Extract the Target Table

This code pulls the table from page 11 of the PDF (the page your table labels as "Page 2"). If the table spans several pages, pass a page range instead; each page's portion typically comes back as its own DataFrame, which you can then combine:

import tabula

# Target PDF URL
pdf_url = 'http://www.ct.gov/hix/lib/hix/CT_DSG_-12132014_version_1.2_%28with_clarifications%29.pdf'

# Extract tables from page 11 (document page number)
# Use pages='10-11' if the table spans both page 10 and 11
tables = tabula.read_pdf(pdf_url, pages='11', multiple_tables=True)

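# read_pdf returns a list of DataFrames, one per detected table; it can be empty
if not tables:
    raise ValueError('No tables detected on the requested page(s)')

# Access the first (and likely only) table on the page
target_table = tables[0]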

# Print the table to verify
print(target_table)

# Optional: Save the table to a CSV file for easier handling
tabula.convert_into(pdf_url, 'ct_dsg_table.csv', pages='11', output_format='csv')

Why This Works Better Than PyPDF2

  • PyPDF2 extracts text as unstructured strings, so table rows/columns get mixed up, especially across pages.
  • tabula-py identifies table boundaries and preserves the grid structure, so you get clean, usable data.
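  • It accepts a page range (for example pages='10-11'), so you can read every page of a multi-page table in one call. With multiple_tables=True it returns a separate DataFrame for each table it detects, so the pieces of a table that spans pages generally have to be combined afterward, for example with pandas.concat.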
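
As a rough illustration of combining the pieces of a multi-page table, the sketch below stacks per-page DataFrames with pandas.concat. The helper name merge_page_tables and the sample data are invented for the example; in practice the list would come from tabula.read_pdf called with a page range. It assumes every page yields the same number of columns, and it reuses the first page's header. If a continuation page has no header row of its own, its first data row may have been read as the header, so check the result against the PDF.

```python
import pandas as pd


def merge_page_tables(frames):
    """Stack per-page DataFrames into one table (hypothetical helper).

    Assumes every frame has the same number of columns; the first
    frame's column names are applied to all of them.
    """
    if not frames:
        raise ValueError("no tables were extracted")
    first_columns = list(frames[0].columns)
    aligned = []
    for frame in frames:
        if len(frame.columns) != len(first_columns):
            raise ValueError("page tables have different column counts")
        renamed = frame.copy()
        renamed.columns = first_columns  # line continuation pages up with page one
        aligned.append(renamed)
    return pd.concat(aligned, ignore_index=True)


# Made-up stand-ins for the DataFrames that read_pdf would return.
page_10 = pd.DataFrame({"Field": ["A", "B"], "Type": ["int", "text"]})
page_11 = pd.DataFrame({"Field": ["C"], "Type": ["date"]})

merged = merge_page_tables([page_10, page_11])
print(merged)
```

Raising on a column-count mismatch is a deliberate choice here: a silent concat of misaligned pages would produce a table full of NaN values that is harder to debug.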

If you need to stick with PyPDF2, you'll have to manually parse the text by:

  • Extracting text from all relevant pages
  • Splitting text into lines
  • Using fixed-width columns or delimiters to split each line into table cells
  • Handling page breaks to merge rows across pages

This method is error-prone and requires manual adjustment for your specific table's formatting, so tabula-py is far more efficient for this use case.
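
The manual steps listed above can be sketched roughly as follows. This is only an illustration under strong assumptions: the per-page text (for example from PyPDF2's page text extraction) is assumed to separate cells with two or more spaces, and every page is assumed to repeat the same header line. The function names and sample text are made up, and real PDF text rarely comes out this clean. The sketch drops repeated headers but does not rejoin a single row that is split across a page break.

```python
import re


def split_row(line):
    """Split one text line into cells on runs of two or more whitespace characters."""
    return [cell for cell in re.split(r"\s{2,}", line.strip()) if cell]


def parse_pages(page_texts):
    """Turn per-page text into one list of rows, keeping only the first header."""
    rows = []
    header = None
    for text in page_texts:
        for line in text.splitlines():
            if not line.strip():
                continue  # skip blank lines
            cells = split_row(line)
            if header is None:
                header = cells  # first non-blank line is treated as the header
                rows.append(cells)
            elif cells == header:
                continue  # repeated header on a later page
            else:
                rows.append(cells)
    return rows


# Made-up text standing in for what a text extractor might return per page.
page_texts = [
    "Name    Type    Length\nmember_id    text    30\n",
    "Name    Type    Length\nbirth_date    date    8\n",
]

rows = parse_pages(page_texts)
for row in rows:
    print(row)
```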

This question originally comes from Stack Exchange; it was asked by Will Simpson.
