Parsing PDF Table Data with Python: Help Needed, Only the Second Page of the Target Table Is Extracted

阿华AIGC实验室

2026-5-20

Fixing PDF Table Extraction (Only Getting Second Page of First Table)

Hey there! Let's sort out this PDF table extraction problem. PyPDF2 on its own is a poor fit for structured tables: it extracts raw text and does not preserve rows and columns, which is most likely why you only get partial data from the second page of your target table. Here's a more reliable approach:

Use `tabula-py` for Structured Table Extraction

tabula-py is designed specifically to detect and extract tables from PDFs while preserving their row/column structure, which makes it a good fit for tables that continue across pages.

Step 1: Install the Library

First, install tabula-py and ensure you have Java installed (it's a dependency for the underlying tabula-java engine):

pip install tabula-py
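Since a missing Java runtime is a common reason for tabula-py to fail on first use, a quick check before running the extraction can save some confusion. The sketch below is only an illustration: the helper name `java_available` is made up for this example, and it assumes that having a `java` executable on your PATH is enough for tabula-py to find it.

```python
import shutil


def java_available() -> bool:
    """Return True if a `java` executable can be found on PATH."""
    return shutil.which("java") is not None


if __name__ == "__main__":
    if java_available():
        print("Java found at:", shutil.which("java"))
    else:
        print("Java not found on PATH; install a Java runtime before using tabula-py")
```

If the check reports that Java is missing, install a Java runtime and open a new terminal so the updated PATH is picked up.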

Step 2: Extract the Target Table

This code pulls the table from document page 11 (the page your table labels "Page 2"). If the table spans several pages, pass a page range instead and combine the per-page results afterwards:

import tabula

# Target PDF URL
pdf_url = 'http://www.ct.gov/hix/lib/hix/CT_DSG_-12132014_version_1.2_%28with_clarifications%29.pdf'

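# Extract tables from page 11 (document page number).
# read_pdf returns a list of pandas DataFrames, one per detected table.
# Use pages='10-11' if the table spans both pages 10 and 11; each page's
# portion then comes back as its own DataFrame in the list.
tables = tabula.read_pdf(pdf_url, pages='11', multiple_tables=True)

# Stop with a clear message if no table was detected on the page
if not tables:
    raise SystemExit('No tables detected on page 11')

# Access the first (and likely only) table on the page
target_table = tables[0]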

# Print the table to verify
print(target_table)

# Optional: Save the table to a CSV file for easier handling
tabula.convert_into(pdf_url, 'ct_dsg_table.csv', pages='11', output_format='csv')

Why This Works Better Than PyPDF2

PyPDF2 extracts text as unstructured strings, so table rows/columns get mixed up, especially across pages.
tabula-py identifies table boundaries and preserves the grid structure, so you get clean, usable data.
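It can read a whole page range in one call (for example pages='10-11'). Each page's portion of the table comes back as a separate DataFrame, so a multi-page table still needs to be combined afterwards, for instance with pandas.concat.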
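Combining the per-page pieces of a multi-page table can be sketched with pandas as follows. This is a minimal illustration, not a definitive recipe: `merge_page_tables` is a hypothetical helper, the two small DataFrames are made-up stand-ins for what `tabula.read_pdf(pdf_url, pages='10-11')` might return, and the sketch assumes every page yields the same number of columns in the same order.

```python
import pandas as pd


def merge_page_tables(tables):
    """Stack per-page DataFrames into one table.

    Assumes every piece has the same number of columns, in the same order.
    """
    if not tables:
        raise ValueError("no tables to merge")
    header = list(tables[0].columns)
    pieces = []
    for piece in tables:
        if piece.shape[1] != len(header):
            raise ValueError("column count differs between pages")
        piece = piece.copy()
        piece.columns = header  # reuse the first page's column names
        pieces.append(piece)
    # ignore_index=True renumbers rows 0..n-1 across all pages
    return pd.concat(pieces, ignore_index=True)


# Made-up stand-ins for the list returned by tabula.read_pdf(...)
page_10 = pd.DataFrame({"Field": ["A", "B"], "Type": ["int", "str"]})
page_11 = pd.DataFrame({"Field": ["C"], "Type": ["date"]})

merged = merge_page_tables([page_10, page_11])
print(merged)
```

One thing to watch for: if the continuation page does not repeat the header row, tabula-py may treat its first data row as column names. In that case it is probably safer to read the pages with `pandas_options={'header': None}` and assign the column names yourself before merging.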