Parsing PDF table data with Python: only the second page of the target table is extracted
Hey there! Let's sort out this PDF table extraction problem you're facing. Using PyPDF2 alone is tricky for structured tables because it focuses on raw text extraction, not preserving table formatting. That's why you're only getting partial data from the second page of your target table. Here's a more reliable approach:
Use tabula-py for Structured Table Extraction
tabula-py is designed specifically to detect and extract tables from PDFs while preserving their row/column structure, which makes it a good fit for multi-page tables.
Step 1: Install the Library
First, install tabula-py and ensure you have Java installed (it's a dependency for the underlying tabula-java engine):
pip install tabula-py
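Before running any extraction, it can help to confirm that Java is actually reachable from Python. The snippet below is a minimal sketch using only the standard library; the helper name java_available is made up for illustration and is not part of tabula-py.

```python
import shutil


def java_available() -> bool:
    """Return True if a `java` executable is found on the PATH."""
    return shutil.which("java") is not None


if java_available():
    print("Java found; tabula-py should be able to start its engine.")
else:
    print("Java not found on PATH; install a Java runtime before using tabula-py.")
```

If this reports that Java is missing, install a Java runtime and reopen your terminal so the updated PATH is picked up.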
Step 2: Extract the Target Table
This code pulls the table from page 11 of the PDF (the page labeled "Page 2" of your table). If the table spans several pages, pass the full page range instead so every part of it is picked up:

    import tabula

    # Target PDF URL
    pdf_url = 'http://www.ct.gov/hix/lib/hix/CT_DSG_-12132014_version_1.2_%28with_clarifications%29.pdf'

    # Extract tables from page 11 (document page number)
    # Use pages='10-11' if the table spans both page 10 and 11
    tables = tabula.read_pdf(pdf_url, pages='11', multiple_tables=True)

    # Access the first (and likely only) table on the page
    target_table = tables[0]

    # Print the table to verify
    print(target_table)

    # Optional: Save the table to a CSV file for easier handling
    tabula.convert_into(pdf_url, 'ct_dsg_table.csv', pages='11', output_format='csv')
Why This Works Better Than PyPDF2
- PyPDF2 extracts text as unstructured strings, so table rows/columns get mixed up, especially across pages.
- tabula-py identifies table boundaries and preserves the grid structure, so you get clean, usable data.
- It handles multi-page tables when you specify the full page range: each detected table comes back as its own DataFrame, and you can concatenate them with pandas.
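When a table spans several pages, the per-page DataFrames usually need to be stitched together. Here is a rough sketch of one way to do that with pandas. It assumes every page was parsed with the same column names, and the helper name combine_page_tables is hypothetical, not something tabula-py provides. The sample frames are made-up stand-ins for what read_pdf might return.

```python
import pandas as pd


def combine_page_tables(tables):
    """Concatenate per-page DataFrames and drop rows that just repeat the header."""
    if not tables:
        return pd.DataFrame()
    combined = pd.concat(tables, ignore_index=True)
    # A later page may repeat the header as an ordinary data row; find such rows
    header = [str(col) for col in combined.columns]
    is_header = combined.astype(str).apply(lambda row: list(row) == header, axis=1)
    return combined[~is_header].reset_index(drop=True)


# Made-up stand-ins for the list returned by tabula.read_pdf(..., pages='10-11')
page_10 = pd.DataFrame({"Field": ["A", "B"], "Type": ["int", "str"]})
page_11 = pd.DataFrame({"Field": ["Field", "C"], "Type": ["Type", "date"]})

print(combine_page_tables([page_10, page_11]))
```

In practice the column names can differ between pages (for example, when a continuation page has no header row), so you may need to rename columns before concatenating.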
If You Must Use PyPDF2 (Not Recommended)
If you need to stick with PyPDF2, you'll have to manually parse the text by:
- Extracting text from all relevant pages
- Splitting text into lines
- Using fixed-width columns or delimiters to split each line into table cells
- Handling page breaks to merge rows across pages
This method is error-prone and requires manual adjustment for your specific table's formatting, so tabula-py is far more efficient for this use case.
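For reference, the manual steps listed above could look roughly like the sketch below. It works on plain strings, so it assumes you have already collected each page's text (for example with PyPDF2's PdfReader and page.extract_text()). It also assumes cells are separated by two or more spaces or a tab, which may not hold for your PDF. The function name parse_pages_to_rows and the sample text are made up for illustration.

```python
import re


def parse_pages_to_rows(page_texts, header=None):
    """Split page texts into rows of cells, skipping blank lines and repeated headers."""
    rows = []
    for text in page_texts:
        for line in text.splitlines():
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            # Assumption: cells are separated by 2+ whitespace characters or a tab
            cells = re.split(r"\s{2,}|\t", line)
            if header is not None and cells == header and rows:
                continue  # header repeated at the top of a later page
            rows.append(cells)
    return rows


# Made-up text standing in for two extracted pages
page_1 = "Name  Type  Length\nmember_id  text  50\n"
page_2 = "Name  Type  Length\n\nclaim_id  text  35\n"

for row in parse_pages_to_rows([page_1, page_2], header=["Name", "Type", "Length"]):
    print(row)
```

Real extracted text rarely has separators this clean, so expect to tune the splitting pattern, or switch to fixed column positions, for your specific table.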
This question originates from Stack Exchange; asked by Will Simpson.




