How to Automatically Convert Table Data Extracted from a PDF into a Nested List of a Specified Format with Python and Pandas

Ahua AIGC Lab (阿华AIGC实验室)

2026-4-29

Automating Nested List Generation from Extracted PDF Tables with Python & Pandas

Hey there! Sounds like you've already nailed the table extraction part with Camelot and pandas—great start! Let's walk through how to turn those extracted columns into the nested list you need, fully automated.

Step 1: Load Your Extracted Data

First, get your table data into a pandas DataFrame—either directly from Camelot or the Excel file you saved:

import pandas as pd
from camelot import read_pdf

# Option 1: Pull directly from the PDF via Camelot
tables = read_pdf("national-tables-5-mgml-v3.pdf", pages="all")
df = tables[0].df  # Adjust the index if there are multiple tables in the PDF

# Option 2: Load from your saved Excel file instead
# (uncomment this line and skip Option 1; running both would overwrite df)
# df = pd.read_excel("extracted_table.xlsx")
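One thing worth knowing about Option 1: Camelot hands back every cell as a string and labels the columns 0, 1, 2, and so on, so any header text from the PDF ends up in the first data row. If you'd rather work with named columns, here's a minimal sketch of promoting that row to a header. The sample values and the promote_header helper are made up for illustration; they are not from your PDF or from Camelot itself.

```python
import pandas as pd

# Hypothetical stand-in for tables[0].df: all strings, integer column labels,
# header text sitting in row 0.
raw = pd.DataFrame(
    [
        ["Lower (mg/ml)", "Dose (mg)", "Upper (mg/ml)"],
        ["0.20 mg/ml", "85", "0.70 mg/ml"],
        ["0.25 mg/ml", "100", "0.80 mg/ml"],
    ]
)


def promote_header(frame: pd.DataFrame) -> pd.DataFrame:
    """Use the first row as column names and drop it from the data."""
    body = frame.iloc[1:].reset_index(drop=True)
    body.columns = frame.iloc[0].tolist()
    return body


table = promote_header(raw)
print(table.columns.tolist())
print(table.shape)
```

This step is optional: if you stick with positional access, a header row simply fails the numeric extraction in the next step and gets dropped along with the other non-numeric rows.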

Step 2: Clean & Prepare Target Columns

You need the 1st and 3rd columns (pandas uses 0-based indexing, so that's iloc[:, 0] and iloc[:, 2]). Camelot returns every cell as a string, so convert the values to numbers first; if they carry units (like "mg/ml") or extra text, extract just the numeric part:

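# Extract numeric values if columns include extra text (e.g., "5.75 mg/ml").
# astype(str) keeps .str usable if read_excel already parsed a column as numbers;
# expand=False makes str.extract return a Series rather than a one-column DataFrame.
df['lower_range'] = df.iloc[:, 0].astype(str).str.extract(r'(\d+\.\d+)', expand=False).astype(float)
df['upper_range'] = df.iloc[:, 2].astype(str).str.extract(r'(\d+\.\d+)', expand=False).astype(float)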

# Drop rows with missing values to avoid errors in the final list
clean_df = df.dropna(subset=['lower_range', 'upper_range'])
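To see what this cleaning step keeps and what it throws away, here's a small self-contained sketch. The sample table is invented (it only mimics the shape of a Camelot result), and the names sample, pattern, and cleaned are placeholders.

```python
import pandas as pd

# Made-up sample mimicking a Camelot result: all strings, header text in row 0,
# and one row whose upper value is not a number.
sample = pd.DataFrame(
    [
        ["Lower", "Dose", "Upper"],
        ["0.20 mg/ml", "85", "0.70 mg/ml"],
        ["0.25 mg/ml", "100", "n/a"],
        ["0.30 mg/ml", "120", "0.90 mg/ml"],
    ]
)

pattern = r"(\d+\.\d+)"  # one capture group: digits, a dot, digits

# Cells that don't match the pattern become NaN.
sample["lower_range"] = sample.iloc[:, 0].str.extract(pattern, expand=False).astype(float)
sample["upper_range"] = sample.iloc[:, 2].str.extract(pattern, expand=False).astype(float)

# Only rows where both values were found survive.
cleaned = sample.dropna(subset=["lower_range", "upper_range"])
print(cleaned[["lower_range", "upper_range"]])
```

The header row and the "n/a" row both drop out, so only the two fully numeric rows remain. If rows you expected are missing after this step, the regex is usually the first thing to check.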

Step 3: Generate the Nested List

Now convert the cleaned columns into your desired nested list format:

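# Create the nested list of value pairs.
# to_numpy().tolist() yields plain Python floats; a row-wise apply() would keep
# NumPy scalars, which print as np.float64(...) under NumPy 2.
oxalirange = clean_df[['lower_range', 'upper_range']].to_numpy().tolist()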

# If you need the exact formatted output from your example: tuple() wraps the
# pairs in parentheses; print oxalirange itself if you want square brackets
print(f"oxalirange = {tuple(oxalirange)}")
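If you're unsure which print format you want, this quick sketch shows both side by side. The values are made up, and clean and pairs are placeholder names; the pairs are built with to_numpy().tolist(), which returns ordinary Python floats.

```python
import pandas as pd

# Made-up cleaned data standing in for clean_df.
clean = pd.DataFrame({"lower_range": [0.2, 0.3], "upper_range": [0.7, 0.9]})

# One [lower, upper] pair per row, as plain Python floats.
pairs = clean[["lower_range", "upper_range"]].to_numpy().tolist()

print(pairs)         # [[0.2, 0.7], [0.3, 0.9]]
print(tuple(pairs))  # ([0.2, 0.7], [0.3, 0.9])
```

The data is the same either way; the only difference is whether the outer container is a list or a tuple.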

Key Adjustments for Your Specific Table

If your extracted columns have clear names, replace the iloc references with the actual column names (e.g., df['Lower Dose Limit']).
If your ranges use whole numbers or different formatting, tweak the regex in str.extract() to match (e.g., r'(\d+)' for integers, or r'(\d+(?:\.\d+)?)' to accept both integers and decimals).
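Both adjustments can be folded into one small helper so you only change the arguments, not the code. This is just a sketch under a few assumptions: column_pairs is a hypothetical name, the default pattern accepts integers and decimals, and the sample column names and values are invented.

```python
import pandas as pd


def column_pairs(frame, lower_col, upper_col, pattern=r"(\d+(?:\.\d+)?)"):
    """Return [[lower, upper], ...] from two columns, skipping rows without numbers."""
    # astype(str) lets .str work even if a column was already parsed as numeric.
    lower = frame[lower_col].astype(str).str.extract(pattern, expand=False)
    upper = frame[upper_col].astype(str).str.extract(pattern, expand=False)
    # Keep only rows where both values matched, then convert to floats.
    numbers = pd.DataFrame({"lower": lower, "upper": upper}).dropna().astype(float)
    return numbers.to_numpy().tolist()


# Made-up table with named columns and a mix of integers and decimals.
doses = pd.DataFrame(
    {
        "Lower Dose Limit": ["5 mg", "7.5 mg", "none"],
        "Upper Dose Limit": ["10 mg", "12.5 mg", "20 mg"],
    }
)

print(column_pairs(doses, "Lower Dose Limit", "Upper Dose Limit"))
# [[5.0, 10.0], [7.5, 12.5]]
```

Because the columns are looked up by label, the same helper also works on a raw Camelot frame by passing 0 and 2 as the column arguments.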

Official Documentation References

For pandas data cleaning and transformation: The official pandas documentation for Series.str.extract, pandas.to_numeric, and DataFrame.apply covers text extraction, type conversion, and row-wise operations.
For refining Camelot table extraction: The official Camelot docs have detailed instructions on adjusting parameters (like flavor or table_regions) to get more accurate table data if needed.

That should handle the automation you're looking for! Let me know if you run into specific formatting quirks with your table data.

The question behind this content comes from Stack Exchange; it was asked by Lord_Amenegg.