How to automatically convert table data extracted from a PDF into a nested list in a specified format using Python and pandas
Hey there! Sounds like you've already nailed the table extraction part with Camelot and pandas—great start! Let's walk through how to turn those extracted columns into the nested list you need, fully automated.
Step 1: Load Your Extracted Data
First, get your table data into a pandas DataFrame—either directly from Camelot or the Excel file you saved:
    import pandas as pd
    from camelot import read_pdf

    # Option 1: Pull directly from the PDF via Camelot
    tables = read_pdf("national-tables-5-mgml-v3.pdf", pages="all")
    df = tables[0].df  # Adjust the index if there are multiple tables in the PDF

    # Option 2: Load from your saved Excel file instead (use one option, not both).
    # dtype=str keeps every cell as text so the .str methods in Step 2 work.
    # df = pd.read_excel("extracted_table.xlsx", dtype=str)
Step 2: Clean & Prepare Target Columns
You need the 1st and 3rd columns (remember pandas uses 0-based indexing, so that's iloc[:,0] and iloc[:,2]). First, make sure values are numeric—if they have units (like "mg/ml") or extra text, clean them up:
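    # Extract numeric values if columns include extra text (e.g., "5.75 mg/ml").
    # astype(str) guards against columns that were loaded as numbers;
    # expand=False makes str.extract return a Series rather than a DataFrame.
    df['lower_range'] = df.iloc[:, 0].astype(str).str.extract(r'(\d+\.\d+)', expand=False).astype(float)
    df['upper_range'] = df.iloc[:, 2].astype(str).str.extract(r'(\d+\.\d+)', expand=False).astype(float)

    # Drop rows with missing values to avoid errors in the final list
    clean_df = df.dropna(subset=['lower_range', 'upper_range'])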
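To see what the cleaning step does before pointing it at the real table, here is a minimal sketch on made-up data. The `sample` frame and its values are assumptions for illustration only; they imitate Camelot output, where every cell is a string and a header row often comes through as data. `expand=False` is used so that `str.extract` returns a Series.

```python
import pandas as pd

# Hypothetical sample mimicking Camelot output: all cells are strings,
# the first row is a header extracted as data, the last row is blank.
sample = pd.DataFrame({
    0: ["Lower", "5.75 mg/ml", "6.00 mg/ml", ""],
    1: ["to", "-", "-", ""],
    2: ["Upper", "6.25 mg/ml", "6.50 mg/ml", ""],
})

pattern = r"(\d+\.\d+)"  # decimal-only pattern, as in the step above
sample["lower_range"] = sample.iloc[:, 0].str.extract(pattern, expand=False).astype(float)
sample["upper_range"] = sample.iloc[:, 2].str.extract(pattern, expand=False).astype(float)

# Rows where the pattern found nothing (header, blank row) hold NaN and get dropped.
cleaned = sample.dropna(subset=["lower_range", "upper_range"])
print(cleaned[["lower_range", "upper_range"]])
```

Only the two rows that actually contain decimal numbers survive, which is why the header and blank rows never reach the final list.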
Step 3: Generate the Nested List
Now convert the cleaned columns into your desired nested list format:
    # Create the nested list of value pairs
    oxalirange = clean_df[['lower_range', 'upper_range']].values.tolist()

    # If you need the exact formatted output from your example
    print(f"oxalirange = {tuple(oxalirange)}")
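If you want to reuse this for other tables, the three steps can be folded into one helper. This is only a sketch: the function name `columns_to_pairs`, its parameters, and the `demo` data are hypothetical and not part of your original script.

```python
import pandas as pd

def columns_to_pairs(df, lower_idx=0, upper_idx=2, pattern=r"(\d+\.\d+)"):
    """Return [[lower, upper], ...] built from two columns chosen by position."""
    pairs = pd.DataFrame({
        "lower": df.iloc[:, lower_idx].astype(str).str.extract(pattern, expand=False),
        "upper": df.iloc[:, upper_idx].astype(str).str.extract(pattern, expand=False),
    }).astype(float)
    # Keep only rows where both bounds were found.
    return pairs.dropna().values.tolist()

# Hypothetical stand-in for tables[0].df
demo = pd.DataFrame({
    0: ["Range", "5.75 mg/ml", "6.25 mg/ml"],
    1: ["", "to", "to"],
    2: ["", "6.25 mg/ml", "6.75 mg/ml"],
})
oxalirange = columns_to_pairs(demo)
print(oxalirange)
```

With a real PDF you would pass `tables[0].df` instead of `demo` and adjust `lower_idx` and `upper_idx` to match your table layout.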
Key Adjustments for Your Specific Table
- If your extracted columns have clear names (instead of using iloc), swap the index references with the actual column name (e.g., df['Lower Dose Limit']).
- If your ranges use whole numbers or different formatting, tweak the regex in str.extract() to match (e.g., r'(\d+)' for integers).
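If your table mixes whole numbers and decimals, one pattern with an optional fractional part can cover both. A small sketch, with made-up cell values as the only assumption:

```python
import pandas as pd

# Hypothetical cell contents with mixed formatting
cells = pd.Series(["5 mg/ml", "5.75 mg/ml", "approx. 10", "n/a"])

# (?:\.\d+)? makes the fractional part optional, so "5" and "5.75" both match.
number_pattern = r"(\d+(?:\.\d+)?)"
values = cells.str.extract(number_pattern, expand=False).astype(float)
print(values.tolist())
```

Cells with no digits at all (like "n/a") come back as NaN, so the dropna step still filters them out.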
Official Documentation References
- For pandas data cleaning and transformation: Check out the official guides on DataFrame manipulation and numeric data conversion—they cover everything from type casting to row-wise operations.
- For refining Camelot table extraction: The official Camelot docs have detailed instructions on adjusting parameters (like flavor or table_regions) to get more accurate table data if needed.
That should handle the automation you're looking for! Let me know if you run into specific formatting quirks with your table data.
The question comes from Stack Exchange; asked by Lord_Amenegg.




