Stuck parsing TSP quarterly investment PDF statements with pdfplumber; looking for ways to improve the approach
Hey there, let's work through this TSP PDF parsing challenge to get your transaction data into Quicken. I've tackled similar unstructured PDF table scenarios with pdfplumber, so here's a breakdown of how to refine your approach:
Background
Since the federal TSP doesn't support direct online downloads, you're right to parse quarterly statements into QIF files for Quicken. With Python 3.8 and pdfplumber, your goal is to extract dated transactions from both the Transaction Detail By Source and Transaction Detail By Fund sections.
Current Code
Here's your existing code for reference:
```python
import pdfplumber

with pdfplumber.open(r'C:\Users\ra_d\\statements\Investments\TSP\1Q 2011.pdf') as pdf:
    for x in pdf.pages:
        print(x.extract_text())
        # print(x.extract_words())
        # print(x.extract_tables())
```
Extracted Text Output
You've successfully pulled this structured text from the PDF:
TRANSACTION DETAIL BY SOURCE
Agency Payroll Posting Automatic Agency Office
Date Transaction Type Employee (1%) Matching Total
Beginning Balance $0.00 $0.00 $0.00 $0.00
12400001 03/22/11 Auto Enrollment Contribution 69.00 23.00 69.00 161.00
Increase/Decrease in Value 0.05 0.02 0.05 0.12
Ending Balance $69.05 $23.02 $69.05 $161.12TRANSACTION DETAIL BY FUND
Government Securities Investment (G) Fund
Number Posting Transaction Share of Dollar
Date Transaction Type Amount Price Shares Balance
Beginning Balance $13.4882 0.0000 $0.00
03/22/11 Auto Enrollment Contribution $161.00 13.5752 11.8599
Ending Balance $13.5854 11.8599 $161.12
The Problem: extract_tables() Returns Empty
pdfplumber's default extract_tables() relies on detecting drawn table borders (its "lines" strategy), which your TSP PDF most likely doesn't have. That's why it returns an empty list: we need to tell pdfplumber to infer table columns and rows from the text layout instead.
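It's easy to confirm that hypothesis before changing anything else. Here's a quick diagnostic sketch (reusing your path from the code above): if the pages report zero or near-zero drawn lines and rectangles, the default strategy has no borders to latch onto.

```python
import pdfplumber

# Count drawn lines/rects per page; the default "lines" table strategy
# can only find tables if these objects exist.
with pdfplumber.open(r'C:\Users\ra_d\\statements\Investments\TSP\1Q 2011.pdf') as pdf:
    for i, page in enumerate(pdf.pages, start=1):
        print(f"Page {i}: {len(page.lines)} lines, {len(page.rects)} rects")
```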
Optimization Suggestions
1. Use Text-Based Table Detection
Override the default table detection settings to use text spacing instead of borders. This works because the columns are consistently aligned by whitespace:
```python
import pdfplumber

with pdfplumber.open(r'C:\Users\ra_d\\statements\Investments\TSP\1Q 2011.pdf') as pdf:
    for page in pdf.pages:
        # Use text-based strategy to find tables without borders
        tables = page.extract_tables(table_settings={
            "vertical_strategy": "text",    # Use text gaps to detect columns
            "horizontal_strategy": "text",  # Use text gaps to detect rows
            "snap_tolerance": 3,            # Adjust if columns are slightly misaligned
        })
        for table in tables:
            print("Extracted table:")
            for row in table:
                print(row)
```
This should now pick up the structured rows and columns in both sections.
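If the "text" strategy still merges or splits columns oddly (the dollar columns in the Source section sit fairly close together), pdfplumber also lets you supply explicit column boundaries. A rough sketch follows; the x-coordinates below are placeholders I made up, not measured from your file, so read the real ones off extract_words() or the debug view mentioned under Final Tips.

```python
import pdfplumber

# Hypothetical column boundaries (x-coordinates in PDF points); replace with
# values measured from your own statement.
COLUMN_EDGES = [30, 90, 250, 330, 410, 490, 570]

with pdfplumber.open(r'C:\Users\ra_d\\statements\Investments\TSP\1Q 2011.pdf') as pdf:
    for page in pdf.pages:
        tables = page.extract_tables(table_settings={
            "vertical_strategy": "explicit",          # columns come from the list below
            "explicit_vertical_lines": COLUMN_EDGES,  # assumed positions, adjust to fit
            "horizontal_strategy": "text",            # rows still inferred from text
        })
        for table in tables:
            for row in table:
                print(row)
```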
2. Parse Line-by-Line with Regular Expressions
If text-based table detection still needs tweaks, you can parse the extracted text line-by-line using regex to target specific transaction patterns. This gives you full control over separating headers from data:
```python
import pdfplumber
import re

# Regex patterns for each section
source_transaction_re = re.compile(
    r'(\d{2}/\d{2}/\d{2})\s+(\w+\s+\w+\s+\w+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)'
)
fund_transaction_re = re.compile(
    r'(\d{2}/\d{2}/\d{2})\s+(\w+\s+\w+\s+\w+)\s+\$([\d.]+)\s+([\d.]+)\s+([\d.]+)'
)
date_re = re.compile(r'\d{2}/\d{2}/\d{2}')

with pdfplumber.open(r'C:\Users\ra_d\\statements\Investments\TSP\1Q 2011.pdf') as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        lines = text.split('\n')

        in_source_section = False
        in_fund_section = False
        current_fund = ""

        for line in lines:
            line = line.strip()
            if not line:
                continue

            # Toggle section tracking ("in" rather than "==" because the section
            # title can share a line with other text, e.g. the Ending Balance row)
            if "TRANSACTION DETAIL BY SOURCE" in line:
                in_source_section = True
                in_fund_section = False
                continue
            elif "TRANSACTION DETAIL BY FUND" in line:
                in_source_section = False
                in_fund_section = True
                continue
            # Capture the fund name in the Fund section: a line with no date and
            # none of the header/summary keywords is treated as a fund title
            elif (in_fund_section
                  and not date_re.search(line)
                  and not any(key in line for key in ["Date", "Number", "Beginning", "Ending"])):
                current_fund = line
                continue

            # Process Source section transactions
            if in_source_section:
                # Skip non-transaction rows
                if any(key in line for key in ["Beginning Balance", "Ending Balance",
                                               "Increase/Decrease in Value"]):
                    continue
                match = source_transaction_re.search(line)
                if match:
                    posting_date = match.group(1)
                    trans_type = match.group(2)
                    emp_contrib = match.group(3)
                    agency_auto = match.group(4)      # Agency Automatic (1%)
                    agency_matching = match.group(5)  # Agency Matching
                    total = match.group(6)
                    # Store this data for QIF generation
                    print(f"Source Trans: {posting_date} | {trans_type} | Total: ${total}")

            # Process Fund section transactions
            if in_fund_section:
                if any(key in line for key in ["Beginning Balance", "Ending Balance"]):
                    continue
                match = fund_transaction_re.search(line)
                if match:
                    posting_date = match.group(1)
                    trans_type = match.group(2)
                    amount = match.group(3)
                    share_price = match.group(4)
                    shares = match.group(5)
                    # Store this data for QIF generation
                    print(f"Fund Trans: {posting_date} | {trans_type} | "
                          f"Fund: {current_fund} | Shares: {shares}")
```
3. Distinguish Headers from Data Rows
To make sure you're not mixing headers with transaction data:
- Check for keywords like `Date` or `Transaction Type` to flag header rows
- Look for date patterns (`MM/DD/YY`) to identify transaction rows
- Skip summary rows like `Beginning Balance` or `Ending Balance` (unless you need that data for reconciliation); a small helper applying these rules is sketched below
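Here's a minimal sketch of such a helper. The keyword lists and the `classify_line` name are mine, so tune them to whatever your other statements actually contain:

```python
import re

DATE_RE = re.compile(r'\b\d{2}/\d{2}/\d{2}\b')  # MM/DD/YY anywhere in the line
HEADER_KEYWORDS = ("Date", "Transaction Type", "Posting", "Number")
SUMMARY_KEYWORDS = ("Beginning Balance", "Ending Balance", "Increase/Decrease in Value")

def classify_line(line: str) -> str:
    """Rough classification of a statement line; adjust keywords to your PDFs."""
    if any(key in line for key in SUMMARY_KEYWORDS):
        return "summary"
    if DATE_RE.search(line):
        return "transaction"
    if any(key in line for key in HEADER_KEYWORDS):
        return "header"
    return "other"  # e.g. fund names or section titles

# Lines taken from the extracted text above
print(classify_line("03/22/11 Auto Enrollment Contribution $161.00 13.5752 11.8599"))  # transaction
print(classify_line("Date Transaction Type Employee (1%) Matching Total"))             # header
print(classify_line("Beginning Balance $0.00 $0.00 $0.00 $0.00"))                      # summary
```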
4. Generate QIF Output
Once you've extracted the transaction data, you can format it into QIF entries. For example:
```python
# Example QIF entry generation (investment-account format)
qif_entries = []

# For a fund transaction: D=date, N=action, Y=security, I=price per share,
# Q=number of shares, T=total amount, M=memo, ^ ends the record
qif_entry = f"""D{posting_date}
NBuy
Y{current_fund}
I{share_price}
Q{shares}
T{amount}
M{trans_type}
^"""
qif_entries.append(qif_entry)

# Write to QIF file, with the account-type header once at the top
with open("tsp_transactions.qif", "w") as f:
    f.write("!Type:Invst\n")
    f.write("\n".join(qif_entries))
```
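To tie steps 2 and 4 together, here's a minimal end-to-end sketch. It assumes you append each parsed fund row to a list of dicts inside the step-2 loop instead of printing it; the dict keys and the `fund_transactions` name are mine, not from your code, and the sample values come from the statement excerpt above. Treating a contribution as a `Buy` action is a design choice worth verifying against a small hand-made QIF import in Quicken first.

```python
# Hypothetical container filled in the fund-section loop from step 2, e.g.
# fund_transactions.append({"date": posting_date, "type": trans_type,
#                           "fund": current_fund, "amount": amount,
#                           "price": share_price, "shares": shares})
fund_transactions = [
    {"date": "03/22/11", "type": "Auto Enrollment Contribution",
     "fund": "Government Securities Investment (G) Fund",
     "amount": "161.00", "price": "13.5752", "shares": "11.8599"},
]

with open("tsp_transactions.qif", "w") as f:
    f.write("!Type:Invst\n")               # investment-account header, once per file
    for txn in fund_transactions:
        f.write(f"D{txn['date']}\n")       # posting date
        f.write("NBuy\n")                  # action: treat contribution as a purchase
        f.write(f"Y{txn['fund']}\n")       # security name
        f.write(f"I{txn['price']}\n")      # price per share
        f.write(f"Q{txn['shares']}\n")     # number of shares
        f.write(f"T{txn['amount']}\n")     # total dollar amount
        f.write(f"M{txn['type']}\n")       # memo: original transaction type
        f.write("^\n")                     # end of record
```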
Final Tips
- Use `page.extract_words()` to get x/y coordinates of each text element if you need to group columns by position (great for misaligned text); a rough sketch of this appears after the list
- Test with different `snap_tolerance` values in the table settings if columns are slightly off
- Check `page.debug_tablefinder()` (or `page.to_image().debug_tablefinder()` for a rendered picture) to see how pdfplumber is identifying table regions (helpful for troubleshooting)
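As a concrete example of the first tip, here's a rough sketch that groups `extract_words()` output into visual rows by vertical position, printing each word with its x-coordinate so you can see where the column boundaries actually fall. The tolerance value is a guess to tune, not measured from your statement.

```python
import pdfplumber

ROW_TOLERANCE = 3  # vertical slack (points) when deciding two words share a row

with pdfplumber.open(r'C:\Users\ra_d\\statements\Investments\TSP\1Q 2011.pdf') as pdf:
    page = pdf.pages[0]
    words = page.extract_words()  # each word dict includes 'text', 'x0', 'top'

    # Sort by vertical position, then bucket words whose 'top' values are close
    words.sort(key=lambda w: (w["top"], w["x0"]))
    rows = []
    for word in words:
        if rows and abs(word["top"] - rows[-1][-1]["top"]) <= ROW_TOLERANCE:
            rows[-1].append(word)
        else:
            rows.append([word])

    # Print each reconstructed row left-to-right, with x-positions for reference
    for row in rows:
        row.sort(key=lambda w: w["x0"])
        print(" | ".join(f"{w['text']}@{int(w['x0'])}" for w in row))
```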
(The original question is from Stack Exchange, asked by user2754571.)




