Trouble parsing TSP quarterly investment PDF statements with pdfplumber: looking for an improved approach

Hey there, let's work through this TSP PDF parsing challenge to get your transaction data into Quicken. I've tackled similar unstructured PDF table scenarios with pdfplumber, so here's a breakdown of how to refine your approach:

TSP PDF Transaction Parsing for Quicken QIF Import

Background

Since the federal TSP doesn't support direct online downloads, you're right to parse quarterly statements into QIF files for Quicken. With Python 3.8 and pdfplumber, your goal is to extract dated transactions from both the Transaction Detail By Source and Transaction Detail By Fund sections.

Current Code

Here's your existing code for reference:

import pdfplumber

with pdfplumber.open(r'C:\Users\ra_d\statements\Investments\TSP\1Q 2011.pdf') as pdf:
    for page in pdf.pages:
        print(page.extract_text())
        # print(page.extract_words())
        # print(page.extract_tables())

Extracted Text Output

You've successfully pulled this structured text from the PDF:

TRANSACTION DETAIL BY SOURCE
Agency Payroll Posting Automatic Agency Office
Date Transaction Type Employee (1%) Matching Total
Beginning Balance $0.00 $0.00 $0.00 $0.00
12400001 03/22/11 Auto Enrollment Contribution 69.00 23.00 69.00 161.00
Increase/Decrease in Value 0.05 0.02 0.05 0.12
Ending Balance $69.05 $23.02 $69.05 $161.12

TRANSACTION DETAIL BY FUND
Government Securities Investment (G) Fund
Number Posting Transaction Share of Dollar
Date Transaction Type Amount Price Shares Balance
Beginning Balance $13.4882 0.0000 $0.00
03/22/11 Auto Enrollment Contribution $161.00 13.5752 11.8599
Ending Balance $13.5854 11.8599 $161.12

The Problem: extract_tables() Returns Empty

pdfplumber's default extract_tables() uses the "lines" strategy, which looks for ruled borders drawn in the PDF. TSP statements lay out their tables with whitespace alone, so no table regions are detected and the call returns an empty list. The fix is to tell pdfplumber to infer columns and rows from the text itself.

Optimization Suggestions

1. Use Text-Based Table Detection

Override the default table detection settings to use text spacing instead of borders. This works because the columns are consistently aligned by whitespace:

import pdfplumber

with pdfplumber.open(r'C:\Users\ra_d\statements\Investments\TSP\1Q 2011.pdf') as pdf:
    for page in pdf.pages:
        # Use text-based strategy to find tables without borders
        tables = page.extract_tables(table_settings={
            "vertical_strategy": "text",  # Use text gaps to detect columns
            "horizontal_strategy": "text",  # Use text gaps to detect rows
            "snap_tolerance": 3,  # Adjust if columns are slightly misaligned
        })
        for table in tables:
            print("Extracted table:")
            for row in table:
                print(row)

This should now pick up the structured rows and columns in both sections.

2. Parse Line-by-Line with Regular Expressions

If text-based table detection still needs tweaks, you can parse the extracted text line-by-line using regex to target specific transaction patterns. This gives you full control over separating headers from data:

import pdfplumber
import re

# Regex patterns for each section.
# A lazy ".+?" handles transaction types of any word count, and dollar amounts
# are matched as "digits[,digits].dd" so values like 1,234.56 still parse.
source_transaction_re = re.compile(
    r'(\d{2}/\d{2}/\d{2})\s+(.+?)\s+'
    r'([\d,]+\.\d{2})\s+([\d,]+\.\d{2})\s+([\d,]+\.\d{2})\s+([\d,]+\.\d{2})$')
fund_transaction_re = re.compile(
    r'(\d{2}/\d{2}/\d{2})\s+(.+?)\s+\$([\d,]+\.\d{2})\s+([\d.]+)\s+([\d.]+)$')

with pdfplumber.open(r'C:\Users\ra_d\statements\Investments\TSP\1Q 2011.pdf') as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        lines = text.split('\n')
        in_source_section = False
        in_fund_section = False
        current_fund = ""

        for line in lines:
            line = line.strip()
            if not line:
                continue
            
            # Toggle section tracking
            if line == "TRANSACTION DETAIL BY SOURCE":
                in_source_section = True
                in_fund_section = False
                continue
            elif line == "TRANSACTION DETAIL BY FUND":
                in_source_section = False
                in_fund_section = True
                continue
            # Capture the fund name (e.g. "Government Securities Investment (G) Fund").
            # Match on the trailing word "Fund" so that header and transaction lines
            # in this section are not mistaken for fund names.
            elif in_fund_section and line.endswith("Fund"):
                current_fund = line
                continue
            
            # Process Source section transactions
            if in_source_section:
                # Skip non-transaction rows
                if any(key in line for key in ["Beginning Balance", "Ending Balance", "Increase/Decrease in Value"]):
                    continue
                match = source_transaction_re.search(line)
                if match:
                    posting_date = match.group(1)
                    trans_type = match.group(2)
                    emp_contrib = match.group(3)
                    agency_match = match.group(4)
                    office_match = match.group(5)
                    total = match.group(6)
                    # Store this data for QIF generation
                    print(f"Source Trans: {posting_date} | {trans_type} | Total: ${total}")
            
            # Process Fund section transactions
            if in_fund_section:
                if any(key in line for key in ["Beginning Balance", "Ending Balance"]):
                    continue
                match = fund_transaction_re.search(line)
                if match:
                    posting_date = match.group(1)
                    trans_type = match.group(2)
                    amount = match.group(3)
                    share_price = match.group(4)
                    shares = match.group(5)
                    # Store this data for QIF generation
                    print(f"Fund Trans: {posting_date} | {trans_type} | Fund: {current_fund} | Shares: {shares}")
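Before wiring patterns like these into the full loop, it helps to exercise them against the sample lines from your extracted text. This standalone check uses a slightly more permissive variant of the patterns (a lazy `.+?` absorbs the multi-word transaction type, and amounts allow thousands separators):

```python
import re

# Permissive variants of the section patterns, restated so this check runs on its own
source_re = re.compile(
    r'(\d{2}/\d{2}/\d{2})\s+(.+?)\s+'
    r'([\d,]+\.\d{2})\s+([\d,]+\.\d{2})\s+([\d,]+\.\d{2})\s+([\d,]+\.\d{2})$')
fund_re = re.compile(
    r'(\d{2}/\d{2}/\d{2})\s+(.+?)\s+\$([\d,]+\.\d{2})\s+([\d.]+)\s+([\d.]+)$')

# Sample lines straight from the extracted text above
source_line = "12400001 03/22/11 Auto Enrollment Contribution 69.00 23.00 69.00 161.00"
fund_line = "03/22/11 Auto Enrollment Contribution $161.00 13.5752 11.8599"

m = source_re.search(source_line)
print(m.group(1), m.group(2), m.group(6))  # date, transaction type, total

m = fund_re.search(fund_line)
print(m.group(3), m.group(4), m.group(5))  # dollar amount, share price, shares
```

Note that `search()` (rather than `match()`) lets the leading payroll-office number in the Source section pass through untouched.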

3. Distinguish Headers from Data Rows

To make sure you're not mixing headers with transaction data:

  • Check for keywords like Date, Transaction Type to flag header rows
  • Look for date patterns (MM/DD/YY) to identify transaction rows
  • Skip summary rows like Beginning Balance or Ending Balance (unless you need that data for reconciliation)
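Those three checks can be rolled into a small classifier. This is a sketch with hypothetical names (`classify_line`, the keyword tuples) that you would tune to your statements:

```python
import re

# Transaction rows start with MM/DD/YY, possibly after a payroll-office number
DATE_RE = re.compile(r"^\d{2}/\d{2}/\d{2}\b")
HEADER_KEYWORDS = ("Date", "Transaction Type")
SUMMARY_KEYWORDS = ("Beginning Balance", "Ending Balance", "Increase/Decrease")

def classify_line(line):
    """Return 'summary', 'header', 'transaction', or 'other' for a statement line."""
    if any(k in line for k in SUMMARY_KEYWORDS):
        return "summary"
    if any(k in line for k in HEADER_KEYWORDS):
        return "header"
    # The Source section prefixes transactions with a payroll-office number;
    # strip it before testing for the leading date.
    if DATE_RE.search(re.sub(r"^\d{6,}\s+", "", line)):
        return "transaction"
    return "other"
```

Anything classified as "other" in the Fund section is a candidate fund-name line.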

4. Generate QIF Output

Once you've extracted the transaction data, you can format it into QIF entries. For example:

# Example QIF entry generation. In the Invst format, D = date, N = action,
# Y = security name, I = price, Q = quantity (shares), T = dollar amount,
# and ^ terminates the record; the !Type:Invst header appears once per file.
qif_entries = []
# For a fund transaction (contributions import as Buy actions; adjust the
# action code per transaction type)
qif_entry = f"""D{posting_date}
NBuy
Y{current_fund}
I{share_price}
Q{shares}
T{amount}
^"""
qif_entries.append(qif_entry)

# Write to QIF file
with open("tsp_transactions.qif", "w") as f:
    f.write("!Type:Invst\n")
    f.write("\n".join(qif_entries))

Final Tips

  • Use page.extract_words() to get x/y coordinates of each text element if you need to group columns by position (great for misaligned text)
  • Test with different snap_tolerance values in table settings if columns are slightly off
  • Check page.debug_tablefinder() to visualize how pdfplumber is identifying table regions (helpful for troubleshooting)
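On the first tip: extract_words() returns dicts carrying `x0`/`top` coordinates, which you can bucket into visual rows yourself. A minimal sketch (the helper name and the 3-point tolerance are assumptions to tune):

```python
def group_words_into_rows(words, y_tolerance=3):
    """Group pdfplumber word dicts into visual rows by their 'top' coordinate."""
    rows = []
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        if rows and abs(word["top"] - rows[-1][0]) <= y_tolerance:
            rows[-1][1].append(word["text"])  # close enough vertically: same row
        else:
            rows.append((word["top"], [word["text"]]))  # start a new row
    return [texts for _, texts in rows]

# Fake word dicts in the shape extract_words() returns
words = [
    {"text": "Date", "x0": 36.0, "top": 100.0},
    {"text": "Transaction", "x0": 90.0, "top": 100.4},
    {"text": "03/22/11", "x0": 36.0, "top": 114.8},
]
print(group_words_into_rows(words))  # [['Date', 'Transaction'], ['03/22/11']]
```

The same idea applied to `x0` gives you column boundaries, which is handy when a cell is blank and plain text splitting loses the alignment.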

This question comes from Stack Exchange, asked by user2754571.
