如何用Python批量抓取Excel文件并导入Pandas DataFrame构建时间序列

阿华AIGC实验室

2026-5-14

Absolutely! You can totally automate this process—no more manual downloads needed. Let's walk through a step-by-step solution that uses BeautifulSoup to scrape the links, downloads the files automatically, and loads everything into a pandas DataFrame (with optional filtering for specific years).

Step 1: Install Required Libraries

First, make sure you have the necessary packages installed. Open your terminal/command prompt and run:

pip install requests beautifulsoup4 pandas openpyxl xlrd==1.2.0

requests fetches the webpage content
beautifulsoup4 parses the HTML to extract links
pandas handles the DataFrames
openpyxl reads modern .xlsx files
xlrd==1.2.0 supports older .xls files (newer versions of xlrd drop .xls support)

Step 2: Scrape All Relevant Download Links

This code will fetch the CFTC page, parse it, and pull out all the financial futures historical Excel zip links:

import requests
from bs4 import BeautifulSoup
import pandas as pd
from zipfile import ZipFile
import io
import time

# Target CFTC page URL
base_url = "https://www.cftc.gov/MarketReports/CommitmentsofTraders/HistoricalCompressed/index.htm"

# Fetch the page content
response = requests.get(base_url)
soup = BeautifulSoup(response.text, "html.parser")

# Extract all financial futures Excel zip links
download_links = []
for link in soup.find_all("a", href=True):
    href = link["href"]
    # Filter for financial futures historical zips (matches the site's naming pattern)
    if "fut_fin_xls_" in href and href.endswith(".zip"):
        # Convert relative links to absolute URLs
        if not href.startswith("http"):
            href = f"https://www.cftc.gov{href}"
        download_links.append(href)

print(f"Found {len(download_links)} relevant files")

Step 3: Filter Links for Specific Years (Optional)

If you don't want every single year, add this to filter for a range or specific years:

# Example: Keep only links for 2010-2020
target_years = range(2010, 2021)
filtered_links = [
    link for link in download_links 
    if any(str(year) in link for year in target_years)
]

# Or pick specific years like [2015, 2018, 2020]
# target_years = [2015, 2018, 2020]

Step 4: Download, Extract, and Combine into a Single DataFrame

This part will loop through the links, download each zip, extract the Excel file inside, read it into a DataFrame, and combine everything:

# Initialize a list to hold each year's data
df_list = []

for link in filtered_links:
    print(f"Processing {link}...")
    # Download the zip file
    zip_response = requests.get(link)
    # Read the zip in memory (no need to save to disk)
    with ZipFile(io.BytesIO(zip_response.content)) as zip_file:
        # Get the Excel filename from the zip (there's usually only one per zip)
        excel_files = [f for f in zip_file.namelist() if f.endswith((".xls", ".xlsx"))]
        if not excel_files:
            print(f"No Excel file found in {link}")
            continue
        excel_filename = excel_files[0]
        
        # Choose the right engine based on file type
        engine = "openpyxl" if excel_filename.endswith(".xlsx") else "xlrd"
        
        # Read the Excel file into a DataFrame
        with zip_file.open(excel_filename) as excel_file:
            df = pd.read_excel(excel_file, engine=engine)
            # Add a year column to track where the data came from
            year = link.split("_")[-1].replace(".zip", "")
            df["Data_Year"] = year
            df_list.append(df)
    
    # Add a small delay to avoid overwhelming the server
    time.sleep(1)

# Combine all DataFrames into one
combined_df = pd.concat(df_list, ignore_index=True)

# Check the result
print("\nCombined DataFrame preview:")
print(combined_df.head())

Key Notes for Python Newbies

Website Structure Changes: If the CFTC updates their page layout or file naming, you might need to adjust the filter ("fut_fin_xls_") to match the new link patterns.
Rate Limiting: The time.sleep(1) adds a 1-second pause between downloads to be polite to the server—don't remove this unless you want to risk getting blocked.
Troubleshooting: If you get errors reading Excel files, double-check that you installed both openpyxl and xlrd==1.2.0.
Saving the Data: To save your combined DataFrame to a local Excel file, run combined_df.to_excel("cftc_financial_futures_data.xlsx", index=False)

内容的提问来源于stack exchange，提问作者tsmith41094