You need to enable JavaScript to run this app.
最新活动
大模型
产品
解决方案
定价
生态与合作
支持与服务
开发者
了解我们

如何用Python批量抓取Excel文件并导入Pandas DataFrame构建时间序列

Absolutely! You can totally automate this process—no more manual downloads needed. Let's walk through a step-by-step solution that uses BeautifulSoup to scrape the links, downloads the files automatically, and loads everything into a pandas DataFrame (with optional filtering for specific years).

Step 1: Install Required Libraries

First, make sure you have the necessary packages installed. Open your terminal/command prompt and run:

pip install requests beautifulsoup4 pandas openpyxl xlrd==1.2.0
  • requests fetches the webpage content
  • beautifulsoup4 parses the HTML to extract links
  • pandas handles the DataFrames
  • openpyxl reads modern .xlsx files
  • xlrd==1.2.0 supports older .xls files (newer versions of xlrd drop .xls support)

This code will fetch the CFTC page, parse it, and pull out all the financial futures historical Excel zip links:

import requests
from bs4 import BeautifulSoup
import pandas as pd
from zipfile import ZipFile
import io
import time

# Target CFTC page URL
base_url = "https://www.cftc.gov/MarketReports/CommitmentsofTraders/HistoricalCompressed/index.htm"

# Fetch the page content
response = requests.get(base_url)
soup = BeautifulSoup(response.text, "html.parser")

# Extract all financial futures Excel zip links
download_links = []
for link in soup.find_all("a", href=True):
    href = link["href"]
    # Filter for financial futures historical zips (matches the site's naming pattern)
    if "fut_fin_xls_" in href and href.endswith(".zip"):
        # Convert relative links to absolute URLs
        if not href.startswith("http"):
            href = f"https://www.cftc.gov{href}"
        download_links.append(href)

print(f"Found {len(download_links)} relevant files")

If you don't want every single year, add this to filter for a range or specific years:

# Example: Keep only links for 2010-2020
target_years = range(2010, 2021)
filtered_links = [
    link for link in download_links 
    if any(str(year) in link for year in target_years)
]

# Or pick specific years like [2015, 2018, 2020]
# target_years = [2015, 2018, 2020]
Step 4: Download, Extract, and Combine into a Single DataFrame

This part will loop through the links, download each zip, extract the Excel file inside, read it into a DataFrame, and combine everything:

# Initialize a list to hold each year's data
df_list = []

for link in filtered_links:
    print(f"Processing {link}...")
    # Download the zip file
    zip_response = requests.get(link)
    # Read the zip in memory (no need to save to disk)
    with ZipFile(io.BytesIO(zip_response.content)) as zip_file:
        # Get the Excel filename from the zip (there's usually only one per zip)
        excel_files = [f for f in zip_file.namelist() if f.endswith((".xls", ".xlsx"))]
        if not excel_files:
            print(f"No Excel file found in {link}")
            continue
        excel_filename = excel_files[0]
        
        # Choose the right engine based on file type
        engine = "openpyxl" if excel_filename.endswith(".xlsx") else "xlrd"
        
        # Read the Excel file into a DataFrame
        with zip_file.open(excel_filename) as excel_file:
            df = pd.read_excel(excel_file, engine=engine)
            # Add a year column to track where the data came from
            year = link.split("_")[-1].replace(".zip", "")
            df["Data_Year"] = year
            df_list.append(df)
    
    # Add a small delay to avoid overwhelming the server
    time.sleep(1)

# Combine all DataFrames into one
combined_df = pd.concat(df_list, ignore_index=True)

# Check the result
print("\nCombined DataFrame preview:")
print(combined_df.head())
Key Notes for Python Newbies
  • Website Structure Changes: If the CFTC updates their page layout or file naming, you might need to adjust the filter ("fut_fin_xls_") to match the new link patterns.
  • Rate Limiting: The time.sleep(1) adds a 1-second pause between downloads to be polite to the server—don't remove this unless you want to risk getting blocked.
  • Troubleshooting: If you get errors reading Excel files, double-check that you installed both openpyxl and xlrd==1.2.0.
  • Saving the Data: To save your combined DataFrame to a local Excel file, run combined_df.to_excel("cftc_financial_futures_data.xlsx", index=False)

内容的提问来源于stack exchange,提问作者tsmith41094

火山引擎 最新活动