如何用Python批量抓取Excel文件并导入Pandas DataFrame构建时间序列
Absolutely! You can totally automate this process—no more manual downloads needed. Let's walk through a step-by-step solution that uses BeautifulSoup to scrape the links, downloads the files automatically, and loads everything into a pandas DataFrame (with optional filtering for specific years).
First, make sure you have the necessary packages installed. Open your terminal/command prompt and run:
pip install requests beautifulsoup4 pandas openpyxl xlrd==1.2.0
requestsfetches the webpage contentbeautifulsoup4parses the HTML to extract linkspandashandles the DataFramesopenpyxlreads modern .xlsx filesxlrd==1.2.0supports older .xls files (newer versions of xlrd drop .xls support)
This code will fetch the CFTC page, parse it, and pull out all the financial futures historical Excel zip links:
import requests from bs4 import BeautifulSoup import pandas as pd from zipfile import ZipFile import io import time # Target CFTC page URL base_url = "https://www.cftc.gov/MarketReports/CommitmentsofTraders/HistoricalCompressed/index.htm" # Fetch the page content response = requests.get(base_url) soup = BeautifulSoup(response.text, "html.parser") # Extract all financial futures Excel zip links download_links = [] for link in soup.find_all("a", href=True): href = link["href"] # Filter for financial futures historical zips (matches the site's naming pattern) if "fut_fin_xls_" in href and href.endswith(".zip"): # Convert relative links to absolute URLs if not href.startswith("http"): href = f"https://www.cftc.gov{href}" download_links.append(href) print(f"Found {len(download_links)} relevant files")
If you don't want every single year, add this to filter for a range or specific years:
# Example: Keep only links for 2010-2020 target_years = range(2010, 2021) filtered_links = [ link for link in download_links if any(str(year) in link for year in target_years) ] # Or pick specific years like [2015, 2018, 2020] # target_years = [2015, 2018, 2020]
This part will loop through the links, download each zip, extract the Excel file inside, read it into a DataFrame, and combine everything:
# Initialize a list to hold each year's data df_list = [] for link in filtered_links: print(f"Processing {link}...") # Download the zip file zip_response = requests.get(link) # Read the zip in memory (no need to save to disk) with ZipFile(io.BytesIO(zip_response.content)) as zip_file: # Get the Excel filename from the zip (there's usually only one per zip) excel_files = [f for f in zip_file.namelist() if f.endswith((".xls", ".xlsx"))] if not excel_files: print(f"No Excel file found in {link}") continue excel_filename = excel_files[0] # Choose the right engine based on file type engine = "openpyxl" if excel_filename.endswith(".xlsx") else "xlrd" # Read the Excel file into a DataFrame with zip_file.open(excel_filename) as excel_file: df = pd.read_excel(excel_file, engine=engine) # Add a year column to track where the data came from year = link.split("_")[-1].replace(".zip", "") df["Data_Year"] = year df_list.append(df) # Add a small delay to avoid overwhelming the server time.sleep(1) # Combine all DataFrames into one combined_df = pd.concat(df_list, ignore_index=True) # Check the result print("\nCombined DataFrame preview:") print(combined_df.head())
- Website Structure Changes: If the CFTC updates their page layout or file naming, you might need to adjust the filter (
"fut_fin_xls_") to match the new link patterns. - Rate Limiting: The
time.sleep(1)adds a 1-second pause between downloads to be polite to the server—don't remove this unless you want to risk getting blocked. - Troubleshooting: If you get errors reading Excel files, double-check that you installed both
openpyxlandxlrd==1.2.0. - Saving the Data: To save your combined DataFrame to a local Excel file, run
combined_df.to_excel("cftc_financial_futures_data.xlsx", index=False)
内容的提问来源于stack exchange,提问作者tsmith41094




