How do I scrape McDonald's income statement with BeautifulSoup and convert it to a Pandas DataFrame?
Solution to Scrape MCD Income Statement from Yahoo Finance into DataFrame
Ah, I've run into this exact headache with Yahoo Finance's financial pages before! The problem is that the income table isn't present as standard <tr>/<td> elements in the static HTML that requests receives. Instead, all the financial data is tucked away in a JSON object inside one of the page's <script> tags, and JavaScript builds the visible table from it in the browser. Here's how to extract that JSON and turn it into a pandas DataFrame:
Step-by-Step Breakdown
- Find the embedded JSON: The data lives in a <script> tag that defines the window.__INITIAL_STATE__ variable. We'll pull this JSON string and parse it into a Python dictionary.
- Traverse the JSON structure: Navigate the nested dictionary to locate the annual (or quarterly) income statement history.
- Reshape the data: Convert the nested JSON entries into a format pandas can easily convert into a table.
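The first step can be tried offline before touching the live page. The sketch below is only an illustration under assumptions: the SAMPLE_SCRIPT text and the extract_state helper are made up for this example, and the real script content is far larger and may be laid out differently. It relies on json.JSONDecoder.raw_decode, which parses exactly one JSON value and ignores whatever follows, so a semicolon inside a string value does not cut the payload short.

```python
import json

# Hypothetical stand-in for the text of the <script> tag (invented for this sketch).
SAMPLE_SCRIPT = (
    'window.__INITIAL_STATE__ = {"note": "a; b", '
    '"rows": [{"endDate": {"raw": 1703980800, "fmt": "2023-12-31"}}]};\n'
    'window.other = 1;'
)


def extract_state(script_text, marker="__INITIAL_STATE__"):
    """Return the JSON object assigned to `marker` inside script_text."""
    marker_pos = script_text.find(marker)
    if marker_pos == -1:
        raise ValueError(f"{marker} not found")
    # The JSON object starts at the first "{" after the marker.
    start = script_text.find("{", marker_pos)
    if start == -1:
        raise ValueError("no JSON object after the marker")
    # raw_decode parses one JSON value and returns it with the end offset.
    state, _end = json.JSONDecoder().raw_decode(script_text[start:])
    return state


state = extract_state(SAMPLE_SCRIPT)
print(state["note"])
print(state["rows"][0]["endDate"]["fmt"])
```

A pattern that stops at the first ";" would have returned a truncated string for this sample, because the "note" value itself contains a semicolon.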
Complete Working Code
    import json

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup

    url = "https://finance.yahoo.com/quote/MCD/financials?p=MCD"
    # Yahoo may reject requests that lack a browser-like User-Agent
    headers = {"User-Agent": "Mozilla/5.0"}
    result = requests.get(url, headers=headers, timeout=30)
    result.raise_for_status()
    soup = BeautifulSoup(result.content, 'lxml')

    # Locate the script tag holding the initial state JSON
    marker = '__INITIAL_STATE__'
    target_script = None
    for script in soup.find_all('script'):
        if script.string and marker in script.string:
            target_script = script.string
            break
    if not target_script:
        raise ValueError("Could not find the script tag containing financial data")

    # Extract and parse the JSON content. A non-greedy regex that stops at the
    # first ";" truncates the payload whenever a string value contains a
    # semicolon, so parse exactly one JSON value starting at the first "{"
    # after the marker instead.
    start = target_script.find('{', target_script.find(marker))
    if start == -1:
        raise ValueError("Failed to extract JSON data from the script")
    financial_data, _ = json.JSONDecoder().raw_decode(target_script[start:])

    # Navigate to the annual income statement data
    income_history = financial_data['context']['dispatcher']['stores']['QuoteSummaryStore']['incomeStatementHistory']['incomeStatementHistory']

    # Format data for DataFrame
    formatted_data = {}
    for year_entry in income_history:
        # Extract the year from the date string (e.g., "2023-12-31" → "2023")
        year = year_entry['endDate']['fmt'].split('-')[0]
        # Iterate through each line item in the statement
        for item_key, item_value in year_entry.items():
            if item_key in ('endDate', 'maxAge'):  # Skip non-financial fields
                continue
            # Use 'fmt' for display strings (e.g., "25.41B") or 'raw' for numbers.
            # Line items with no data (empty dicts) are recorded as None.
            value = item_value.get('fmt') if isinstance(item_value, dict) else None
            formatted_data.setdefault(item_key, {})[year] = value

    # Convert to DataFrame and clean up
    income_df = pd.DataFrame(formatted_data).T
    income_df.index.name = 'Income Statement Item'
    print(income_df)
Notes & Troubleshooting
- JSON structure changes: Yahoo Finance occasionally updates its page layout. If the code breaks, inspect the __INITIAL_STATE__ JSON (using browser dev tools) to confirm the path to incomeStatementHistory is still valid.
- Quarterly data: Swap incomeStatementHistory with incomeStatementHistoryQuarterly in the JSON path to pull quarterly results instead of annual. Because several quarters share the same year, key the columns by the full endDate string rather than by year in that case, or later quarters will overwrite earlier ones.
- Raw numerical values: Replace item_value['fmt'] with item_value['raw'] if you need unformatted numbers (e.g., 25410000000 instead of "25.41B").
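For the first and third notes, two small helpers can make failures easier to diagnose and the raw values easier to collect. This is a minimal sketch, not part of the answer's code: dig and to_table are hypothetical names, and the mock payload uses invented figures shaped like the structure described above.

```python
def dig(data, path):
    """Follow `path` through nested dicts and name the first missing key."""
    current = data
    for depth, key in enumerate(path):
        if not isinstance(current, dict) or key not in current:
            walked = " -> ".join(path[:depth]) or "<root>"
            raise KeyError(f"'{key}' missing under {walked}")
        current = current[key]
    return current


def to_table(entries, value_key="raw"):
    """Build {line_item: {year: value}} from a list of statement entries."""
    table = {}
    for entry in entries:
        year = entry["endDate"]["fmt"].split("-")[0]
        for name, value in entry.items():
            if name in ("endDate", "maxAge"):
                continue
            # Skip line items that carry no data (for example, empty dicts).
            if isinstance(value, dict) and value_key in value:
                table.setdefault(name, {})[year] = value[value_key]
    return table


# Mock payload with invented figures (assumption: same nesting as described above).
mock = {"context": {"dispatcher": {"stores": {"QuoteSummaryStore": {
    "incomeStatementHistory": {"incomeStatementHistory": [
        {"maxAge": 1,
         "endDate": {"raw": 1703980800, "fmt": "2023-12-31"},
         "totalRevenue": {"raw": 1000, "fmt": "1k"},
         "netIncome": {}},
    ]}}}}}}

path = ["context", "dispatcher", "stores", "QuoteSummaryStore",
        "incomeStatementHistory", "incomeStatementHistory"]
table = to_table(dig(mock, path))
print(table)
```

Passing the result to pd.DataFrame(table).T gives a frame of plain numbers, and when the layout changes, the KeyError message from dig shows how far down the path the lookup got before it failed.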
This question originally comes from Stack Exchange; it was asked by Arthur Law.




