How do I scrape McDonald's income statement with BeautifulSoup and convert it to a Pandas DataFrame?

Solution to Scrape MCD Income Statement from Yahoo Finance into DataFrame

Ah, I’ve run into this exact headache with Yahoo Finance’s financial pages before! The problem is that the income table isn’t present as standard <tr>/<td> elements in the static HTML response. Instead, the financial data sits in a JSON object inside one of the page’s <script> tags, and JavaScript builds the visible table from it in the browser. Here’s how to extract that JSON and turn it into a pandas DataFrame:

Step-by-Step Breakdown

  1. Find the embedded JSON: The data lives in a <script> tag that defines the window.__INITIAL_STATE__ variable. We’ll pull this JSON string and parse it into a Python dictionary.
  2. Traverse the JSON structure: Navigate the nested dictionary to locate the annual (or quarterly) income statement history.
  3. Reshape the data: Convert the nested JSON entries into a format pandas can easily convert into a table.
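
Step 1 can be tried offline before touching the live page. The sketch below is only an illustration: SAMPLE_HTML is a made-up miniature page and extract_state is a hypothetical helper name, neither of which comes from Yahoo Finance. It uses json.JSONDecoder().raw_decode, which parses a single JSON value starting at a given offset and ignores whatever follows, so a semicolon inside a string value does not cut the data short.

```python
import json
import re

# Hypothetical miniature page; the real Yahoo Finance markup is far larger.
SAMPLE_HTML = """
<html><body>
<script>window.__INITIAL_STATE__ = {"note": "a; b", "context": {"value": 1}};
window.other = 2;</script>
</body></html>
"""


def extract_state(html, marker="window.__INITIAL_STATE__"):
    """Return the JSON value assigned to `marker`, or None if the marker is absent."""
    match = re.search(re.escape(marker) + r"\s*=\s*", html)
    if match is None:
        return None
    # raw_decode parses one JSON value starting at the given index
    # and returns it together with the index where parsing stopped.
    value, _end = json.JSONDecoder().raw_decode(html, match.end())
    return value


state = extract_state(SAMPLE_HTML)
print(state["context"]["value"])
```

A pattern that stops at the first ';' would have truncated this sample inside the "a; b" string, which is why the sketch hands the parsing to the JSON decoder instead of a regular expression.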

Complete Working Code

import requests
import re
import json
import pandas as pd
from bs4 import BeautifulSoup

url = "https://finance.yahoo.com/quote/MCD/financials?p=MCD"
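# Yahoo often rejects the default requests User-Agent, so send a browser-like one (assumed value)
headers = {"User-Agent": "Mozilla/5.0"}
result = requests.get(url, headers=headers, timeout=30)
result.raise_for_status()
src = result.content  # raw bytes; BeautifulSoup detects the encoding itself
soup = BeautifulSoup(src, 'lxml')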

# Locate the script tag holding the initial state JSON
script_tags = soup.find_all('script')
target_script = None
for script in script_tags:
    if script.string and '__INITIAL_STATE__' in script.string:
        target_script = script.string
        break

if not target_script:
    raise ValueError("Could not find the script tag containing financial data")

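# Extract and parse the JSON content.
# Matching up to the first ';' would truncate the JSON whenever a string value contains
# a semicolon, so locate the assignment and let the JSON decoder find where the value ends.
marker = re.search(r'window\.__INITIAL_STATE__\s*=\s*', target_script)
if not marker:
    raise ValueError("Failed to extract JSON data from the script")

financial_data, _ = json.JSONDecoder().raw_decode(target_script, marker.end())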

# Navigate to the annual income statement data
income_history = financial_data['context']['dispatcher']['stores']['QuoteSummaryStore']['incomeStatementHistory']['incomeStatementHistory']

# Format data for DataFrame
formatted_data = {}
for year_entry in income_history:
    # Extract the year from the date string (e.g., "2023-12-31" → "2023")
    year = year_entry['endDate']['fmt'].split('-')[0]
    # Iterate through each line item in the statement
    for item_key, item_value in year_entry.items():
        if item_key not in ['endDate', 'maxAge']:  # Skip non-financial fields
            if item_key not in formatted_data:
                formatted_data[item_key] = {}
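            # Use the formatted value (e.g., "25.41B"); switch 'fmt' to 'raw' for plain numbers.
            # Unreported values may appear as empty dicts, so use .get() rather than indexing.
            formatted_data[item_key][year] = item_value.get('fmt') if isinstance(item_value, dict) else None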

# Convert to DataFrame and clean up
income_df = pd.DataFrame(formatted_data).T
income_df.index.name = 'Income Statement Item'
print(income_df)
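
To see what the reshaping loop produces without hitting the network, here is a small sketch run on made-up entries. The values are placeholders and not real MCD figures, and pivot is a hypothetical helper name. It mirrors the loop above, but reads the 'raw' field and tolerates entries that have no value.

```python
# Made-up entries mimicking the shape the loop expects; not real MCD figures.
sample_history = [
    {"endDate": {"fmt": "2023-12-31"}, "maxAge": 1,
     "totalRevenue": {"raw": 300, "fmt": "300"},
     "netIncome": {"raw": 100, "fmt": "100"}},
    {"endDate": {"fmt": "2022-12-31"}, "maxAge": 1,
     "totalRevenue": {"raw": 200, "fmt": "200"},
     "netIncome": {}},  # empty dict stands in for an unreported value
]


def pivot(history, field="raw"):
    """Build {line_item: {year: value}} from a list of statement entries."""
    table = {}
    for entry in history:
        year = entry["endDate"]["fmt"].split("-")[0]
        for key, value in entry.items():
            if key in ("endDate", "maxAge"):  # skip non-financial fields
                continue
            cell = value.get(field) if isinstance(value, dict) else None
            table.setdefault(key, {})[year] = cell
    return table


table = pivot(sample_history)
print(table)
# {'totalRevenue': {'2023': 300, '2022': 200}, 'netIncome': {'2023': 100, '2022': None}}
```

Passing this dictionary to pd.DataFrame(table).T gives one row per line item and one column per year, the same layout as income_df.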

Notes & Troubleshooting

  • JSON structure changes: Yahoo Finance occasionally updates their page layout. If the code breaks, inspect the __INITIAL_STATE__ JSON (using browser dev tools) to confirm the path to incomeStatementHistory is still valid.
  • Quarterly data: Replace the first incomeStatementHistory key in the JSON path with incomeStatementHistoryQuarterly to pull quarterly results instead of annual. The second, inner incomeStatementHistory key should stay as it is.
  • Raw numerical values: Replace item_value['fmt'] with item_value['raw'] if you need unformatted numbers (e.g., 25410000000 instead of "25.41B").
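
When the JSON path stops working, a chained lookup fails with a bare KeyError that does not say how far it got. The sketch below is one possible way to make that easier to diagnose. The helper name dig and the fake_state dictionary are hypothetical, and the QUARTERLY path is an assumption about Yahoo's layout that should be checked in browser dev tools.

```python
def dig(data, path):
    """Follow `path` through nested dicts; raise KeyError naming the first missing key."""
    current = data
    for depth, key in enumerate(path):
        if not isinstance(current, dict) or key not in current:
            walked = " -> ".join(path[:depth]) or "<root>"
            raise KeyError(f"'{key}' not found under {walked}")
        current = current[key]
    return current


STORE = ["context", "dispatcher", "stores", "QuoteSummaryStore"]
ANNUAL = STORE + ["incomeStatementHistory", "incomeStatementHistory"]
# Assumed layout: only the outer key changes for quarterly data.
QUARTERLY = STORE + ["incomeStatementHistoryQuarterly", "incomeStatementHistory"]

# Hypothetical stand-in for the parsed page state, holding annual data only.
fake_state = {"context": {"dispatcher": {"stores": {"QuoteSummaryStore": {
    "incomeStatementHistory": {"incomeStatementHistory": []}}}}}}

print(dig(fake_state, ANNUAL))
try:
    dig(fake_state, QUARTERLY)
except KeyError as err:
    print("Lookup failed:", err)
```

In the main script, dig(financial_data, ANNUAL) could stand in for the long chained lookup, so a layout change reports which key went missing.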

This question comes from Stack Exchange; asked by Arthur Law.
