
pandas.read_html fails to fetch SocialBlade's table of the top 100 most-viewed YouTube news channels; looking for the simplest fix

Scrape SocialBlade YouTube News Top 100 Table into pandas DataFrame

Problem Description

I'm trying to use pandas.read_html to scrape the table of YouTube's top 100 news channels by views from SocialBlade. First, I tried the direct approach:

import pandas as pd
df = pd.read_html('https://socialblade.com/youtube/top/category/news/mostviewed')

But this threw an HTTPError: HTTP Error 403: Forbidden. After checking related discussions, I tried adding browser headers with requests:

import requests
import pandas as pd
url = 'https://socialblade.com/youtube/top/category/news/mostviewed'
header = { "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36", "X-Requested-With": "XMLHttpRequest"}
df = pd.read_html(requests.get(url, headers=header).text)

Now I get ValueError: No tables found. What's the simplest way to convert this table into a pandas DataFrame?

Solution

1. Use pandas.read_html with Updated Headers (Simplest Static Approach)

The 403 error happens because SocialBlade rejects requests that don't look like they come from a real browser, and by default pandas fetches URLs with urllib's own User-Agent. pandas.read_html has no headers argument, but since pandas 2.1.0 it accepts storage_options, whose key-value pairs are sent as HTTP headers for http(s) URLs. This keeps the code concise and avoids a separate requests call.

Try a current User-Agent plus a standard Accept-Language header. This only works if the server accepts the request and the ranking is present as a <table> element in the static HTML:

import pandas as pd

url = 'https://socialblade.com/youtube/top/category/news/mostviewed'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
}

# Requires pandas >= 2.1.0: for http(s) URLs, storage_options is sent as request headers
dfs = pd.read_html(url, storage_options=headers)
# read_html returns a list of DataFrames; this assumes the channel table is the first one
df = dfs[0]
print(df.head())
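If it is unclear whether a failure comes from the download or from the parsing, the two steps can be separated. The following is a minimal sketch using only the standard library; build_request and fetch_html are hypothetical helper names introduced here, and the header values are illustrative rather than required.

```python
import urllib.request

# Browser-like headers; the exact values are illustrative, not required by any API.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
}


def build_request(url, headers=None):
    """Return a urllib Request that carries browser-like headers."""
    return urllib.request.Request(url, headers=dict(headers or BROWSER_HEADERS))


def fetch_html(url, headers=None, timeout=30):
    """Download a page and return its decoded text.

    Raises urllib.error.HTTPError (for example 403) if the server
    rejects the request.
    """
    with urllib.request.urlopen(build_request(url, headers), timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")
```

The text returned by fetch_html(url) can then be passed to pd.read_html wrapped in io.StringIO. An HTTPError at the fetch step means the server is still blocking the request, whereas "No tables found" at the parse step means the page arrived but contains no <table> element.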

2. Fallback: Use Selenium for Dynamic Content

If the above fails, the table is probably built dynamically with JavaScript, which neither requests nor pandas can execute. Selenium drives a real browser (headless here) that runs the page's scripts, so the page source it returns should include the rendered table, provided the site serves the page to the automated browser.

First, install Selenium:

pip install selenium

Then run this code:

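from io import StringIO

import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

url = 'https://socialblade.com/youtube/top/category/news/mostviewed'

# Configure headless Chrome to run in the background
chrome_options = Options()
chrome_options.add_argument('--headless=new')
chrome_options.add_argument('--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36')

# Launch the browser and load the page
driver = webdriver.Chrome(options=chrome_options)
try:
    driver.get(url)
    # Wait up to 15 seconds for a <table> to appear; raises TimeoutException if none does
    WebDriverWait(driver, 15).until(EC.presence_of_element_located((By.TAG_NAME, 'table')))
    # Wrap the rendered source in StringIO: passing a literal HTML string is deprecated since pandas 2.1
    dfs = pd.read_html(StringIO(driver.page_source))
finally:
    # Always close the browser, even if loading or parsing fails
    driver.quit()

# Assumes the channel table is the first one found
df = dfs[0]
print(df.head())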

Why Your Original Code Failed

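  • First attempt: pandas fetched the URL with urllib's default User-Agent (Python-urllib/3.x) and no browser-like headers, so the server treated the request as a bot and returned 403 Forbidden.
  • Second attempt: The browser User-Agent got you past the 403, but pandas.read_html only looks for <table> elements and found none in the returned HTML. That usually means the table is rendered later by JavaScript or the page lays out the ranking with non-table markup. The X-Requested-With header marks the request as an AJAX call, which a normal page load would not send, and may also have changed what the server returned.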
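To check whether a "No tables found" error really comes from missing <table> markup, the downloaded HTML can be inspected before handing it to pandas. The sketch below uses only the standard library; TableCounter and count_tables are hypothetical names, and the two sample pages are made-up placeholders, not SocialBlade's actual markup.

```python
from html.parser import HTMLParser


class TableCounter(HTMLParser):
    """Count <table> and <tr> start tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.tables = 0
        self.rows = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method
        if tag == "table":
            self.tables += 1
        elif tag == "tr":
            self.rows += 1


def count_tables(html):
    """Return (number of <table> tags, number of <tr> tags) in the given HTML text."""
    parser = TableCounter()
    parser.feed(html)
    parser.close()
    return parser.tables, parser.rows


# Made-up sample pages for illustration only
static_page = (
    "<html><body><table>"
    "<tr><td>Channel A</td></tr><tr><td>Channel B</td></tr>"
    "</table></body></html>"
)
div_page = "<html><body><div class='row'><div>Channel A</div></div></body></html>"

print(count_tables(static_page))  # (1, 2)
print(count_tables(div_page))     # (0, 0)
```

If count_tables reports zero tables for the text returned by requests.get(url, headers=header).text, pandas.read_html has nothing to parse, and either a rendering browser or a parser that targets the page's actual elements is needed.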

This question originates from Stack Exchange; asked by Arturo Sbr
