pandas.read_html fails to fetch SocialBlade's table of the top 100 most-viewed YouTube news channels; looking for the simplest solution
Problem Description
I'm trying to use pandas.read_html to scrape the table of YouTube's top 100 news channels by views from SocialBlade. First, I tried the direct approach:

    import pandas as pd

    df = pd.read_html('https://socialblade.com/youtube/top/category/news/mostviewed')

But this threw an HTTPError: HTTP Error 403: Forbidden. After checking related discussions, I tried adding browser headers with requests:

    import requests
    import pandas as pd

    url = 'https://socialblade.com/youtube/top/category/news/mostviewed'
    header = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
        "X-Requested-With": "XMLHttpRequest"
    }
    df = pd.read_html(requests.get(url, headers=header).text)

Now I get ValueError: No tables found. What's the simplest way to convert this table into a pandas DataFrame?
Solution
1. Use pandas.read_html with Updated Headers (Simplest Static Approach)
The 403 error happens because SocialBlade rejects requests that don't look like they come from a real browser. Instead of fetching the page separately with requests, you can have pandas.read_html send the headers itself: in pandas 2.1 and later, the storage_options argument is forwarded as HTTP headers for http(s) URLs (read_html has no headers parameter). This keeps the code concise.
Try a modern User-Agent plus a standard Accept-Language header:

    import pandas as pd

    url = 'https://socialblade.com/youtube/top/category/news/mostviewed'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36',
        'Accept-Language': 'en-US,en;q=0.9',
    }

    # Fetch tables with browser-like headers (requires pandas >= 2.1)
    dfs = pd.read_html(url, storage_options=headers)

    # Assumes the target channel table is the first one in the returned list
    df = dfs[0]
    print(df.head())
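If your pandas version cannot send custom headers from read_html itself (header forwarding through storage_options arrived in pandas 2.1), one possible workaround is to fetch the page with the standard library and hand the HTML to pandas afterwards. A minimal sketch: fetch_html is a hypothetical helper name introduced here, and the 30-second timeout is an arbitrary assumption.

```python
from urllib.request import Request, urlopen


def fetch_html(url, headers, timeout=30):
    """Fetch a URL with custom request headers and return the decoded body.

    Hypothetical helper, not part of pandas. urlopen raises
    urllib.error.HTTPError for responses such as 403 Forbidden.
    """
    request = Request(url, headers=headers)
    with urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the response does not declare a charset
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

The returned string can then be parsed with pd.read_html(StringIO(fetch_html(url, headers))), with StringIO imported from io. Whether the server accepts the request still depends on the headers you send.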
2. Fallback: Use Selenium for Dynamic Content
If the above fails, it's likely the table is loaded dynamically with JavaScript (which requests can't render). Selenium will launch a headless browser to fully load the page, ensuring you get the complete HTML with the table.
First, install Selenium:
pip install selenium
Then run this code:
    from io import StringIO

    import pandas as pd
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    url = 'https://socialblade.com/youtube/top/category/news/mostviewed'

    # Configure headless Chrome to run in the background
    chrome_options = Options()
    chrome_options.add_argument('--headless=new')
    chrome_options.add_argument('--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36')

    # Launch the browser and load the page
    driver = webdriver.Chrome(options=chrome_options)
    try:
        driver.get(url)
        # Wrap the rendered page source in StringIO: passing a raw HTML
        # string to read_html is deprecated since pandas 2.1
        dfs = pd.read_html(StringIO(driver.page_source))
    finally:
        # Clean up: close the browser even if loading or parsing fails
        driver.quit()

    # Assumes the target channel table is the first one in the returned list
    df = dfs[0]
    print(df.head())
Why Your Original Code Failed
- First attempt: pandas fetched the URL through urllib with Python's default headers, so no browser-like User-Agent was sent. The server flagged the request as a bot and returned a 403 Forbidden error.
- Second attempt: The older User-Agent and unnecessary X-Requested-With header may have gotten you past the 403, but the server returned an incomplete or non-standard HTML response that pandas.read_html couldn't parse for tables.
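To see which case you are in, it can help to count the <table> elements in the HTML the server actually returned before handing it to pandas. A minimal sketch using only the standard library: TableCounter and count_tables are hypothetical names introduced here for illustration.

```python
from html.parser import HTMLParser


class TableCounter(HTMLParser):
    """Counts opening <table> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.tables = 0

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names before calling this hook
        if tag == "table":
            self.tables += 1


def count_tables(html):
    """Return the number of <table> start tags found in an HTML string."""
    parser = TableCounter()
    parser.feed(html)
    parser.close()
    return parser.tables
```

A count of 0 for the text returned by requests.get(url, headers=header).text would suggest that the rows are rendered with other elements or injected by JavaScript, which is the situation the Selenium fallback is meant for.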
This question originates from Stack Exchange; asked by Arturo Sbr.




