Python beginner help: elements visible in DevTools can't be located when scraping a web page
Hey there! Let’s break down this super common issue you’re facing—seeing elements in your browser’s DevTools but not being able to scrape them with requests and BeautifulSoup. I’ve run into this exact problem with dynamic e-commerce sites like Mercari and Vinted, so let’s walk through the reasons and fixes.
Common Causes
1. JavaScript-Rendered Content
Modern sites like Mercari and Vinted use frontend frameworks (think React, Vue, or custom Web Components) that load their content dynamically after the initial HTML page loads. When you use requests.get(), you’re only fetching the raw, unprocessed HTML sent by the server—none of the content that gets added later by JavaScript.
For example:
- Mercari’s mer-text is a custom Web Component, which doesn’t exist in the initial HTML. Your browser parses the JS and renders these elements after the page loads, but your scraper never runs that JS.
- Vinted’s product listings are also injected into the page via JavaScript, so the initial HTML won’t have the span elements you’re looking for.
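To make this concrete, here is a minimal sketch using only the standard library. The HTML string is a made-up stand-in for what a server might send for a JavaScript-rendered page (it is not Mercari's real markup), and TagCounter and count_tags are hypothetical helper names. A static parser sees the script tag but never the element that the script would create.

```python
from html.parser import HTMLParser

# Hypothetical raw HTML: an empty container plus a script that would
# build the visible element once a browser executes it.
RAW_HTML = """
<html>
  <body>
    <div id="root"></div>
    <script>
      const el = document.createElement("mer-text");
      el.textContent = "Pachinko machine";
      document.getElementById("root").appendChild(el);
    </script>
  </body>
</html>
"""


class TagCounter(HTMLParser):
    """Counts start tags by name, the way a static parser sees them."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


def count_tags(html):
    """Parse the HTML without running any JavaScript and count its tags."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


counts = count_tags(RAW_HTML)
print(counts.get("mer-text", 0))  # 0: the element only exists after JS runs
print(counts.get("script", 0))    # 1: the script itself is in the raw HTML
```

BeautifulSoup behaves the same way on result.text: it parses what the server sent and nothing more.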
2. Missing or Unverified Request Headers
Many websites check request headers to block automated scrapers. The default requests library sends a User-Agent like python-requests/2.31.0, which is a dead giveaway that you’re not using a real browser. Servers might respond with stripped-down content or even empty pages when they detect this.
3. Anti-Scraping Measures
E-commerce platforms take anti-scraping seriously. They might use:
- IP rate limiting (blocking you if you request too often)
- Cookie-based authentication (requiring valid session cookies that your scraper doesn’t have)
- Bot detection tools (like Cloudflare) that require browser-like behavior to pass
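A rough way to narrow down which of these causes you are hitting is to look at the status code and check whether some text you can see in the browser appears in the raw response. The sketch below is only a heuristic: diagnose is a hypothetical helper, and the status-code groupings are assumptions about typical server behavior, not rules that every site follows.

```python
def diagnose(status_code, html, expected_text):
    """Return a rough guess at why scraped HTML lacks expected content."""
    if status_code == 429:
        # 429 Too Many Requests usually signals rate limiting.
        return "rate limited"
    if status_code in (401, 403):
        # Often returned when cookies are missing or a bot check failed.
        return "blocked or authentication required"
    if status_code != 200:
        return f"unexpected status {status_code}"
    if expected_text in html:
        return "content present in raw HTML"
    # A normal 200 response without the text points at client-side rendering.
    return "content missing from raw HTML (likely JavaScript-rendered)"


print(diagnose(200, "<div id='root'></div>", "Pachinko"))
```

With requests you could call it as diagnose(result.status_code, result.text, "a product title you see in the browser").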
Fixes to Try
1. Use a Headless Browser (Recommended for Dynamic Sites)
Browser automation tools like Playwright or Puppeteer drive a real browser in headless mode: they load JavaScript, render the page, and let you interact with it just like a human would. This is the most reliable way to scrape dynamic content.
Here’s a quick Playwright example for your Mercari scraper:
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        # Launch a headless Chromium browser
        browser = p.chromium.launch(headless=True)
        page = browser.new_page(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36"
        )

        # Navigate to the search page and wait for the elements to load
        page.goto("https://jp.mercari.com/search?keyword=pachinko")
        page.wait_for_selector("mer-text")  # Wait until the elements exist

        # Extract text from each mer-text element
        for elem in page.query_selector_all("mer-text"):
            print(elem.inner_text())

        browser.close()
2. Add Proper Request Headers
If the site only checks basic headers, you can mimic a real browser’s request headers to get the full content. Here’s how to update your Vinted code:
    import requests
    from bs4 import BeautifulSoup

    # Mimic a real Chrome browser's headers
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
    }

    url = "https://www.vinted.co.uk/vetements?search_text=pachinko"
    result = requests.get(url, headers=headers)
    doc = BeautifulSoup(result.text, "html.parser")

    # Instead of fetching all spans, target specific ones
    # (use DevTools to find their classes/IDs)
    target_spans = doc.find_all("span", class_="item-title")  # Replace with actual class from Vinted
    print(target_spans)
Note: This might not work for Vinted long-term since their content is heavily JS-rendered, but it’s worth testing for simpler sites.
3. Scrape the Direct API Endpoint
Many sites load their data via API calls (check your browser’s DevTools > Network tab > XHR/fetch requests). You can find the API URL that returns the product data in JSON format, then scrape that directly—this is faster and more reliable than parsing HTML.
For example, when you search Mercari, look for a request to an endpoint like https://jp.mercari.com/api/v1/items/search (the exact URL might vary). You can copy the request headers and parameters from DevTools, then use requests to call that API and get the raw JSON data.
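Here is a minimal sketch of that workflow. Everything site-specific in it is an assumption: the endpoint on example.com, the keyword parameter, and the {"items": [{"name": ...}]} response shape are placeholders to replace with what DevTools actually shows. It uses the standard library so it runs without extra installs; the same URL and headers can be passed to requests.get instead.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical endpoint: copy the real URL from DevTools > Network.
API_URL = "https://example.com/api/items/search"


def build_request(keyword, user_agent):
    """Build a GET request that mirrors what the browser sent."""
    query = urlencode({"keyword": keyword})  # assumed parameter name
    return Request(
        f"{API_URL}?{query}",
        headers={"User-Agent": user_agent, "Accept": "application/json"},
    )


def extract_names(payload):
    """Pull item names out of a decoded JSON payload (assumed shape)."""
    return [item["name"] for item in payload.get("items", [])]


def fetch_names(keyword, user_agent):
    """Call the endpoint and return item names (performs a network request)."""
    with urlopen(build_request(keyword, user_agent), timeout=10) as resp:
        return extract_names(json.load(resp))


# Offline demonstration with a hand-written sample response.
sample = json.loads(
    '{"items": [{"name": "Pachinko machine"}, {"name": "Pachinko balls"}]}'
)
print(extract_names(sample))
```

Keeping the parsing in its own function means you can test it against a saved response before sending any live requests.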
Important Notes
- Always check the site’s robots.txt (e.g., https://jp.mercari.com/robots.txt) and Terms of Service to make sure scraping is allowed.
- Add delays between requests (e.g., time.sleep(2)) to avoid triggering rate limits.
- If you run into IP blocks, consider using a proxy pool to rotate your IP address.
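The first two notes can be sketched with the standard library. The robots.txt content, the example.com URLs, and the my-scraper user agent are made up for illustration; a real scraper would load the site's own file with set_url(...) followed by read(). The helper names make_parser, allowed_urls, and polite_iter are hypothetical.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content used for an offline demonstration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def make_parser(robots_text):
    """Build a robots.txt parser from text that was already downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def allowed_urls(parser, urls, user_agent="my-scraper"):
    """Keep only the URLs that robots.txt permits for this user agent."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


def polite_iter(urls, delay_seconds=2.0):
    """Yield URLs one at a time, sleeping between consecutive items."""
    for index, url in enumerate(urls):
        if index:
            time.sleep(delay_seconds)
        yield url


candidates = [
    "https://example.com/search?keyword=pachinko",
    "https://example.com/private/account",
]
for url in polite_iter(allowed_urls(make_parser(ROBOTS_TXT), candidates)):
    print(url)
```

Note that robots.txt only covers crawling rules; the Terms of Service still need to be read separately.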
The question comes from Stack Exchange; asked by Leng1.




