
Python beginner question: tags visible in DevTools cannot be located when scraping web data

Why Elements Visible in DevTools Aren’t Showing Up in Your Web Scrapers

Hey there! Let’s break down this super common issue you’re facing—seeing elements in your browser’s DevTools but not being able to scrape them with requests and BeautifulSoup. I’ve run into this exact problem with dynamic e-commerce sites like Mercari and Vinted, so let’s walk through the reasons and fixes.

Common Causes

1. JavaScript-Rendered Content

Modern sites like Mercari and Vinted use frontend frameworks (think React, Vue, or custom Web Components) that load their content dynamically after the initial HTML page loads. When you use requests.get(), you’re only fetching the raw, unprocessed HTML sent by the server—none of the content that gets added later by JavaScript.

For example:

  • Mercari’s mer-text is a custom element (a Web Component) that is absent from the initial HTML. Your browser executes the JavaScript that defines and inserts these elements after the page loads, but your scraper never runs that JavaScript.
  • Vinted’s product listings are also injected into the page via JavaScript, so the initial HTML won’t have the span elements you’re looking for.
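
To make the gap concrete, here is a minimal sketch that uses only the standard library. Both HTML strings are made-up stand-ins, not real Mercari markup: one imitates what the server sends, the other imitates the DOM that DevTools shows after the script has run.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count how many times a given start tag appears in an HTML string."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag.lower()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this hook
        if tag == self.tag:
            self.count += 1


def count_tags(html, tag):
    parser = TagCounter(tag)
    parser.feed(html)
    parser.close()
    return parser.count


# Hypothetical stand-in for the response body that requests.get() receives
RAW_HTML = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>'

# Hypothetical stand-in for the DOM that DevTools shows after the script has run
RENDERED_HTML = (
    '<html><body><div id="root">'
    "<mer-text>Item A</mer-text><mer-text>Item B</mer-text>"
    '</div><script src="/app.js"></script></body></html>'
)

print(count_tags(RAW_HTML, "mer-text"))       # 0
print(count_tags(RENDERED_HTML, "mer-text"))  # 2
```

A quick real-world check along the same lines is to compare "View Page Source" (the raw response) with the Elements panel (the live DOM): if the tag only appears in the latter, it was added by JavaScript.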

2. Missing or Non-Browser Request Headers

Many websites check request headers to block automated scrapers. By default, the requests library sends a User-Agent like python-requests/2.31.0 (the number is the installed version), which is a dead giveaway that you’re not using a real browser. Servers might respond with stripped-down content or even empty pages when they detect this.
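
If you want to see exactly what your scraper announces about itself, this short sketch prints the headers a fresh requests session would send. It makes no network call, and the exact version number in the output depends on your installation.

```python
import requests

# Headers that a fresh requests session sends when you do not override them
default_headers = dict(requests.Session().headers)
for name, value in default_headers.items():
    print(f"{name}: {value}")

# requests.utils.default_user_agent() returns the same User-Agent string
print(requests.utils.default_user_agent())
```

Notice that there is no Accept-Language header and that Accept is just */*, both of which differ from what a real browser sends.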

3. Anti-Scraping Measures

E-commerce platforms take anti-scraping seriously. They might use:

  • IP rate limiting (blocking you if you request too often)
  • Cookie-based authentication (requiring valid session cookies that your scraper doesn’t have)
  • Bot detection tools (like Cloudflare) that require browser-like behavior to pass
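
When a scraper gets nothing useful back, it helps to guess which of these measures you hit before changing anything. The function below is only a rough heuristic sketch: the status codes are standard HTTP, but the "just a moment" marker and the MIN_BODY_CHARS threshold are assumptions chosen for illustration, not documented behavior of any particular site.

```python
MIN_BODY_CHARS = 500  # assumed threshold for "too short to be a real page"


def classify_response(status_code, body):
    """Rough guess at why a response has no usable content (heuristic only)."""
    if status_code == 429:
        # 429 Too Many Requests: the standard signal for rate limiting
        return "rate_limited"
    if status_code in (401, 403):
        return "blocked_or_auth_required"
    # Assumed marker: challenge pages often carry a short "please wait" title
    if status_code == 503 or "just a moment" in body.lower():
        return "bot_challenge"
    if status_code == 200 and len(body.strip()) < MIN_BODY_CHARS:
        return "suspiciously_empty"
    return "ok"


print(classify_response(429, ""))
print(classify_response(200, "<html><title>Just a moment...</title></html>"))
```

With requests you would call it as classify_response(result.status_code, result.text).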

Fixes to Try

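1. Use a Headless Browser

Browser automation tools like Playwright or Puppeteer drive a real browser in headless mode: they execute JavaScript, render the page, and let you interact with it much as a user would. This is usually the most reliable way to scrape dynamic content, though it is slower and heavier than plain HTTP requests.

Here’s a quick Playwright example for your Mercari scraper (install it first with pip install playwright, then run playwright install chromium to download the browser):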

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # Launch a headless Chromium browser
    browser = p.chromium.launch(headless=True)
    page = browser.new_page(
        user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36"
    )
    
    # Navigate to the search page and wait for the elements to load
    page.goto("https://jp.mercari.com/search?keyword=pachinko")
    page.wait_for_selector("mer-text")  # Wait until the elements exist
    
    # Extract text from each mer-text element
    for elem in page.query_selector_all("mer-text"):
        print(elem.inner_text())
    
    browser.close()

2. Add Proper Request Headers

If the site only checks basic headers, you can mimic a real browser’s request headers to get the full content. Here’s how to update your Vinted code:

import requests
from bs4 import BeautifulSoup

# Mimic a real Chrome browser's headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8"
}

url = "https://www.vinted.co.uk/vetements?search_text=pachinko"
result = requests.get(url, headers=headers, timeout=10)
result.raise_for_status()  # stop early on 4xx/5xx instead of parsing an error page
doc = BeautifulSoup(result.text, "html.parser")

# Instead of fetching all spans, target specific ones (use DevTools to find their classes/IDs)
target_spans = doc.find_all("span", class_="item-title")  # Placeholder class: replace with the actual one from Vinted
for span in target_spans:
    print(span.get_text(strip=True))  # print the text, not the whole tag

Note: This might not work for Vinted long-term since their content is heavily JS-rendered, but it’s worth testing for simpler sites.

3. Scrape the Direct API Endpoint

Many sites load their data via API calls (check your browser’s DevTools > Network tab > XHR/fetch requests). You can find the API URL that returns the product data in JSON format, then scrape that directly—this is faster and more reliable than parsing HTML.

For example, when you search Mercari, look for a request to an endpoint like https://jp.mercari.com/api/v1/items/search (the exact URL might vary). You can copy the request headers and parameters from DevTools, then use requests to call that API and get the raw JSON data.
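
As a rough sketch of that workflow: the URL, the keyword parameter, and the "items"/"name"/"price" keys below are all hypothetical placeholders. Substitute whatever you actually see in the Network tab, including any headers the real request carries.

```python
import requests

# Hypothetical endpoint and parameters: copy the real ones from the Network tab
API_URL = "https://example.com/api/v1/items/search"


def fetch_search_json(keyword, headers=None):
    """Call the JSON endpoint directly instead of parsing rendered HTML."""
    response = requests.get(
        API_URL, params={"keyword": keyword}, headers=headers, timeout=10
    )
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.json()


def extract_items(payload):
    """Pull (name, price) pairs out of a payload shaped like {"items": [...]}.

    The "items", "name", and "price" keys are assumptions for illustration.
    """
    return [(item.get("name"), item.get("price")) for item in payload.get("items", [])]


# Offline demo with a made-up payload, so nothing here hits the network
sample_payload = {"items": [{"name": "Pachinko machine", "price": 12000}]}
print(extract_items(sample_payload))
```

Keeping the parsing in its own function means you can test it against a response saved from DevTools before making any live requests. Some APIs also require tokens or signed headers that are generated in the browser, in which case copying the URL alone will not be enough.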

Important Notes

  • Always check the site’s robots.txt (e.g., https://jp.mercari.com/robots.txt) and Terms of Service to make sure scraping is allowed.
  • Add delays between requests (e.g., time.sleep(2)) to avoid triggering rate limits.
  • If you run into IP blocks, consider using a proxy pool to rotate your IP address.
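
The first two notes can be sketched with the standard library alone. The robots.txt rules, the example.com URLs, and the my-scraper name below are made up for illustration; a real scraper would load the site’s own file with set_url(...) followed by read().

```python
import random
import time
from urllib.robotparser import RobotFileParser

# Made-up robots.txt rules for illustration only
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /api/
Crawl-delay: 2
"""

USER_AGENT = "my-scraper"  # hypothetical name

robots = RobotFileParser()
robots.parse(SAMPLE_ROBOTS_TXT.splitlines())


def polite_delay(base_seconds, jitter_seconds=1.0):
    """Return a wait time of at least base_seconds, plus random jitter."""
    return base_seconds + random.uniform(0, jitter_seconds)


def crawl(urls, fetch, base_seconds, jitter_seconds=1.0):
    """Fetch each URL that robots.txt allows, sleeping between requests."""
    results = []
    for url in urls:
        if not robots.can_fetch(USER_AGENT, url):
            continue  # skip paths the site asks crawlers to avoid
        results.append(fetch(url))
        time.sleep(polite_delay(base_seconds, jitter_seconds))
    return results


urls = [
    "https://example.com/search?keyword=pachinko",
    "https://example.com/api/items",
]
allowed = [url for url in urls if robots.can_fetch(USER_AGENT, url)]
base_delay = robots.crawl_delay(USER_AGENT) or 2  # fall back to 2 s if unset
print(allowed, base_delay)
```

In practice you would pass something like lambda url: requests.get(url, timeout=10) as fetch. The random jitter is a design choice: it avoids a perfectly regular request rhythm without ever dropping below the base delay.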

The question in this post was sourced from Stack Exchange; question author: Leng1
