Selenium script for scraping given.lv jewelry product images gets stuck on the landing page: how to fix it?

Hey, let's tackle this problem step by step. I've dealt with similar JS-heavy sites with anti-scraping measures, so here's what you can do to make your script work reliably for given.lv:

1. Reliable Navigation for given.lv with Selenium

The main issue is likely anti-bot detection or unhandled page elements (like cookie banners) blocking your script. Here's a modified, robust version of your code:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager
import requests
import os

# Configure Chrome to avoid detection
options = Options()
# Spoof a real user-agent
options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36")
# Disable Selenium's automation flags
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
# Hide navigator.webdriver on every page load; a plain execute_script()
# only patches the current page and is lost after the next navigation
driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
    "source": "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
})

# Navigate to landing page
driver.get("https://given.lv/")

# Handle cookie consent, which can block all interaction; the button text below is a guess, adjust to the real banner
try:
    accept_cookie_btn = WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((By.XPATH, "//button[contains(text(), 'Accept') or contains(text(), 'I agree')]"))
    )
    accept_cookie_btn.click()
except Exception as e:
    print("No cookie popup found or failed to click:", str(e))

# Navigate to collections page (adjust selector to match the site's actual navigation)
try:
    collection_link = WebDriverWait(driver, 15).until(
        EC.element_to_be_clickable((By.LINK_TEXT, "Collections"))
    )
    collection_link.click()
except Exception as e:
    print("Failed to find collections link:", str(e))
    driver.quit()
    exit()

# Wait for product list to load and extract links
# NOTE: "a.product-item-link" is an assumed (Magento-style) selector; verify it in DevTools
product_links = WebDriverWait(driver, 20).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, "a.product-item-link"))
)
product_urls = [link.get_attribute("href") for link in product_links]

# Create folder to save images
os.makedirs("given_lv_jewelry", exist_ok=True)

# Scrape each product page for images
for idx, url in enumerate(product_urls):
    driver.get(url)
    try:
        # Wait for product images to load
        product_images = WebDriverWait(driver, 15).until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, "img.product-image-photo"))
        )
        # Download each image
        for img_idx, img in enumerate(product_images):
            img_src = img.get_attribute("src")
            if img_src:
                img_data = requests.get(img_src, timeout=30).content  # timeout prevents a stalled download from hanging the run
                img_filename = f"given_lv_jewelry/product_{idx+1}_img_{img_idx+1}.jpg"
                with open(img_filename, 'wb') as f:
                    f.write(img_data)
                print(f"Saved: {img_filename}")
    except Exception as e:
        print(f"Failed to scrape product {url}:", str(e))
        continue

driver.quit()

Key Fixes in This Code:

  • Browser Fingerprint Spoofing: Disables Selenium's built-in automation markers and uses a real user-agent to avoid being flagged as a bot.
  • Cookie Consent Handling: Ensures you can interact with the site by accepting cookies first.
  • Explicit Waits: Replaces arbitrary time.sleep() with waits for critical elements, so the script only proceeds when the page is ready.

2. Best Practices for Scraping JS-Heavy Sites

When dealing with sites that rely heavily on JavaScript rendering, follow these guidelines to improve reliability:

  • Prioritize Explicit Waits: Always use WebDriverWait with expected conditions (e.g., EC.element_to_be_clickable, EC.presence_of_all_elements_located) instead of fixed delays. This adapts to varying load times.
  • Use Undetected Chromedriver: For sites with strict anti-bot measures, swap vanilla Selenium for undetected-chromedriver, which patches the ChromeDriver binary so most common bot checks pass out of the box.
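    A minimal sketch (uc.Chrome() is a drop-in replacement for webdriver.Chrome):
    import undetected_chromedriver as uc

    driver = uc.Chrome()
    driver.get("https://given.lv/")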
  • Handle Dynamic Loading: For infinite-scroll or lazy-loaded content, implement a scroll-and-wait loop:
    from selenium.common.exceptions import TimeoutException

    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        try:
            # Wait up to 5s for new content; a timeout means the page stopped growing
            WebDriverWait(driver, 5).until(
                lambda d: d.execute_script("return document.body.scrollHeight") > last_height
            )
        except TimeoutException:
            break
        last_height = driver.execute_script("return document.body.scrollHeight")
    
  • Leverage DevTools Protocol (CDP): Read image URLs straight out of Chrome's network traffic instead of scraping <img> tags, which sidesteps lazy-loading issues. Note that vanilla Selenium has no request-interception callback (that is a selenium-wire feature), so a working approach is to enable performance logging and parse the CDP events it records:
    import json

    # Must be set on the Options object before the driver is created
    options.set_capability("goog:loggingPrefs", {"performance": "ALL"})
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
    driver.get("https://given.lv/")

    # Each performance-log entry is a JSON-encoded DevTools event
    for entry in driver.get_log("performance"):
        event = json.loads(entry["message"])["message"]
        if event["method"] == "Network.responseReceived":
            response = event["params"]["response"]
            if response["mimeType"].startswith("image/"):
                print(f"Found image: {response['url']}")
    
  • Simulate Human Behavior: Add small random delays between actions, avoid rapid navigation, and mimic natural scrolling to reduce the chance of being blocked.
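    A rough sketch of both ideas, assuming arbitrary delay ranges you would tune per site:
    import random
    import time

    def human_pause(lo=1.0, hi=3.0):
        # Random pause so actions don't fire at machine speed
        time.sleep(random.uniform(lo, hi))

    # Scroll in small random steps instead of one jump to the bottom
    for _ in range(10):
        driver.execute_script(f"window.scrollBy(0, {random.randint(200, 600)});")
        human_pause(0.5, 1.5)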
3. Additional Troubleshooting Tips

  • Check robots.txt: Visit https://given.lv/robots.txt to ensure you're allowed to scrape the site. Respect any disallowed paths to avoid legal issues.
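    The check can be automated with the standard library's robotparser; the "/collections" path below is illustrative:
    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser("https://given.lv/robots.txt")
    rp.read()
    print(rp.can_fetch("*", "https://given.lv/collections"))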
  • Try Playwright: If Selenium continues to struggle, consider using Playwright. It has better built-in support for dynamic sites and anti-bot bypassing, with a more intuitive API.
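    A minimal Playwright sketch of the same navigation, reusing the selector guesses from above:
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        page.goto("https://given.lv/")
        page.click("text=Collections")  # Playwright auto-waits before acting
        page.wait_for_selector("a.product-item-link")
        urls = [a.get_attribute("href") for a in page.query_selector_all("a.product-item-link")]
        browser.close()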
  • Use Proxies: If you're getting IP-blocked, rotate proxies to distribute your requests across different IP addresses.
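    With plain Selenium a proxy is set through a Chrome flag; the address below is a placeholder, and rotating means recreating the driver with a fresh address:
    # Placeholder address; substitute a proxy from your pool
    options.add_argument("--proxy-server=http://127.0.0.1:8080")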
  • Inspect Network Requests: Use Chrome DevTools (F12) to check if the site exposes an API that returns product data/image URLs. Directly calling these APIs is far more efficient than scraping the frontend.
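    If such an endpoint exists, plain requests is enough; the URL and field name below are purely hypothetical, copy the real ones from the Network tab:
    import requests

    # HYPOTHETICAL endpoint and field name; replace with what DevTools shows
    resp = requests.get("https://given.lv/api/products", timeout=30)
    for product in resp.json():
        print(product.get("image_url"))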

This question comes from Stack Exchange; the original asker is Samyak Jain.
