Selenium script for scraping given.lv jewelry product images gets stuck on the landing page: how to fix it?

Hey, let's tackle this problem step by step. I've dealt with similar JS-heavy sites with anti-scraping measures, so here's what you can do to make your script work reliably for given.lv:

1. Reliable Navigation for given.lv with Selenium

The main issue is likely anti-bot detection or unhandled page elements (like cookie banners) blocking your script. Here's a modified, robust version of your code:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager
import requests
import os

# Configure Chrome to avoid detection
options = Options()
# Spoof a real user-agent
options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36")
# Disable Selenium's automation flags
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
# Hide navigator.webdriver on every page load; a plain execute_script()
# only patches the current page and is lost after the next navigation
driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
    "source": "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
})

# Navigate to landing page
driver.get("https://given.lv/")

# Handle cookie consent, which can block all interaction; the button text below is a guess, adjust to the real banner
try:
    accept_cookie_btn = WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((By.XPATH, "//button[contains(text(), 'Accept') or contains(text(), 'I agree')]"))
    )
    accept_cookie_btn.click()
except Exception as e:
    print("No cookie popup found or failed to click:", str(e))

# Navigate to collections page (adjust selector to match the site's actual navigation)
try:
    collection_link = WebDriverWait(driver, 15).until(
        EC.element_to_be_clickable((By.LINK_TEXT, "Collections"))
    )
    collection_link.click()
except Exception as e:
    print("Failed to find collections link:", str(e))
    driver.quit()
    exit()

# Wait for product list to load and extract links
# NOTE: "a.product-item-link" is an assumed (Magento-style) selector; verify it in DevTools
product_links = WebDriverWait(driver, 20).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, "a.product-item-link"))
)
product_urls = [link.get_attribute("href") for link in product_links]

# Create folder to save images
os.makedirs("given_lv_jewelry", exist_ok=True)

# Scrape each product page for images
for idx, url in enumerate(product_urls):
    driver.get(url)
    try:
        # Wait for product images to load
        product_images = WebDriverWait(driver, 15).until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, "img.product-image-photo"))
        )
        # Download each image
        for img_idx, img in enumerate(product_images):
            img_src = img.get_attribute("src")
            if img_src:
                img_data = requests.get(img_src, timeout=30).content  # timeout prevents a stalled download from hanging the run
                img_filename = f"given_lv_jewelry/product_{idx+1}_img_{img_idx+1}.jpg"
                with open(img_filename, 'wb') as f:
                    f.write(img_data)
                print(f"Saved: {img_filename}")
    except Exception as e:
        print(f"Failed to scrape product {url}:", str(e))
        continue

driver.quit()

Key Fixes in This Code:

  • Browser Fingerprint Spoofing: Disables Selenium's built-in automation markers and uses a real user-agent to avoid being flagged as a bot.
  • Cookie Consent Handling: Ensures you can interact with the site by accepting cookies first.
  • Explicit Waits: Replaces arbitrary time.sleep() with waits for critical elements, so the script only proceeds when the page is ready.

2. Best Practices for Scraping JS-Heavy Sites

When dealing with sites that rely heavily on JavaScript rendering, follow these guidelines to improve reliability:

  • Prioritize Explicit Waits: Always use WebDriverWait with expected conditions (e.g., EC.element_to_be_clickable, EC.presence_of_all_elements_located) instead of fixed delays. This adapts to varying load times.
  • Use Undetected Chromedriver: For sites with strict anti-bot measures, swap vanilla Selenium for undetected-chromedriver, which patches the ChromeDriver binary so most common bot checks pass out of the box.
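    A minimal sketch (uc.Chrome() is a drop-in replacement for webdriver.Chrome):
    import undetected_chromedriver as uc

    driver = uc.Chrome()
    driver.get("https://given.lv/")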
  • Handle Dynamic Loading: For infinite-scroll or lazy-loaded content, implement a scroll-and-wait loop:
    from selenium.common.exceptions import TimeoutException

    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        try:
            # Wait up to 5s for new content; a timeout means the page stopped growing
            WebDriverWait(driver, 5).until(
                lambda d: d.execute_script("return document.body.scrollHeight") > last_height
            )
        except TimeoutException:
            break
        last_height = driver.execute_script("return document.body.scrollHeight")
    
  • Leverage DevTools Protocol (CDP): Read image URLs straight out of Chrome's network traffic instead of scraping <img> tags, which sidesteps lazy-loading issues. Note that vanilla Selenium has no request-interception callback (that is a selenium-wire feature), so a working approach is to enable performance logging and parse the CDP events it records:
    import json

    # Must be set on the Options object before the driver is created
    options.set_capability("goog:loggingPrefs", {"performance": "ALL"})
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
    driver.get("https://given.lv/")

    # Each performance-log entry is a JSON-encoded DevTools event
    for entry in driver.get_log("performance"):
        event = json.loads(entry["message"])["message"]
        if event["method"] == "Network.responseReceived":
            response = event["params"]["response"]
            if response["mimeType"].startswith("image/"):
                print(f"Found image: {response['url']}")
    
  • Simulate Human Behavior: Add small random delays between actions, avoid rapid navigation, and mimic natural scrolling to reduce the chance of being blocked.
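    A rough sketch of both ideas, assuming arbitrary delay ranges you would tune per site:
    import random
    import time

    def human_pause(lo=1.0, hi=3.0):
        # Random pause so actions don't fire at machine speed
        time.sleep(random.uniform(lo, hi))

    # Scroll in small random steps instead of one jump to the bottom
    for _ in range(10):
        driver.execute_script(f"window.scrollBy(0, {random.randint(200, 600)});")
        human_pause(0.5, 1.5)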
3. Additional Troubleshooting Tips

  • Check robots.txt: Visit https://given.lv/robots.txt to ensure you're allowed to scrape the site. Respect any disallowed paths to avoid legal issues.
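    The check can be automated with the standard library's robotparser; the "/collections" path below is illustrative:
    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser("https://given.lv/robots.txt")
    rp.read()
    print(rp.can_fetch("*", "https://given.lv/collections"))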
  • Try Playwright: If Selenium continues to struggle, consider using Playwright. It has better built-in support for dynamic sites and anti-bot bypassing, with a more intuitive API.
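    A minimal Playwright sketch of the same navigation, reusing the selector guesses from above:
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        page.goto("https://given.lv/")
        page.click("text=Collections")  # Playwright auto-waits before acting
        page.wait_for_selector("a.product-item-link")
        urls = [a.get_attribute("href") for a in page.query_selector_all("a.product-item-link")]
        browser.close()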
  • Use Proxies: If you're getting IP-blocked, rotate proxies to distribute your requests across different IP addresses.
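    With plain Selenium a proxy is set through a Chrome flag; the address below is a placeholder, and rotating means recreating the driver with a fresh address:
    # Placeholder address; substitute a proxy from your pool
    options.add_argument("--proxy-server=http://127.0.0.1:8080")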
  • Inspect Network Requests: Use Chrome DevTools (F12) to check if the site exposes an API that returns product data/image URLs. Directly calling these APIs is far more efficient than scraping the frontend.
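    If such an endpoint exists, plain requests is enough; the URL and field name below are purely hypothetical, copy the real ones from the Network tab:
    import requests

    # HYPOTHETICAL endpoint and field name; replace with what DevTools shows
    resp = requests.get("https://given.lv/api/products", timeout=30)
    for product in resp.json():
        print(product.get("image_url"))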

This question comes from Stack Exchange; the original asker is Samyak Jain.
