Selenium script scraping given.lv jewelry product images gets stuck on the landing page: how to fix it?
Hey, let's tackle this problem step by step. I've dealt with similar JS-heavy sites with anti-scraping measures, so here's what you can do to make your script work reliably for given.lv:
The main issue is likely anti-bot detection or unhandled page elements (like cookie banners) blocking your script. Here's a modified, robust version of your code:
```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager
import requests
import os

# Configure Chrome to avoid detection
options = Options()
# Spoof a real user-agent
options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36")
# Disable Selenium's automation flags
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option("useAutomationExtension", False)

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)

# Override the webdriver property on every new document, so the patch
# survives navigation (a plain execute_script only affects the current page)
driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
    "source": "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
})

# Navigate to landing page
driver.get("https://given.lv/")

# Handle cookie consent (critical for interacting with the site)
try:
    accept_cookie_btn = WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((By.XPATH, "//button[contains(text(), 'Accept') or contains(text(), 'I agree')]"))
    )
    accept_cookie_btn.click()
except Exception as e:
    print("No cookie popup found or failed to click:", str(e))

# Navigate to collections page (adjust selector to match the site's actual navigation)
try:
    collection_link = WebDriverWait(driver, 15).until(
        EC.element_to_be_clickable((By.LINK_TEXT, "Collections"))
    )
    collection_link.click()
except Exception as e:
    print("Failed to find collections link:", str(e))
    driver.quit()
    exit()

# Wait for product list to load and extract links
product_links = WebDriverWait(driver, 20).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, "a.product-item-link"))
)
product_urls = [link.get_attribute("href") for link in product_links]

# Create folder to save images
os.makedirs("given_lv_jewelry", exist_ok=True)

# Scrape each product page for images
for idx, url in enumerate(product_urls):
    driver.get(url)
    try:
        # Wait for product images to load
        product_images = WebDriverWait(driver, 15).until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, "img.product-image-photo"))
        )
        # Download each image
        for img_idx, img in enumerate(product_images):
            img_src = img.get_attribute("src")
            if img_src:
                img_data = requests.get(img_src, timeout=30).content
                img_filename = f"given_lv_jewelry/product_{idx+1}_img_{img_idx+1}.jpg"
                with open(img_filename, "wb") as f:
                    f.write(img_data)
                print(f"Saved: {img_filename}")
    except Exception as e:
        print(f"Failed to scrape product {url}:", str(e))
        continue

driver.quit()
```
1. Key Fixes in This Code
- Browser Fingerprint Spoofing: Disables Selenium's built-in automation markers and uses a real user-agent to avoid being flagged as a bot.
- Cookie Consent Handling: Ensures you can interact with the site by accepting cookies first.
- Explicit Waits: Replaces arbitrary `time.sleep()` calls with waits for critical elements, so the script only proceeds when the page is ready.
2. Best Practices for Scraping JS-Heavy Sites
When dealing with sites that rely heavily on JavaScript rendering, follow these guidelines to improve reliability:
- Prioritize Explicit Waits: Always use `WebDriverWait` with expected conditions (e.g., `EC.element_to_be_clickable`, `EC.presence_of_all_elements_located`) instead of fixed delays. This adapts to varying load times.
- Use Undetected Chromedriver: For sites with strict anti-bot measures, swap vanilla Selenium for `undetected-chromedriver`; it automatically bypasses most detection mechanisms (see the first sketch after this list).
- Handle Dynamic Loading: For infinite-scroll or lazy-loaded content, implement a scroll-and-wait loop:

```python
from selenium.common.exceptions import TimeoutException

last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    try:
        # Wait for new content to load (the page height grows)
        WebDriverWait(driver, 5).until(
            lambda d: d.execute_script("return document.body.scrollHeight") > last_height
        )
    except TimeoutException:
        # Height stopped growing: we've reached the bottom
        break
    last_height = driver.execute_script("return document.body.scrollHeight")
```

- Leverage DevTools Protocol (CDP): Read image URLs straight out of the browser's network traffic instead of the rendered DOM. This is faster and sidesteps lazy-loading issues. Note that vanilla Selenium can't subscribe to CDP interception events (an API like `request_interceptor` belongs to selenium-wire), so the portable approach is to read Chrome's performance log:

```python
import json

# Enable CDP network logging via Chrome's performance log
# (set this capability before creating the driver)
options.set_capability("goog:loggingPrefs", {"performance": "ALL"})
driver = webdriver.Chrome(options=options)

driver.get("https://given.lv/")

# Each performance-log entry is a JSON-encoded CDP event
image_urls = set()
for entry in driver.get_log("performance"):
    event = json.loads(entry["message"])["message"]
    if event["method"] == "Network.responseReceived":
        response = event["params"]["response"]
        if response["mimeType"].startswith("image/"):
            image_urls.add(response["url"])
            print(f"Found image: {response['url']}")
```

- Simulate Human Behavior: Add small random delays between actions, avoid rapid navigation, and mimic natural scrolling to reduce the chance of being blocked (see the second sketch after this list).
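Here's a minimal sketch of the undetected-chromedriver swap mentioned above (assuming `pip install undetected-chromedriver`; it's largely a drop-in replacement for the vanilla driver):

```python
import undetected_chromedriver as uc

options = uc.ChromeOptions()
options.add_argument("--window-size=1280,900")

# uc.Chrome() downloads and patches a matching chromedriver automatically,
# removing fingerprints that manual option tweaks only partially hide
driver = uc.Chrome(options=options)
driver.get("https://given.lv/")
print(driver.title)
driver.quit()
```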
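And the human-pacing idea can be as simple as randomized pauses plus incremental scrolling between actions. A minimal sketch follows; the delay ranges are arbitrary values I picked, not site-specific tuning:

```python
import random
import time

def human_pause(low=1.0, high=3.0):
    # Sleep for a random interval so requests don't arrive at machine-regular pace
    time.sleep(random.uniform(low, high))

def human_scroll(driver, steps=6):
    # Scroll down in small, uneven increments instead of one jump to the bottom
    for _ in range(steps):
        driver.execute_script("window.scrollBy(0, arguments[0]);", random.randint(200, 600))
        human_pause(0.3, 1.2)

# Usage between page loads:
# driver.get(url)
# human_scroll(driver)
# human_pause()
```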
3. Additional Troubleshooting Tips
- Check `robots.txt`: Visit https://given.lv/robots.txt to confirm you're allowed to scrape the site, and respect any disallowed paths to avoid legal issues.
- Try Playwright: If Selenium continues to struggle, consider Playwright. It has better built-in support for dynamic sites and anti-bot bypassing, with a more intuitive API (see the first sketch after this list).
- Use Proxies: If you're getting IP-blocked, rotate proxies to distribute your requests across different IP addresses (see the second sketch below).
- Inspect Network Requests: Use Chrome DevTools (F12) to check if the site exposes an API that returns product data/image URLs. Directly calling these APIs is far more efficient than scraping the frontend.
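If you want to try the Playwright route, here's a minimal sketch (assuming `pip install playwright` plus `playwright install chromium`; the `img` selector is a placeholder to adapt to the site's actual product markup):

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    page = browser.new_page(
        user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/118.0.0.0 Safari/537.36"
    )
    # Playwright auto-waits for elements, so no explicit WebDriverWait is needed
    page.goto("https://given.lv/", wait_until="networkidle")
    # "img" is a placeholder selector; narrow it to the product image class
    for img in page.locator("img").all():
        src = img.get_attribute("src")
        if src:
            print(src)
    browser.close()
```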
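And a minimal sketch of proxy rotation with vanilla Selenium (the proxy addresses below are placeholders for ones from your provider; Chrome's `--proxy-server` flag routes all browser traffic through the given proxy):

```python
import random
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Placeholder proxies; substitute real ones from your provider
PROXIES = [
    "203.0.113.10:8080",
    "203.0.113.11:8080",
    "203.0.113.12:8080",
]

def make_driver_with_proxy():
    options = Options()
    # Route all browser traffic through a randomly chosen proxy
    options.add_argument(f"--proxy-server=http://{random.choice(PROXIES)}")
    return webdriver.Chrome(options=options)

# Recreate the driver with a fresh proxy whenever you start seeing blocks
driver = make_driver_with_proxy()
```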
The question above comes from Stack Exchange; original asker: Samyak Jain.