How to Get Product SKUs and Prices from a Website with Beautiful Soup and Store Them in a DataFrame

Hey there! I feel your pain: spending hours stuck on a web scraping problem is frustrating. Let’s break this down step by step to get the prices, SKUs, and a clean DataFrame sorted out.

1. Extracting Price & SKU from a Single Product Page

First, let’s nail the core data extraction for one product.

Price Extraction

Your price lives in a <strong class="price" data-product="price"> tag, which BeautifulSoup can locate with either find() or a CSS selector passed to select_one().
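As a quick illustration of both approaches, here is a minimal sketch run against an inline HTML snippet. The surrounding markup and the price value are made up for the example; only the tag name and its attributes come from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the tag described above; the price is invented.
html = '<div><strong class="price" data-product="price">$499.99</strong></div>'
soup = BeautifulSoup(html, "html.parser")

# Option 1: find() with a class filter and an attribute filter
via_find = soup.find("strong", class_="price", attrs={"data-product": "price"})

# Option 2: the equivalent CSS selector
via_css = soup.select_one('strong.price[data-product="price"]')

price_find = via_find.get_text(strip=True)
price_css = via_css.get_text(strip=True)
print(price_find, price_css)
```

Both calls return the same element here, so the choice is mostly a matter of taste. Either one returns None when nothing matches, which is why the function further down checks the element before reading its text.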

SKU Extraction

The SKU is tucked in a JSON snippet ("productSKU":"200341"). Many sites keep this kind of data in an inline script tag (either JSON-LD or a product data variable), so we’ll search the script tags and parse out the SKU.
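To see the idea in isolation, the sketch below pulls the SKU out of an inline script by parsing the embedded object as JSON, as an alternative to a plain regex. The `productData` variable name and the extra `productName` field are assumptions for illustration; only the "productSKU":"200341" pair comes from the question, and this only works when the embedded object is valid JSON.

```python
import json
import re
from bs4 import BeautifulSoup

# Hypothetical inline script; the real page's variable name may differ.
html = """
<script>var productData = {"productName": "Example", "productSKU": "200341"};</script>
"""
soup = BeautifulSoup(html, "html.parser")

sku = None
for script in soup.find_all("script"):
    text = script.string or ""
    # Take everything from the first "{" to the last "}" and try to parse it
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if not match:
        continue
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        continue  # Not valid JSON (e.g. a plain JavaScript object literal)
    if "productSKU" in data:
        sku = data["productSKU"]
        break
print(sku)
```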

Here’s a function to handle this:

import requests
from bs4 import BeautifulSoup
import json
import pandas as pd
import re
import time

def get_product_details(product_url):
    # Add a user-agent to avoid being blocked (replace with your own if needed)
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"}
    
    try:
        response = requests.get(product_url, headers=headers, timeout=10)  # Timeout so a stalled request can't hang the run
        response.raise_for_status()  # Raise error for bad status codes
        soup = BeautifulSoup(response.text, "html.parser")
        
        # Extract price
        price_element = soup.find("strong", class_="price", attrs={"data-product": "price"})
        price = price_element.get_text(strip=True) if price_element else "N/A"
        
        # Extract SKU
        sku = "N/A"
        # Look through all script tags for the SKU
        for script in soup.find_all("script"):
            text = script.string
            if not text:
                continue
            # Try regex first; allow optional whitespace around the colon
            sku_match = re.search(r'"productSKU"\s*:\s*"([^"]+)"', text)
            if sku_match:
                sku = sku_match.group(1)
                break
            # Fall back to JSON-LD, which typically has no "productSKU" key
            if script.get("type") == "application/ld+json":
                try:
                    product_json = json.loads(text)
                except json.JSONDecodeError:
                    continue
                # schema.org Product data uses "sku"; "productID" is a second guess
                if isinstance(product_json, dict):
                    found = product_json.get("sku") or product_json.get("productID")
                    if found:
                        sku = str(found)
                        break
        
        return {"sku": sku, "price": price, "url": product_url}
    
    except Exception as e:
        print(f"Error scraping {product_url}: {str(e)}")
        return {"sku": "N/A", "price": "N/A", "url": product_url}

2. Scrape All "Acer" Search Pages

Next, we need to loop through all search result pages for "acer", grab each product’s link, and extract its details. You’ll need to adjust the selectors to match the actual structure of the search results page (e.g., product links, pagination indicators).

def scrape_all_acer_products(base_search_url):
    all_products = []
    page = 1
    
    while True:
        # Build the paginated URL (adjust the query param if your site uses a different format, like ?page=)
        paginated_url = f"{base_search_url}&page={page}"
        headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"}
        
        print(f"Scraping search page {page}: {paginated_url}")
        response = requests.get(paginated_url, headers=headers, timeout=10)
        
        # Stop if we hit a non-200 status or no products
        if response.status_code != 200:
            print("No more pages or access blocked. Stopping.")
            break
        
        soup = BeautifulSoup(response.text, "html.parser")
        # Find all product links (replace the selector with your site's actual product link class/attribute)
        product_links = soup.find_all("a", class_="product-item-link")
        
        if not product_links:
            print("No more products found. Stopping.")
            break
        
        # Scrape each product
        for link in product_links:
            product_url = link.get("href")
            if not product_url:
                continue  # Skip anchors that have no href
            # Fix relative URLs
            if not product_url.startswith("http"):
                product_url = f"https://your-site-domain.com{product_url}"  # Replace with actual domain
            
            product_data = get_product_details(product_url)
            all_products.append(product_data)
            # Add a small delay to be polite to the server
            time.sleep(1)
        
        page += 1
    
    # Convert to DataFrame
    return pd.DataFrame(all_products)
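One way to avoid hard-coding the domain when fixing relative links is `urllib.parse.urljoin` from the standard library, which resolves an href against the page it was found on. A minimal sketch, using the placeholder domain from the code above and made-up product paths:

```python
from urllib.parse import urljoin

# Placeholder search page URL; the hrefs below are invented examples.
page_url = "https://your-site-domain.com/search?q=acer&page=2"

hrefs = [
    "/products/acer-laptop",                          # root-relative
    "acer-monitor",                                   # relative to the current path
    "https://your-site-domain.com/products/acer-pc",  # already absolute
]

# urljoin leaves absolute URLs alone and resolves the rest against page_url
absolute_urls = [urljoin(page_url, href) for href in hrefs]
for url in absolute_urls:
    print(url)
```

In the loop above, this would mean calling urljoin(paginated_url, link["href"]) in place of the startswith("http") check.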

3. Run the Scraper & Save Results

Finally, plug in your actual search URL and run the code:

# Replace with your site's actual search URL for "acer"
base_search_url = "https://your-site-domain.com/search?q=acer"

# Scrape all products
acer_df = scrape_all_acer_products(base_search_url)

# Check the results
print(acer_df.head())

# Save to CSV (or Excel, if you prefer)
acer_df.to_csv("acer_products_prices_skus.csv", index=False)
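The price column comes back as text (with a currency symbol, or "N/A" when nothing was found), so it may help to convert it to numbers before doing any analysis. A rough sketch on made-up rows, assuming prices use a dot as the decimal separator and commas only as thousands separators:

```python
import pandas as pd

# Invented sample rows in the same shape the scraper returns
df = pd.DataFrame([
    {"sku": "200341", "price": "$1,299.00", "url": "https://your-site-domain.com/p/1"},
    {"sku": "N/A", "price": "N/A", "url": "https://your-site-domain.com/p/2"},
])

# Strip everything except digits and the decimal point
cleaned = df["price"].str.replace(r"[^\d.]", "", regex=True)

# Convert to floats; anything that can't be parsed (like the emptied "N/A") becomes NaN
df["price_value"] = pd.to_numeric(cleaned, errors="coerce")
print(df[["sku", "price", "price_value"]])
```

Keeping the original price column alongside price_value makes it easy to spot rows where the cleanup guessed wrong.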

Quick Notes to Avoid Headaches:

  • Adjust Selectors: The class names (like product-item-link) are examples, so inspect the search results page to find the correct ones for your target site.
  • Anti-Scraping Measures: Add a User-Agent header (like in the code) and small delays (time.sleep()) to avoid getting blocked. Some sites might require proxies if you’re scraping many pages.
  • Error Handling: The code includes basic error handling, but you can expand it to handle edge cases like missing prices/SKUs.
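For the blocking and error-handling points above, one option is a shared `requests.Session` with automatic retries, so transient failures and rate-limit responses are retried with a backoff instead of ending the run. This is a sketch, not something the code above does; the retry count, the status codes, and the helper name `make_session` are arbitrary choices for illustration.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session(user_agent, total_retries=3, backoff=1.0):
    """Build a Session that sends a fixed User-Agent and retries transient errors."""
    retry = Retry(
        total=total_retries,
        backoff_factor=backoff,  # Delay grows between successive attempts
        status_forcelist=[429, 500, 502, 503, 504],
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session

session = make_session("Mozilla/5.0 (compatible; example-scraper)")
# Then call session.get(url, timeout=10) wherever the code uses requests.get(...)
```

A session also reuses connections between requests, which tends to be a little faster when fetching many pages from the same host.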

The question in this post comes from Stack Exchange; it was asked by SJEL.
