How to Use Beautiful Soup to Get Product SKUs and Prices from a Website and Store Them in a DataFrame
Hey there! I feel your pain: spending hours stuck on a web scraping problem can be really frustrating. Let's break this down step by step to get the prices and SKUs into a clean DataFrame.
1. Extracting Price & SKU from a Single Product Page
First, let’s nail the core data extraction for one product.
Price Extraction
Your price lives in a `<strong class="price" data-product="price">` tag. BeautifulSoup makes this straightforward with either `find()` or a CSS selector.
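To make the two options concrete, here is a minimal sketch that runs against an inline HTML snippet instead of a live page. The tag and attributes come from your question; the `$499.99` value and the wrapping `<div>` are made up for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup: only the <strong> tag mirrors the real page; the price value is invented.
html = '<div><strong class="price" data-product="price"> $499.99 </strong></div>'
soup = BeautifulSoup(html, "html.parser")

# Option 1: find() with the tag name, class, and data attribute.
price_via_find = soup.find("strong", class_="price", attrs={"data-product": "price"})

# Option 2: the same lookup written as a CSS selector.
price_via_css = soup.select_one('strong.price[data-product="price"]')

# get_text(strip=True) trims the surrounding whitespace.
print(price_via_find.get_text(strip=True))
print(price_via_css.get_text(strip=True))
```

Both calls return `None` when nothing matches, so check the result before calling `get_text()` on a real page.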
SKU Extraction
The SKU is tucked in a JSON snippet (`"productSKU":"200341"`). Sites often keep this kind of data in an inline script tag (either JSON-LD or a product data variable), so we'll hunt for that script and parse out the SKU.
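As a quick illustration of the regex approach, the sketch below pulls the SKU out of a script body. Only the `"productSKU":"200341"` pair comes from your page; the surrounding `productData` variable and the `extract_sku` helper name are hypothetical.

```python
import re

# Allow optional whitespace around the colon, and any characters except a quote in the value.
SKU_PATTERN = re.compile(r'"productSKU"\s*:\s*"([^"]+)"')


def extract_sku(script_text):
    """Return the productSKU value found in script_text, or None if absent."""
    match = SKU_PATTERN.search(script_text)
    return match.group(1) if match else None


# Hypothetical script body; only the productSKU pair mirrors the real page.
sample = 'var productData = {"productName": "Acer Aspire 5", "productSKU":"200341"};'
print(extract_sku(sample))
print(extract_sku("var unrelated = {};"))
```

Matching `[^"]+` instead of `\w+` keeps SKUs that contain hyphens or dots from being cut short.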
Here’s a function to handle this:
```python
import json
import re
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup


def get_product_details(product_url):
    # Send a browser-like User-Agent to reduce the chance of being blocked (replace with your own if needed)
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"}
    try:
        # A timeout keeps one slow page from hanging the whole run
        response = requests.get(product_url, headers=headers, timeout=10)
        response.raise_for_status()  # Raise an error for 4xx/5xx status codes
        soup = BeautifulSoup(response.text, "html.parser")

        # Extract the price
        price_element = soup.find("strong", class_="price", attrs={"data-product": "price"})
        price = price_element.get_text(strip=True) if price_element else "N/A"

        # Extract the SKU by looking through the inline script tags
        sku = "N/A"
        for script in soup.find_all("script"):
            if not script.string:
                continue
            # Try a regex first for quick extraction
            sku_match = re.search(r'"productSKU"\s*:\s*"([^"]+)"', script.string)
            if sku_match:
                sku = sku_match.group(1)
                break
            # Fall back to JSON-LD product data, if the page has any
            if script.get("type") == "application/ld+json":
                try:
                    product_json = json.loads(script.string)
                except json.JSONDecodeError:
                    continue
                if isinstance(product_json, dict):
                    found = product_json.get("sku") or product_json.get("productID")
                    if found:
                        sku = str(found)
                        break

        return {"sku": sku, "price": price, "url": product_url}
    except Exception as e:
        print(f"Error scraping {product_url}: {e}")
        return {"sku": "N/A", "price": "N/A", "url": product_url}
```
2. Scrape All "Acer" Search Pages
Next, we need to loop through all search result pages for "acer", grab each product’s link, and extract its details. You’ll need to adjust the selectors to match the actual structure of the search results page (e.g., product links, pagination indicators).
```python
from urllib.parse import urljoin


def scrape_all_acer_products(base_search_url):
    all_products = []
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"}
    page = 1
    while True:
        # Build the paginated URL (adjust the query param if your site uses a different format, like ?page=)
        paginated_url = f"{base_search_url}&page={page}"
        print(f"Scraping search page {page}: {paginated_url}")
        response = requests.get(paginated_url, headers=headers, timeout=10)

        # Stop if we hit a non-200 status
        if response.status_code != 200:
            print("No more pages or access blocked. Stopping.")
            break

        soup = BeautifulSoup(response.text, "html.parser")
        # Find all product links (replace the selector with your site's actual product link class/attribute)
        product_links = soup.find_all("a", class_="product-item-link", href=True)
        if not product_links:
            print("No more products found. Stopping.")
            break

        # Scrape each product
        for link in product_links:
            # urljoin resolves relative links against the current page and leaves absolute links unchanged
            product_url = urljoin(paginated_url, link["href"])
            all_products.append(get_product_details(product_url))
            # Add a small delay to be polite to the server
            time.sleep(1)

        page += 1

    # Convert to DataFrame
    return pd.DataFrame(all_products)
```
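If your search URL doesn't already contain a query string, appending `&page=` by hand produces a broken URL. One possible alternative, sketched here with the standard library only, is to set the parameter through `urllib.parse`. The helper names `with_page` and `absolute_link`, the `page` parameter name, and the example paths are all assumptions, so match them to your site.

```python
from urllib.parse import parse_qsl, urlencode, urljoin, urlsplit, urlunsplit


def with_page(search_url, page):
    """Return search_url with its `page` query parameter set to the given number."""
    parts = urlsplit(search_url)
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query["page"] = str(page)  # Assumed parameter name; some sites use `p` or an offset instead.
    return urlunsplit(parts._replace(query=urlencode(query)))


def absolute_link(page_url, href):
    """Resolve a possibly relative href against the page it was found on."""
    return urljoin(page_url, href)


print(with_page("https://your-site-domain.com/search?q=acer", 2))
print(with_page("https://your-site-domain.com/search", 2))
print(absolute_link("https://your-site-domain.com/search?q=acer", "/acer-aspire-5.html"))
```

The same helpers work whether or not the base URL already has a `page` value, since the parameter is overwritten rather than appended.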
3. Run the Scraper & Save Results
Finally, plug in your actual search URL and run the code:
```python
# Replace with your site's actual search URL for "acer"
base_search_url = "https://your-site-domain.com/search?q=acer"

# Scrape all products
acer_df = scrape_all_acer_products(base_search_url)

# Check the results
print(acer_df.head())

# Save to CSV (or Excel, if you prefer)
acer_df.to_csv("acer_products_prices_skus.csv", index=False)
```
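The scraper stores prices as text (for example with a currency symbol, or `"N/A"` when missing). If you want to sort or filter by price, a numeric column helps. This is a rough sketch on made-up rows; it assumes prices use a period as the decimal separator and commas only as thousands separators, which may not hold for your site. The `price_value` column name is my own.

```python
import pandas as pd

# Made-up rows in the same shape the scraper returns.
df = pd.DataFrame([
    {"sku": "200341", "price": "$1,299.00", "url": "https://your-site-domain.com/a"},
    {"sku": "N/A", "price": "N/A", "url": "https://your-site-domain.com/b"},
])

# Drop everything except digits and the decimal point, then convert.
# Values that cannot be parsed (such as "N/A") become NaN instead of raising.
cleaned = df["price"].str.replace(r"[^\d.]", "", regex=True)
df["price_value"] = pd.to_numeric(cleaned, errors="coerce")

print(df[["sku", "price", "price_value"]])
```

Keeping the original `price` text next to the numeric column makes it easy to spot rows where the conversion went wrong.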
Quick Notes to Avoid Headaches:
- Adjust Selectors: The class names (like `product-item-link`) are examples. You'll need to inspect the search results page to find the correct ones for your target site.
- Anti-Scraping Measures: Add a `User-Agent` header (like in the code) and small delays (`time.sleep()`) to avoid getting blocked. Some sites might require proxies if you're scraping many pages.
- Error Handling: The code includes basic error handling, but you can expand it to handle edge cases like missing prices/SKUs.
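One way to expand the error handling is to reuse a single `requests.Session` that retries transient failures with a growing delay. The sketch below is one possible setup, not something your site necessarily needs; the `make_session` name, the retry count, the status codes, and the example User-Agent string are all assumptions to tune.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_session(user_agent):
    """Build a session that sends a fixed User-Agent and retries transient errors."""
    # Retry up to 3 times on rate limiting and server errors, waiting longer each time.
    retry = Retry(total=3, backoff_factor=1, status_forcelist=[429, 500, 502, 503, 504])
    adapter = HTTPAdapter(max_retries=retry)

    session = requests.Session()
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    session.headers.update({"User-Agent": user_agent})
    return session


# Example User-Agent string; replace with your own.
session = make_session("Mozilla/5.0 (compatible; example-scraper)")
```

With this in place, `session.get(url, timeout=10)` can stand in for `requests.get(url, headers=headers)` in both functions.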
This question originates from Stack Exchange; asked by SJEL.




