Help: How do I scrape the ingredient blocks with class "ingredients" from a specific recipe site?
How to Scrape Ingredients from the koket.se Recipe Page
Hey there! I’ve dealt with tricky recipe site scrapes before—let’s get those ingredients pulled for your collection.
First, let’s break down why targeting those article.ingredients blocks might be failing:
- Basic requests get blocked: Many sites flag unmodified requests calls as bots, so you need to mimic a real browser.
- Dynamic content loading: The ingredients might load after the initial page via JavaScript, meaning static scrapers won’t pick them up right away.
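To tell which of the two cases applies, one option is to count the article.ingredients elements in the raw HTML before trying anything heavier. The sketch below uses only the standard library's html.parser. The helper name count_ingredient_blocks and both sample HTML strings are made up for illustration and stand in for response.text from a real request.

```python
from html.parser import HTMLParser


class IngredientBlockCounter(HTMLParser):
    """Counts <article> start tags whose class list includes 'ingredients'."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != "article":
            return
        # attrs is a list of (name, value) pairs; a value can be None
        classes = (dict(attrs).get("class") or "").split()
        if "ingredients" in classes:
            self.count += 1


def count_ingredient_blocks(html: str) -> int:
    """Return how many article.ingredients elements appear in the given HTML."""
    parser = IngredientBlockCounter()
    parser.feed(html)
    parser.close()
    return parser.count


# Hypothetical snippets standing in for response.text
static_html = '<article class="ingredients box"><li>2 tomater</li></article>'
js_only_html = '<div id="app"></div><script src="bundle.js"></script>'

print(count_ingredient_blocks(static_html))   # 1
print(count_ingredient_blocks(js_only_html))  # 0
```

A count of zero on the real page suggests the blocks are rendered by JavaScript (or the request was blocked), which points toward the Selenium approach. A non-zero count means static scraping should be enough.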
Solution 1: Static Scraping with Proper Headers
If the ingredients are in the raw HTML (sometimes they are, even if it feels dynamic), adding a user-agent header can bypass basic bot detection. Here’s a Python example using requests and BeautifulSoup:
```python
import requests
from bs4 import BeautifulSoup

url = "https://www.koket.se/halloumigryta-med-tomat-linser-och-chili"

# Mimic a Chrome browser request
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36"
}

response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, "html.parser")

# Target both article blocks with the 'ingredients' class
ingredients_blocks = soup.find_all("article", class_="ingredients")

# Extract and print the ingredients from each block
for idx, block in enumerate(ingredients_blocks, 1):
    print(f"Ingredients Block {idx}:\n")
    # Clean up the text to make it readable
    clean_ingredients = block.get_text(strip=True, separator="\n")
    print(clean_ingredients + "\n")
```
Solution 2: Dynamic Scraping with Selenium (for JS-Loaded Content)
If the ingredients don’t show up in the raw HTML, you’ll need to let JavaScript finish loading the page. Selenium mimics a real browser to handle this:
```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = "https://www.koket.se/halloumigryta-med-tomat-linser-och-chili"

# Initialize Chrome driver (make sure you have ChromeDriver installed)
driver = webdriver.Chrome()
driver.get(url)

# Wait up to 10 seconds for the ingredients blocks to load
wait = WebDriverWait(driver, 10)
ingredients_blocks = wait.until(
    EC.presence_of_all_elements_located((By.CLASS_NAME, "ingredients"))
)

# Extract and print the content
for idx, block in enumerate(ingredients_blocks, 1):
    print(f"Ingredients Block {idx}:\n")
    print(block.text.strip() + "\n")

driver.quit()
```
Quick Tips:
- Always check the site’s robots.txt (you can find it at https://www.koket.se/robots.txt) to see which paths automated clients are allowed to fetch before you scrape, even for personal use.
- Add small delays (time.sleep(2)) between requests if you scrape multiple pages to avoid getting blocked.
- If the class name ever changes, re-inspect the element in your browser’s dev tools to update your selector.
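The first two tips can be combined into one small helper. This is only a sketch under stated assumptions: the robots.txt rules below are an invented sample (not the actual contents of koket.se's file), the URLs other than the recipe page are placeholders, and fetch_politely and build_robot_rules are hypothetical names. The fetch function is passed in so the loop can be tried without touching the network.

```python
import time
from urllib.robotparser import RobotFileParser


def build_robot_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def fetch_politely(urls, fetch, rules, user_agent="*",
                   delay_seconds=2.0, sleep=time.sleep):
    """Fetch only the URLs robots.txt allows, pausing between requests."""
    allowed = [u for u in urls if rules.can_fetch(user_agent, u)]
    results = {}
    for i, url in enumerate(allowed):
        if i > 0:
            sleep(delay_seconds)  # pause before every request after the first
        results[url] = fetch(url)
    return results


# Invented sample rules, NOT the real koket.se robots.txt
sample_rules = build_robot_rules("User-agent: *\nDisallow: /admin/\n")

pages = [
    "https://www.koket.se/halloumigryta-med-tomat-linser-och-chili",
    "https://www.koket.se/admin/placeholder",  # placeholder path, skipped by the sample rules
]

# Stand-in fetch function; short delay only to keep the demo quick
fetched = fetch_politely(pages, fetch=lambda u: f"<html for {u}>",
                         rules=sample_rules, delay_seconds=0.1)
print(list(fetched))
```

In real use, fetch would wrap requests.get with the headers from Solution 1, and the rules would come from the text of the site's own robots.txt.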
The question comes from Stack Exchange; asked by hejseb.




