Help: How do I scrape the ingredient blocks with class "ingredients" from a specific recipe site?

阿华AIGC实验室

2026-5-20

How to Scrape Ingredients from the koket.se Recipe Page

Hey there! I’ve dealt with tricky recipe site scrapes before, so let’s get those ingredients pulled for your collection.

First, let’s break down why targeting those article.ingredients blocks might be failing:

Basic requests get blocked: By default, requests identifies itself with a python-requests User-Agent, which many sites flag as a bot, so you need to send browser-like headers.
Dynamic content loading: The ingredients might load after the initial page via JavaScript, meaning static scrapers won’t pick them up right away.
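
One quick way to tell which of the two causes applies is to check whether the raw HTML already contains an article with the ingredients class. Here is a minimal sketch of that check. It uses only the standard library's html.parser rather than BeautifulSoup, and the sample markup and the names IngredientsBlockCounter and count_ingredients_blocks are made up for illustration, not taken from koket.se.

```python
from html.parser import HTMLParser


class IngredientsBlockCounter(HTMLParser):
    """Counts <article> start tags whose class list contains 'ingredients'."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != "article":
            return
        # attrs is a list of (name, value) pairs; value is None for bare attributes
        classes = (dict(attrs).get("class") or "").split()
        if "ingredients" in classes:
            self.count += 1


def count_ingredients_blocks(html: str) -> int:
    """Return how many article.ingredients blocks appear in the given HTML text."""
    parser = IngredientsBlockCounter()
    parser.feed(html)
    parser.close()
    return parser.count


# Hypothetical sample markup (assumption): one page with the block in the
# raw HTML, one page that only ships a JavaScript bundle.
sample_static = '<article class="ingredients box"><li>250 g halloumi</li></article>'
sample_dynamic = '<div id="app"></div><script src="bundle.js"></script>'

print(count_ingredients_blocks(sample_static))
print(count_ingredients_blocks(sample_dynamic))
```

In practice you would pass in the HTML text returned by your HTTP request. A count of zero suggests the blocks are either injected by JavaScript or the request was served a different page than your browser sees.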

Solution 1: Static Scraping with Proper Headers

If the ingredients are in the raw HTML (sometimes they are, even if it feels dynamic), adding a user-agent header can bypass basic bot detection. Here’s a Python example using requests and BeautifulSoup:

import requests
from bs4 import BeautifulSoup

url = "https://www.koket.se/halloumigryta-med-tomat-linser-och-chili"
# Mimic a Chrome browser request
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36"
}

# Set a timeout and stop early on HTTP errors such as 403 or 404
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

# Find every <article> whose class list includes 'ingredients'
ingredients_blocks = soup.find_all("article", class_="ingredients")
if not ingredients_blocks:
    print("No article.ingredients blocks in the raw HTML; the content may be loaded by JavaScript.")

# Extract and print the ingredients from each block
for idx, block in enumerate(ingredients_blocks, 1):
    print(f"Ingredients Block {idx}:\n")
    # Put each text fragment on its own line so the list is readable
    clean_ingredients = block.get_text(strip=True, separator="\n")
    print(clean_ingredients + "\n")

Solution 2: Dynamic Scraping with Selenium (for JS-Loaded Content)

If the ingredients don’t show up in the raw HTML, you’ll need to let JavaScript finish loading the page. Selenium mimics a real browser to handle this:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = "https://www.koket.se/halloumigryta-med-tomat-linser-och-chili"

# Initialize Chrome driver (Selenium 4.6+ can download a matching ChromeDriver
# automatically; older versions need ChromeDriver installed on your PATH)
driver = webdriver.Chrome()
try:
    driver.get(url)

    # Wait up to 10 seconds for the article.ingredients blocks to appear
    wait = WebDriverWait(driver, 10)
    ingredients_blocks = wait.until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, "article.ingredients"))
    )

    # Extract and print the content
    for idx, block in enumerate(ingredients_blocks, 1):
        print(f"Ingredients Block {idx}:\n")
        print(block.text.strip() + "\n")
finally:
    # Close the browser even if the wait times out
    driver.quit()

Quick Tips:

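Always check the site’s robots.txt (you can find it at https://www.koket.se/robots.txt) and its terms of use before scraping. robots.txt lists the paths crawlers are asked to avoid; it does not by itself grant permission for personal-use scraping.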
Add small delays (time.sleep(2)) between requests if you scrape multiple pages to avoid getting blocked.
If the class name ever changes, re-inspect the element in your browser’s dev tools to update your selector.
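
The robots.txt and delay tips above can be combined into one small helper. This is only a sketch under stated assumptions: fetch_politely and build_robots_checker are hypothetical names, the robots.txt rules and example.com URLs are invented for illustration, and the fetch argument stands in for whatever download function you actually use (for instance a wrapper around requests.get).

```python
import time
from urllib.robotparser import RobotFileParser


def build_robots_checker(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that you have already downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def fetch_politely(urls, fetch, robots, user_agent="*",
                   delay_seconds=2.0, sleep=time.sleep):
    """Fetch only the URLs robots.txt allows, pausing between requests."""
    results = {}
    allowed = [u for u in urls if robots.can_fetch(user_agent, u)]
    for i, url in enumerate(allowed):
        if i > 0:
            # Pause before every request except the first
            sleep(delay_seconds)
        results[url] = fetch(url)
    return results


# Hypothetical rules and URLs (assumptions, not koket.se's real robots.txt)
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""
SAMPLE_URLS = [
    "https://example.com/recipe-a",
    "https://example.com/private/drafts",
    "https://example.com/recipe-b",
]

robots = build_robots_checker(SAMPLE_ROBOTS)
# Stub fetch so the demo makes no network calls; short delay keeps it quick
pages = fetch_politely(SAMPLE_URLS, fetch=lambda u: f"<html for {u}>",
                       robots=robots, delay_seconds=0.1)
for page_url in pages:
    print("fetched:", page_url)
```

Passing sleep and fetch as parameters keeps the helper easy to try out without touching the network; in a real run you would leave sleep at its default and keep the 2-second delay.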

The question in this post comes from Stack Exchange; it was asked by hejseb.