How do I avoid duplicate data when scraping Foodpanda Singapore's infinite-scroll page?

Hey there! Let's fix that duplicate data issue you're facing with your Foodpanda Singapore scraper. The core problem here is that every time you scroll and fetch restaurants, you're reprocessing the same ones you already saved. Here's how to fix this step by step:

1. Track Scraped Restaurants with Unique IDs

Foodpanda assigns a unique data-testid attribute to each restaurant list item (you can see this in your existing locator). We'll use this attribute to keep track of which restaurants we've already processed, so we don't write duplicates to your CSV.

Add this global set at the top of your code to store processed IDs:

scraped_ids = set()
header_added = False  # Keep your existing header flag
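
To see the idea in isolation, here is a minimal sketch of set-based deduplication with no Selenium involved. The filter_new helper and the sample IDs are made up for illustration; they are not part of your scraper or of Foodpanda's markup.

```python
def filter_new(ids, seen):
    """Return the IDs not yet in `seen` (in order) and record them as seen."""
    fresh = []
    for item_id in ids:
        if item_id in seen:
            continue  # already handled in an earlier pass
        seen.add(item_id)
        fresh.append(item_id)
    return fresh


seen = set()
# First pass: everything on the page is new.
first_batch = filter_new(["vendor-a1", "vendor-b2"], seen)
# After scrolling, the page returns the old items plus one new one.
second_batch = filter_new(["vendor-a1", "vendor-b2", "vendor-c3"], seen)
print(first_batch, second_batch)
```

Membership checks on a set take constant time on average, so this stays cheap even when the list grows to thousands of restaurants.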

2. Modify Data Collection to Skip Duplicates

Update your get_data() function to check if a restaurant has already been scraped before processing it. This ensures only new entries get written to your CSV:

def get_data(rests):
    global header_added, scraped_ids
    for rest in rests:
        # Grab the unique ID of the restaurant
        rest_id = rest.get_attribute('data-testid')
        if rest_id in scraped_ids:
            continue  # Skip if we've already saved this restaurant
        scraped_ids.add(rest_id)  # Mark as processed
        
        # Your existing extraction logic, with two small changes:
        # find_element('xpath', ...) replaces find_element_by_xpath (removed in
        # Selenium 4.3), and bare `except:` becomes `except Exception:` so that
        # Ctrl+C can still interrupt the run.
        try:
            name = rest.find_element('xpath', './/span[@class="name fn"]').text
        except Exception:
            name = 'No name'
        try:
            link_a = rest.find_element('xpath', './/a')
            link = link_a.get_attribute('href')
        except Exception:
            link = 'No link available'
        try:
            rating = rest.find_element('xpath', './/span[@class="rating"]').text
            rating = rating[:-2]
        except Exception:
            rating = 'No Ratings Available'
        try:
            cuisine = rest.find_element('xpath', './/ul[@class="categories summary"]').text
            cuisine = cuisine[4:]
        except Exception:
            cuisine = 'Cuisine Details Not Available'
        try:
            distance = rest.find_element('xpath', './/span[@class="badge-info"]').text
        except Exception:
            distance = "No Distance available"
        try:
            tags = rest.find_element('xpath', './/div[@class="tag-container"]').text
        except Exception:
            tags = "No special Offers"
        try:
            cashback = rest.find_element('xpath', './/span[@class="vendor-cashback-info"]').text
        except Exception:
            cashback = "No Cashback available"
        
        # Write to CSV only if it's a new restaurant
        dict1 = {'Restaurant Name': name, "Link": link, "Rating": rating, "Cuisine": cuisine, "Delivery Time": distance, "Tags": tags, "Cashback": cashback}
        # newline='' stops the csv module from writing blank rows on Windows
        with open('Food_Panda_test.csv', 'a', encoding='utf-8-sig', newline='') as f:
            w = csv.DictWriter(f, dict1.keys())
            if not header_added:
                w.writeheader()
                header_added = True
            w.writerow(dict1)
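
One thing to watch with a header flag held in memory: it resets every time the script starts, so a second run appends another header row to the same file. A possible alternative, sketched below, is to decide from the file itself. The append_row helper is hypothetical, not something from your code.

```python
import csv
import os


def append_row(path, row):
    """Append one dict as a CSV row; write the header only if the file is new or empty."""
    needs_header = not os.path.exists(path) or os.path.getsize(path) == 0
    # newline='' lets the csv module control line endings itself
    with open(path, "a", encoding="utf-8-sig", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(row.keys()))
        if needs_header:
            writer.writeheader()
        writer.writerow(row)
```

Inside get_data() you would then call append_row('Food_Panda_test.csv', dict1) and drop the header_added global.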

3. Fix Scroll & Load Logic

Your current loop just re-fetches the same list over and over without scrolling. Let's add proper scroll behavior that waits for new elements to load, and stops when there's nothing left to scrape:

from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Make sure your driver initialization code is here (e.g., driver = webdriver.Chrome(...))

while True:
    # Get current list of restaurants
    current_restaurants = get_rest()
    # Process only new entries
    get_data(current_restaurants)
    
    # Scroll to load more restaurants
    try:
        # Scroll to the last restaurant to trigger lazy loading
        last_rest = current_restaurants[-1]
        driver.execute_script("arguments[0].scrollIntoView();", last_rest)
        
        # Wait for new restaurants to load (check if the list grows)
        WebDriverWait(driver, 10).until(
            # find_elements('xpath', ...) works on Selenium 3 and 4; find_elements_by_xpath was removed in 4.3
            lambda d: len(d.find_elements('xpath', '//ul[@class="vendor-list"]//li[@data-testid and not(@class)]')) > len(current_restaurants)
        )
    except (IndexError, TimeoutException):
        # IndexError = no restaurants loaded, Timeout = no more restaurants to fetch
        print("No more restaurants to load or reached end of page.")
        break
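
If it helps to reason about the loop separately from the browser, the control flow above can be sketched with plain callables. Everything here is illustrative: scrape_until_exhausted, FakePage, and the max_rounds safety cap are assumptions added for the example, and the fake page simply reveals a few more items on each scroll.

```python
def scrape_until_exhausted(fetch_items, process, scroll_and_wait, max_rounds=200):
    """Repeat fetch -> process -> scroll until the page stops growing.

    fetch_items():      returns the items currently on the page
    process(items):     handles the items (deduplication happens inside)
    scroll_and_wait(n): scrolls, then returns True if more than n items appear
    """
    rounds = 0
    while rounds < max_rounds:
        items = fetch_items()
        process(items)
        rounds += 1
        if not items or not scroll_and_wait(len(items)):
            break  # empty page, or nothing new arrived after scrolling
    return rounds


class FakePage:
    """Stand-in for the browser: shows `step` more items per scroll, up to `total`."""

    def __init__(self, total, step):
        self.total, self.step, self.visible = total, step, step

    def fetch(self):
        return list(range(self.visible))

    def scroll(self, n):
        if self.visible >= self.total:
            return False  # plays the role of the TimeoutException
        self.visible = min(self.total, self.visible + self.step)
        return self.visible > n


seen = set()
page = FakePage(total=10, step=4)
rounds = scrape_until_exhausted(page.fetch, seen.update, page.scroll)
print(rounds, len(seen))
```

The cap on rounds is only a guard against a page that keeps growing forever; pick whatever limit suits your run.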

4. Why Your Old Deduplication Code Failed

Your previous code used the locator //div[@class="q9uorilb"]//a, which belongs to a different website's DOM. Foodpanda uses different class names and a different layout, so that locator could not detect new elements here. The updated logic checks for new content with the same restaurant list locator you already use for scraping.

Extra Tips

  • Replace time.sleep(): Swap the time.sleep(15) in get_rest() with a more reliable wait for the restaurant list to load:
    def get_rest():
        # Define the locator once so the wait and the lookup cannot drift apart
        restaurant_locator = '//ul[@class="vendor-list"]//li[@data-testid and not(@class)]'
        # Block for up to 15 s until at least one restaurant item is in the DOM
        WebDriverWait(driver, 15).until(
            EC.presence_of_element_located(('xpath', restaurant_locator))
        )
        return driver.find_elements('xpath', restaurant_locator)
    
  • Adjust Wait Durations: If the loop stops before the page is exhausted, the 10-second WebDriverWait in the scroll loop is probably too short for your connection. Try raising it to 15 seconds.
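
WebDriverWait.until() accepts any callable that takes the driver, so the "list has grown" check used in the scroll loop can be pulled into a small reusable condition. This is a sketch; more_items_than is a name invented here, not a Selenium built-in.

```python
def more_items_than(locator, count):
    """Build a wait condition that is true once `locator` matches more than `count` elements.

    locator is a (by, value) tuple, e.g. ('xpath', '//ul[@class="vendor-list"]//li').
    """
    def _predicate(driver):
        return len(driver.find_elements(*locator)) > count
    return _predicate


# Intended use inside the scroll loop (requires a live Selenium driver):
#   WebDriverWait(driver, 15).until(
#       more_items_than(('xpath', restaurant_locator), len(current_restaurants))
#   )
```

Keeping the timeout and the condition on separate lines like this makes it easier to tune the wait without touching the locator.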

The question comes from Stack Exchange; it was asked by Abhishek Rai.
