How do I avoid duplicate data when scraping Foodpanda Singapore's infinite-scroll page?
Hey there! Let's fix that duplicate data issue you're facing with your Foodpanda Singapore scraper. The core problem here is that every time you scroll and fetch restaurants, you're reprocessing the same ones you already saved. Here's how to fix this step by step:
1. Track Scraped Restaurants with Unique IDs
Foodpanda assigns a unique data-testid attribute to each restaurant list item (you can see this in your existing locator). We'll use this attribute to keep track of which restaurants we've already processed, so we don't write duplicates to your CSV.
Add this global set at the top of your code to store processed IDs:
    scraped_ids = set()
    header_added = False  # Keep your existing header flag
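To see why a set is enough here, the idea can be sketched without a browser. The IDs below ("r1", "r2", ...) and the helper name new_ids are made up for illustration; they stand in for the data-testid values returned on successive scrolls, where each batch repeats everything already on the page.

```python
# Minimal sketch of set-based deduplication across overlapping batches.
# The IDs and the helper name are hypothetical, not taken from Foodpanda.
scraped_ids = set()

def new_ids(batch):
    """Return the IDs in `batch` that have not been seen before, in order."""
    fresh = []
    for rest_id in batch:
        if rest_id in scraped_ids:
            continue  # already processed on an earlier scroll
        scraped_ids.add(rest_id)
        fresh.append(rest_id)
    return fresh

# The second batch repeats the first three entries, as an infinite-scroll list would.
first = new_ids(["r1", "r2", "r3"])
second = new_ids(["r1", "r2", "r3", "r4", "r5"])
```

Membership checks on a set take roughly constant time, so this stays cheap even after many scrolls.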
2. Modify Data Collection to Skip Duplicates
Update your get_data() function to check if a restaurant has already been scraped before processing it. This ensures only new entries get written to your CSV:
    def get_data(rests):
        global header_added, scraped_ids
        for rest in rests:
            # Grab the unique ID of the restaurant
            rest_id = rest.get_attribute('data-testid')
            if rest_id in scraped_ids:
                continue  # Skip if we've already saved this restaurant
            scraped_ids.add(rest_id)  # Mark as processed

            # Your existing data extraction code goes here (keep this as-is)
            try:
                name = rest.find_element_by_xpath('.//span[@class="name fn"]').text
            except:
                name = 'No name'
            try:
                link_a = rest.find_element_by_xpath('.//a')
                link = link_a.get_attribute('href')
            except:
                link = 'No link available'
            try:
                rating = rest.find_element_by_xpath('.//span[@class="rating"]').text
                rating = rating[:-2]
            except:
                rating = 'No Ratings Available'
            try:
                cuisine = rest.find_element_by_xpath('.//ul[@class="categories summary"]').text
                cuisine = cuisine[4:]
            except:
                cuisine = 'Cuisine Details Not Available'
            try:
                distance = rest.find_element_by_xpath('.//span[@class="badge-info"]').text
            except:
                distance = "No Distance available"
            try:
                tags = rest.find_element_by_xpath('.//div[@class="tag-container"]').text
            except:
                tags = "No special Offers"
            try:
                cashback = rest.find_element_by_xpath('.//span[@class="vendor-cashback-info"]').text
            except:
                cashback = "No Cashback available"

            # Write to CSV only if it's a new restaurant
            dict1 = {'Restaurant Name': name, "Rating": rating, "Cuisine": cuisine,
                     "Delivery Time": distance, "Tags": tags, "Cashback": cashback}
            with open(f'Food_Panda_test.csv', 'a+', encoding='utf-8-sig') as f:
                w = csv.DictWriter(f, dict1.keys())
                if not header_added:
                    w.writeheader()
                    header_added = True
                w.writerow(dict1)
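If you want to check the skip-and-write logic without opening a browser, here is a rough sketch that writes to an in-memory buffer instead of Food_Panda_test.csv. The row dictionaries, the "id" key, the reduced FIELDS list, and the function name write_new_rows are all assumptions made for the example.

```python
import csv
import io

# Hypothetical, reduced column set for the example.
FIELDS = ["Restaurant Name", "Rating"]

def write_new_rows(buffer, rows, seen_ids, header_written):
    """Write rows whose 'id' is not in seen_ids; return the updated header flag."""
    writer = csv.DictWriter(buffer, FIELDS)
    for row in rows:
        if row["id"] in seen_ids:
            continue  # duplicate from an earlier batch
        seen_ids.add(row["id"])
        if not header_written:
            writer.writeheader()  # header goes out once, before the first row
            header_written = True
        writer.writerow({key: row[key] for key in FIELDS})
    return header_written

buffer = io.StringIO()
seen = set()
batch1 = [{"id": "r1", "Restaurant Name": "A", "Rating": "4.5"}]
batch2 = batch1 + [{"id": "r2", "Restaurant Name": "B", "Rating": "4.0"}]

header = write_new_rows(buffer, batch1, seen, False)
header = write_new_rows(buffer, batch2, seen, header)
lines = buffer.getvalue().splitlines()
```

Even though batch2 contains the first restaurant again, the buffer ends up with one header line and one line per restaurant.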
3. Fix Scroll & Load Logic
Your current loop just re-fetches the same list over and over without scrolling. Let's add proper scroll behavior that waits for new elements to load, and stops when there's nothing left to scrape:
    from selenium.webdriver.common.keys import Keys
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    # Make sure your driver initialization code is here (e.g., driver = webdriver.Chrome(...))

    while True:
        # Get current list of restaurants
        current_restaurants = get_rest()
        # Process only new entries
        get_data(current_restaurants)

        # Scroll to load more restaurants
        try:
            # Scroll to the last restaurant to trigger lazy loading
            last_rest = current_restaurants[-1]
            driver.execute_script("arguments[0].scrollIntoView();", last_rest)
            # Wait for new restaurants to load (check if the list grows)
            WebDriverWait(driver, 10).until(
                lambda d: len(d.find_elements_by_xpath('//ul[@class="vendor-list"]//li[@data-testid and not(@class)]')) > len(current_restaurants)
            )
        except (IndexError, TimeoutException):
            # IndexError = no restaurants loaded, Timeout = no more restaurants to fetch
            print("No more restaurants to load or reached end of page.")
            break
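The stop condition ("break when the list stops growing") can be tried out on its own with a stand-in for the page. FakePage and scrape_all below are hypothetical names for this sketch; the fake page reveals a fixed number of extra items per scroll, which is only a rough model of lazy loading.

```python
class FakePage:
    """Hypothetical stand-in for the browser: each scroll reveals up to `step` more items."""

    def __init__(self, total, step):
        self.total = total
        self.step = step
        self.visible = min(step, total)

    def items(self):
        return [f"r{i}" for i in range(self.visible)]

    def scroll(self):
        self.visible = min(self.visible + self.step, self.total)


def scrape_all(page):
    """Collect each item once, scrolling until the visible list stops growing."""
    collected = []
    seen = set()
    while True:
        current = page.items()
        for item in current:
            if item not in seen:
                seen.add(item)
                collected.append(item)
        page.scroll()
        if len(page.items()) <= len(current):
            break  # list did not grow: treat this as the end of the page
    return collected


result = scrape_all(FakePage(total=7, step=3))
empty = scrape_all(FakePage(total=0, step=3))
```

In the real loop the "did it grow?" check is what the WebDriverWait lambda does, with the timeout playing the role of the break.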
4. Why Your Old Deduplication Code Failed
Your previous code used a locator //div[@class="q9uorilb"]//a which is specific to another website's DOM structure. Foodpanda uses a completely different class name and layout, so that code couldn't detect new elements here. The updated logic uses Foodpanda's actual restaurant list locator to check for new content.
Extra Tips
- Replace time.sleep(): Swap the time.sleep(15) in get_rest() with a more reliable wait for the restaurant list to load:

      def get_rest():
          WebDriverWait(driver, 15).until(
              EC.presence_of_element_located(('xpath', '//ul[@class="vendor-list"]//li[@data-testid and not(@class)]'))
          )
          restaurant_locator = '//ul[@class="vendor-list"]//li[@data-testid and not(@class)]'
          restaurants = driver.find_elements_by_xpath(restaurant_locator)
          return restaurants

- Adjust Wait Durations: If you're getting timeouts too early, increase the WebDriverWait duration from 10 to 15 seconds.
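To make the difference from a fixed sleep concrete, here is a simplified, browser-free sketch of the polling idea behind an explicit wait. It is not Selenium's implementation; wait_until, its poll parameter, and the use of the built-in TimeoutError are assumptions for illustration only.

```python
import time

def wait_until(condition, timeout, poll=0.05):
    """Call `condition` repeatedly until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result  # return as soon as the condition holds
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within timeout")
        time.sleep(poll)

# Hypothetical condition that becomes true on the third check.
calls = {"count": 0}

def list_has_grown():
    calls["count"] += 1
    return calls["count"] >= 3

ok = wait_until(list_has_grown, timeout=2)

# A condition that never holds ends in a timeout instead of waiting forever.
timed_out = False
try:
    wait_until(lambda: False, timeout=0.1)
except TimeoutError:
    timed_out = True
```

The timeout is only an upper bound: when the page loads quickly the wait returns early, so raising it from 10 to 15 seconds costs nothing on fast loads.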
The question comes from Stack Exchange; it was asked by Abhishek Rai.




