Troubleshooting Duplicate Product Data Extracted from Different Pages When Scraping Taobao with Selenium

阿华AIGC实验室

2026-5-28

Troubleshooting Taobao Scraping: Different Page Sources but Identical Extracted Data

Hey there, let's break down what's happening here and how to fix this confusing issue.

Why This Happens

The core problem boils down to timing: when you click the submit button to jump to page 2, you immediately grab the page_source before the page has fully loaded the new set of products.

Taobao uses dynamic front-end rendering (like most modern e-commerce platforms), meaning product data loads asynchronously after the basic page structure appears. So when you capture r2 right after clicking, the page's non-product elements (like pagination bar, header scripts, or ads) have updated (which is why r1 != r2), but the actual page 2 product data hasn't finished rendering yet. Your regex is still pulling the lingering page 1 product data from the DOM.
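To make the symptom concrete, here is a small sketch that needs no browser. The two strings are made-up stand-ins for r1 and r2, not real Taobao markup: the pager differs between them, but the embedded product JSON is still page 1's, so the regex returns the same list for both.

```python
import re

# Hypothetical snapshots: r2 was captured right after the click, so only
# the pager has changed; the embedded product JSON is still page 1's.
r1 = '<ul class="pager"><li class="active">1</li></ul>{"raw_title":"Phone case","view_price":"9.90"}'
r2 = '<ul class="pager"><li class="active">2</li></ul>{"raw_title":"Phone case","view_price":"9.90"}'

TITLE_RE = re.compile(r'"raw_title":"(.*?)"', re.S)

titles_1 = TITLE_RE.findall(r1)
titles_2 = TITLE_RE.findall(r2)

sources_differ = r1 != r2          # the raw sources are not equal
titles_match = titles_1 == titles_2  # yet the extracted data is identical
print(sources_differ, titles_match)
```

Comparing the extracted lists rather than the raw sources is therefore the more useful check when debugging this kind of issue.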

Fixes to Try

You need to add explicit waits to ensure the new page's products are fully loaded before capturing the source. Here are two reliable methods tailored to your code:

1. Wait for Product Elements to Load

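Use Selenium's WebDriverWait to wait until the new page's product items are in the DOM. Page 1's items are already present, so a plain presence check can pass immediately. To avoid that, keep a handle on one of the old items before clicking, wait for it to go stale, and only then wait for the new items. This assumes the site replaces the item nodes when the page changes; if it updates them in place, use the second method instead.

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

ITEM = (By.CSS_SELECTOR, '.item.J_MouserOnverReq .J_ClickStat')  # adjust selector if needed

# Keep a handle on a page 1 product, then click the submit button for page 2
old_item = browser.find_element(*ITEM)
d.click()

wait = WebDriverWait(browser, 10)  # up to 10 seconds per condition
wait.until(EC.staleness_of(old_item))             # the page 1 item left the DOM
wait.until(EC.presence_of_element_located(ITEM))  # a new product item is attached

# Now grab the page source
r2 = browser.page_source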

2. Wait for the Page Number to Update

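Alternatively, wait until the pager highlights the target page number. Don't wait on the input box itself: you have just typed "2" into it, so its value matches before the page has switched. If the pager turns out to update before the product list does, combine this check with the first method.

d.click()

# Wait until the highlighted pager item reads "2".
# The selector below is an assumption about Taobao's markup; verify it in DevTools.
WebDriverWait(browser, 10).until(
    EC.text_to_be_present_in_element((By.CSS_SELECTOR, '#mainsrp-pager li.item.active > span'), '2')
)

r2 = browser.page_source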
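A further option is to wait on the extracted data itself rather than on a page element. WebDriverWait.until accepts any callable that takes the driver and returns a truthy value on success, so the sketch below builds such a callable around the raw_title regex from the original script. The helper name titles_changed is hypothetical, and the approach assumes that two different pages never yield an identical title list.

```python
import re

TITLE_RE = re.compile(r'"raw_title":"(.*?)"', re.S)


def titles_changed(old_titles):
    """Build a wait condition that succeeds once the extracted titles differ."""
    def condition(driver):
        new_titles = TITLE_RE.findall(driver.page_source)
        # Return the new list (truthy) on success, False to keep polling.
        if new_titles and new_titles != old_titles:
            return new_titles
        return False
    return condition


# Usage with Selenium (browser and d come from the original script):
#   old_titles = TITLE_RE.findall(browser.page_source)
#   d.click()
#   new_titles = WebDriverWait(browser, 10).until(titles_changed(old_titles))
```

Because the condition checks exactly what the script goes on to extract, it does not depend on any CSS selector staying valid.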

Bonus: Clean Up Redundant Code

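You can refactor your repeated page-jumping logic into a function to avoid duplication and make the code easier to maintain. The version below uses find_element(By.CSS_SELECTOR, ...), since the find_element_by_* helpers were deprecated in Selenium 4 and later removed, and it waits for the old product items to go stale so that leftover items from the previous page cannot satisfy the wait:

import re

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

ITEM = (By.CSS_SELECTOR, '.item.J_MouserOnverReq')

def get_page_data(page_num):
    # Remember a product from the page currently displayed
    old_item = browser.find_element(*ITEM)

    # Jump to target page
    c = browser.find_element(By.CSS_SELECTOR, '.input.J_Input')
    c.clear()
    c.send_keys(str(page_num))
    d = browser.find_element(By.CSS_SELECTOR, '.btn.J_Submit')
    d.click()

    # Wait for the old products to be replaced by the target page's products
    wait = WebDriverWait(browser, 10)
    wait.until(EC.staleness_of(old_item))
    wait.until(EC.presence_of_element_located(ITEM))

    # Extract data
    page_source = browser.page_source
    title_list = re.findall('"raw_title":"(.*?)"', page_source, re.S)
    price_list = re.findall('"view_price":"(.*?)"', page_source, re.S)
    stall_list = re.findall('user_id.*?"nick":"(.*?)"', page_source, re.S)

    return title_list, price_list, stall_list

# Usage
# Note: if re-submitting the page already displayed does not re-render the list,
# the staleness wait times out; parse browser.page_source directly for that page.
title1, price1, stall1 = get_page_data(1)
title2, price2, stall2 = get_page_data(2)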

After implementing these waits, your extracted data from page 1 and page 2 should be distinct as expected!
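To confirm the fix, it can help to compare the extracted lists directly instead of eyeballing them. The following is a minimal sketch with hypothetical helper names; the 0.8 threshold is an arbitrary assumption, chosen because a few promoted items can legitimately appear on more than one page.

```python
def shared_items(page_a, page_b):
    """Return items appearing on both pages, in page_a order, without repeats."""
    in_b = set(page_b)
    shared = []
    for item in page_a:
        if item in in_b and item not in shared:
            shared.append(item)
    return shared


def looks_like_same_page(page_a, page_b, threshold=0.8):
    """Heuristic: True if most of page_a's distinct items also appear on page_b."""
    if not page_a:
        return False
    return len(shared_items(page_a, page_b)) / len(set(page_a)) >= threshold


# Usage with the lists returned by the original script, for example:
#   if looks_like_same_page(title1, title2):
#       print("Page 2 still contains page 1 data; the wait did not work")
```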

The question discussed in this post comes from Stack Exchange; it was asked by Bin.