Troubleshooting duplicate product data across pages when scraping Taobao with Selenium
Hey there, let's break down what's happening here and how to fix this confusing issue.
Why This Happens
The core problem boils down to timing: when you click the submit button to jump to page 2, you immediately grab the page_source before the page has fully loaded the new set of products.
Taobao uses dynamic front-end rendering (like most modern e-commerce platforms), meaning product data loads asynchronously after the basic page structure appears. So when you capture r2 right after clicking, the page's non-product elements (like pagination bar, header scripts, or ads) have updated (which is why r1 != r2), but the actual page 2 product data hasn't finished rendering yet. Your regex is still pulling the lingering page 1 product data from the DOM.
Fixes to Try
You need to add explicit waits to ensure the new page's products are fully loaded before capturing the source. Here are two reliable methods tailored to your code:
1. Wait for Product Elements to Load
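Use Selenium's WebDriverWait to wait until product items are present in the DOM before reading the source. Note that presence_of_element_located only checks that a matching element exists, so if page 1's items are still attached it can return right away; treat it as a minimum check rather than a guarantee that page 2 has rendered: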
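    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    # Click the submit button to jump to page 2
    d.click()

    # Wait up to 10 seconds for a product item to load (adjust selector if needed)
    WebDriverWait(browser, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, '.item.J_MouserOnverReq .J_ClickStat'))
    )

    # Now grab the page source
    r2 = browser.page_source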
2. Wait for the Page Number to Update
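Alternatively, wait until the pagination input box displays the target page number. Be aware that the box already contains "2" as soon as you type it in, so this check only confirms the page switch if the site resets or re-renders the box when the new page loads: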
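    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    d.click()

    # Wait until the input box value contains "2"
    WebDriverWait(browser, 10).until(
        EC.text_to_be_present_in_element_value((By.CSS_SELECTOR, '.input.J_Input'), '2')
    )

    r2 = browser.page_source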
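A stricter alternative to either wait is to poll until the extracted titles actually differ from page 1's, which cannot pass on stale content. The following is only a minimal sketch, not a tested Taobao recipe: wait_for_new_titles and its parameters are hypothetical names, get_source stands in for something like lambda: browser.page_source, and it assumes that two consecutive pages never list exactly the same titles.

```python
import re
import time

# Same pattern the original script uses to pull titles out of the page source.
TITLE_RE = re.compile(r'"raw_title":"(.*?)"', re.S)


def wait_for_new_titles(get_source, old_titles, timeout=10.0, poll=0.5):
    """Poll get_source() until the extracted titles differ from old_titles.

    get_source is any zero-argument callable returning the current HTML,
    for example lambda: browser.page_source (hypothetical usage).
    Returns the new title list, or raises TimeoutError if nothing changes.
    """
    deadline = time.monotonic() + timeout
    while True:
        titles = TITLE_RE.findall(get_source())
        # Non-empty and different from the previous page: new data has arrived.
        if titles and titles != old_titles:
            return titles
        if time.monotonic() >= deadline:
            raise TimeoutError("titles did not change within %.1fs" % timeout)
        time.sleep(poll)
```

In the original script this would be called right after d.click(), for example title2 = wait_for_new_titles(lambda: browser.page_source, title1), where title1 holds the titles extracted from page 1.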
Bonus: Clean Up Redundant Code
You can refactor your repeated page-jumping logic into a function to avoid duplication and make the code easier to maintain:
    import re

    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait


    def get_page_data(page_num):
        # Jump to the target page.
        # find_element(By.CSS_SELECTOR, ...) replaces find_element_by_css_selector,
        # which was removed in Selenium 4.
        c = browser.find_element(By.CSS_SELECTOR, '.input.J_Input')
        c.clear()
        c.send_keys(str(page_num))
        d = browser.find_element(By.CSS_SELECTOR, '.btn.J_Submit')
        d.click()

        # Wait for the page to load
        WebDriverWait(browser, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, '.item.J_MouserOnverReq'))
        )

        # Extract data from the rendered source
        page_source = browser.page_source
        title_list = re.findall('"raw_title":"(.*?)"', page_source, re.S)
        price_list = re.findall('"view_price":"(.*?)"', page_source, re.S)
        stall_list = re.findall('user_id.*?"nick":"(.*?)"', page_source, re.S)
        return title_list, price_list, stall_list


    # Usage (browser is the WebDriver instance already created in your script)
    title1, price1, stall1 = get_page_data(1)
    title2, price2, stall2 = get_page_data(2)
After implementing these waits, your extracted data from page 1 and page 2 should be distinct as expected!
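To check that the pages really are distinct, you could compare the extracted title lists directly. This is a rough sketch under stated assumptions: shared_titles and looks_like_stale_page are hypothetical helper names, and the 0.9 threshold is an arbitrary choice, since a few products may legitimately appear on both pages.

```python
def shared_titles(page1_titles, page2_titles):
    """Return the titles that appear on both pages, in page 2 order."""
    seen = set(page1_titles)
    return [t for t in page2_titles if t in seen]


def looks_like_stale_page(page1_titles, page2_titles, threshold=0.9):
    """Heuristic: True if nearly all of page 2's titles also appear on page 1."""
    if not page2_titles:
        # Extracting nothing is also a sign the page was not ready.
        return True
    overlap = len(shared_titles(page1_titles, page2_titles))
    return overlap / len(page2_titles) >= threshold
```

If looks_like_stale_page(title1, title2) returns True, the source was probably captured before page 2's products arrived, and the wait needs to be stricter or longer.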
This question comes from Stack Exchange; it was asked by Bin.




