Scraping Kogan.com with Scrapy: pagination trouble with an href-less next button and a page-number cap

阿华AIGC实验室

2026-5-12

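Handling the href-less "View more" button and the page limit when scraping Kogan.com

Both symptoms most likely come from the way the site loads content, combined with its anti-scraping measures. Editing the URL by hand stops working past page=10 because the backend appears to enforce a page-number cap. The "View more" button has no href because it loads the next batch asynchronously through JavaScript instead of navigating to a new URL. Here are a few practical ways to deal with it: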

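1. Inspect the network traffic and call the asynchronous API (recommended first)

This is the most efficient approach because it needs no browser automation. The steps:

Open the browser developer tools (F12), switch to the Network panel, and filter by XHR/Fetch requests
Open the target page https://www.kogan.com/au/shop/phones/?page=10 and click the "View more" button
Watch for the new request and find the endpoint that returns the product list (usually JSON)
Inspect that endpoint's request parameters: it most likely does not use page at all, but an identifier such as cursor, offset, or last_item_id that tells the server where the next batch starts
Build the same API request directly in Scrapy with matching headers (for example User-Agent, Referer, and Cookie, consistent with what the browser sent), and keep requesting until the endpoint returns empty data or signals that nothing more is available

A simple pseudo-code example (the endpoint URL and the products and next_cursor field names are placeholders; substitute whatever the Network panel actually shows):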

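import json
import scrapy

# Method of the spider class; self.parse_item and self.headers must be defined there.
def parse_api(self, response):
    data = json.loads(response.text)
    # Yield each product in this batch ('products' is an assumed field name)
    for item in data.get('products', []):
        yield self.parse_item(item)

    # Follow the cursor to the next batch, if the API returned one
    next_cursor = data.get('next_cursor')
    if next_cursor:
        # Placeholder endpoint; use the URL observed in the Network panel
        api_url = f"https://www.kogan.com/au/api/products?cursor={next_cursor}"
        yield scrapy.Request(api_url, callback=self.parse_api, headers=self.headers)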
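The f-string above inserts the cursor into the URL without escaping it, which can break if the cursor contains characters such as & or =. A minimal sketch of a safer URL builder follows. The endpoint, the next_cursor field, and the cursor query parameter are the same placeholders as in the pseudo-code, not Kogan's confirmed API, and next_api_url is a hypothetical helper name.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Placeholder endpoint taken from the pseudo-code; replace it with the URL
# actually observed in the Network panel.
API_URL = "https://www.kogan.com/au/api/products"


def next_api_url(payload, base_url=API_URL, cursor_key="next_cursor"):
    """Return the URL of the next batch, or None when the payload has no cursor."""
    cursor = payload.get(cursor_key)
    if not cursor:
        return None  # no cursor means the last batch has been reached
    scheme, netloc, path, query, fragment = urlsplit(base_url)
    params = dict(parse_qsl(query))  # keep any query parameters already present
    params["cursor"] = cursor        # assumed parameter name
    return urlunsplit((scheme, netloc, path, urlencode(params), fragment))
```

Inside parse_api, the callback could call next_api_url(data) and yield a new scrapy.Request only when the result is not None.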

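2. Simulate the click with Selenium or Playwright

If the API is encrypted or hard to reverse-engineer, fall back to simulating real user actions:

Integrate Selenium into the Scrapy project (or Playwright, which is generally the lighter option)
Open the target page with the browser driver and locate the button that carries rel="next" (for example with the XPath //button[@rel="next"])
Click it and wait for the new content to load (an explicit wait works well, for example waiting for a new product element to appear)
Scrape the product data on the current page, then keep clicking until the button disappears or is no longer clickable

Example snippet (Scrapy + Selenium):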

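import scrapy
from selenium import webdriver
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


class KoganSpider(scrapy.Spider):
    name = 'kogan'
    start_urls = ['https://www.kogan.com/au/shop/phones/?page=10']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.driver = webdriver.Chrome()

    def parse(self, response):
        self.driver.get(response.url)
        seen = 0  # number of products already yielded

        while True:
            # Parse the rendered page with Scrapy's Selector.
            # The CSS classes below are placeholders; adjust them to the real markup.
            selector = scrapy.Selector(text=self.driver.page_source)
            products = selector.css('.product-item')
            # "View more" appends to the list, so only yield the new items
            for product in products[seen:]:
                yield {
                    'name': product.css('.product-name::text').get(),
                    'price': product.css('.product-price::text').get()
                }
            seen = len(products)

            # Try to click the next button
            try:
                next_button = WebDriverWait(self.driver, 10).until(
                    EC.element_to_be_clickable((By.XPATH, '//button[@rel="next"]'))
                )
                next_button.click()
                # Wait until more products are present than before the click
                WebDriverWait(self.driver, 10).until(
                    lambda d: len(d.find_elements(By.CSS_SELECTOR, '.product-item')) > seen
                )
            except WebDriverException:
                # No clickable button or no new content (timeouts included): stop
                break

    def closed(self, reason):
        # Called by Scrapy when the spider finishes; release the browser here
        self.driver.quit()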
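The same click loop can also be sketched with Playwright, which the steps above mention as an alternative to Selenium. This is only an illustration under assumptions: the button[rel="next"] and .product-item selectors are placeholders carried over from the Selenium example, and click_until_exhausted and fetch_fully_expanded_html are hypothetical helper names. The loop polls the item count rather than relying on a specific wait condition, and it takes the page object as a parameter so it can be exercised without launching a browser.

```python
def click_until_exhausted(page, button_selector, item_selector,
                          max_clicks=200, timeout_ms=10_000, poll_ms=250):
    """Click the load-more button until it disappears or stops adding items.

    `page` is expected to behave like a Playwright sync `Page`.
    Returns the number of clicks that actually loaded new items.
    """
    clicks = 0
    while clicks < max_clicks:
        button = page.locator(button_selector)
        if button.count() == 0 or not button.first.is_visible():
            break  # button gone: everything is loaded
        before = page.locator(item_selector).count()
        button.first.click()
        # Poll until the item count grows or the timeout is used up
        waited = 0
        while page.locator(item_selector).count() <= before and waited < timeout_ms:
            page.wait_for_timeout(poll_ms)
            waited += poll_ms
        if page.locator(item_selector).count() <= before:
            break  # the click loaded nothing new
        clicks += 1
    return clicks


def fetch_fully_expanded_html(url, button_selector='button[rel="next"]',
                              item_selector=".product-item"):
    """Open `url`, expand the whole listing, and return the rendered HTML."""
    # Imported lazily so the helper above works without Playwright installed
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url)
            click_until_exhausted(page, button_selector, item_selector)
            return page.content()
        finally:
            browser.close()
```

The returned HTML can then be passed to scrapy.Selector(text=...) and parsed in the same way as driver.page_source in the Selenium example.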