Help needed: Scrapy-Playwright keeps returning the first page's data when crawling multiple pages

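I'm new to web scraping and recently tried to crawl a local e-commerce site (the Xbox games category on Daraz Bangladesh). Because the site is dynamic, I used Scrapy with Playwright (Chromium), plus a proxy.

Everything ran fine at first, but crawling multiple pages went wrong: I used URLs with different page numbers, yet instead of fetching different pages the program returned the first page's data over and over. I suspect Playwright, but I can't tell whether my code is wrong or there is a bug. I've tried multiprocessing and running with and without a proxy and User-Agent, and the result is always the same. I can't work out what is going wrong.

My code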

import logging
import scrapy
from scrapy_playwright.page import PageMethod
from helper import should_abort_request

class ABCSpider(scrapy.Spider):
    name = "ABC"
    custom_settings = {
        'PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT': '100000',
        'PLAYWRIGHT_ABORT_REQUEST': should_abort_request
    }

    def start_requests(self):
        yield scrapy.Request(
            url='https://www.daraz.com.bd/xbox-games/?spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO&page=1',
            meta={
                "playwright": True,
                "playwright_include_page": True,
                "playwright_page_methods": [
                    PageMethod("wait_for_selector", '[class="box--LNmE6"]'),
                ],
            },
        )

    async def parse(self, response):
        total= response.xpath('/html/body/div[3]/div/div[2]/div/div/div[1]/div[3]/div/ul/li[last()-1]/a/text()').extract()[0]
        total_pages = int(total)   #total_pages = 4
        links = []
        for i in range(1, total_pages+1):
            a = 'https://www.daraz.com.bd/xbox-games/?spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO&page={}'.format(i)
            links.append(a)
        
        for link in links:
            res = scrapy.Request(url=link, meta={
                "playwright": True,
                "playwright_include_page": True,
                "playwright_page_methods": [
                    PageMethod("wait_for_selector",
                    '[class="box--ujueT"]'),
                ]})
            yield res and {
                "link" : response.url
            }

Output

[
    {"link": "https://www.daraz.com.bd/xbox-games/?spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO&page=1"},
    {"link": "https://www.daraz.com.bd/xbox-games/?spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO&page=1"},
    {"link": "https://www.daraz.com.bd/xbox-games/?spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO&page=1"},
    {"link": "https://www.daraz.com.bd/xbox-games/?spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO&page=1"}
]

Root cause analysis

The core problem is in the last part of your parse method:

yield res and {
    "link" : response.url
}

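Here response is the response to the initial request (page 1), not the response to any of the new requests built from the links in your loop. The expression res and {...} is also wrong: a Request object is truthy, so Python's and evaluates to its right-hand operand, the dict. Only the dict is yielded, once per link, and it always carries the first page's response.url, which is why every output entry is the page 1 link.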
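The behaviour of the `res and {...}` expression can be checked in plain Python without running a crawl. The sketch below is only an illustration: `FakeRequest` is a hypothetical stand-in for `scrapy.Request`, and the example.com URLs are placeholders.

```python
class FakeRequest:
    """Hypothetical stand-in for scrapy.Request.

    It defines neither __bool__ nor __len__, so every instance is truthy.
    """

    def __init__(self, url):
        self.url = url


def buggy_parse(first_page_url, links):
    """Mimics the loop at the end of the original parse method."""
    for link in links:
        res = FakeRequest(link)
        # `res` is truthy, so `res and {...}` evaluates to the dict.
        # The request object itself is never yielded.
        yield res and {"link": first_page_url}


links = [f"https://example.com/list?page={i}" for i in range(1, 4)]
results = list(buggy_parse(links[0], links))

for item in results:
    print(type(item).__name__, item["link"])
```

Every printed line shows a dict holding the page 1 URL; no `FakeRequest` ever leaves the generator.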

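As a result, the Request objects built in the loop are never yielded, so Scrapy never schedules them and no other page is downloaded. Yield each Request on its own, and either give these requests a separate callback to handle their responses or distinguish the initial response from the paginated ones inside parse.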
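To see why each page needs its own response and callback, here is a toy model of the request/callback flow. It is not Scrapy's engine: `ToyRequest`, `ToyResponse`, and `crawl` are made-up names, nothing is downloaded, and there is no async handling or duplicate filtering.

```python
from dataclasses import dataclass
from typing import Callable

BASE = "https://example.com/list?page={}"  # placeholder URL pattern


@dataclass
class ToyRequest:
    url: str
    callback: Callable


@dataclass
class ToyResponse:
    url: str


def crawl(start):
    """Toy engine: 'download' each request, hand the response to its callback,
    queue any yielded requests, and collect any yielded items."""
    queue, items = [start], []
    while queue:
        request = queue.pop(0)
        response = ToyResponse(url=request.url)  # pretend download
        for result in request.callback(response):
            if isinstance(result, ToyRequest):
                queue.append(result)
            else:
                items.append(result)
    return items


def parse_initial(response):
    # Yield one request per page; each gets its own response later.
    for i in range(1, 5):
        yield ToyRequest(BASE.format(i), callback=parse_page)


def parse_page(response):
    # `response` here belongs to the paginated request, not to page 1.
    yield {"link": response.url}


items = crawl(ToyRequest(BASE.format(1), callback=parse_initial))
print([item["link"] for item in items])
```

In this model the items carry four different URLs because `parse_page` reads the URL of the response it was handed, rather than the one captured in the first callback.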

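Fix

Option 1: Give the paginated requests their own callback

Change the code so that the paginated responses are handled separately; each one then reports the URL of its own page: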

import logging
import scrapy
from scrapy_playwright.page import PageMethod
from helper import should_abort_request

class ABCSpider(scrapy.Spider):
    name = "ABC"
    custom_settings = {
        'PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT': '100000',
        'PLAYWRIGHT_ABORT_REQUEST': should_abort_request
    }

    def start_requests(self):
        yield scrapy.Request(
            url='https://www.daraz.com.bd/xbox-games/?spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO&page=1',
            meta={
                "playwright": True,
                "playwright_include_page": True,
                "playwright_page_methods": [
                    PageMethod("wait_for_selector", '[class="box--LNmE6"]'),
                ],
            },
            callback=self.parse_initial  # callback for the initial request
        )

    async def parse_initial(self, response):
        total= response.xpath('/html/body/div[3]/div/div[2]/div/div/div[1]/div[3]/div/ul/li[last()-1]/a/text()').extract()[0]
        total_pages = int(total)   #total_pages = 4
        
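        for i in range(1, total_pages+1):
            page_url = 'https://www.daraz.com.bd/xbox-games/?spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO&page={}'.format(i)
            yield scrapy.Request(
                url=page_url,
                meta={
                    "playwright": True,
                    # "playwright_include_page" is left out: parse_page never uses
                    # the page object, and an included page stays open until it is
                    # closed explicitly.
                    "playwright_page_methods": [
                        PageMethod("wait_for_selector", '[class="box--ujueT"]'),
                    ]
                },
                callback=self.parse_page,  # callback for the paginated requests
                # page=1 was already requested in start_requests; without this,
                # Scrapy's duplicate filter would drop the repeat request.
                dont_filter=True,
            )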

    async def parse_page(self, response):
        # Handle each paginated response here, e.g. extract product data, or just return the current page's URL
        yield {
            "link": response.url
        }

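Option 2: Distinguish the request type inside parse (optional)

If you would rather not split the callbacks, parse can check whether the current response belongs to the initial request, although separate callbacks are clearer: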

async def parse(self, response):
    # The initial request is recognised by page=1 at the end of the URL
    # (a flag in meta would also work)
    if response.url.endswith('page=1'):
        total= response.xpath('/html/body/div[3]/div/div[2]/div/div/div[1]/div[3]/div/ul/li[last()-1]/a/text()').extract()[0]
        total_pages = int(total)
        # Start at 2: page 1 is this response, and a second request for the
        # same URL would be dropped by Scrapy's duplicate filter
        for i in range(2, total_pages+1):
            page_url = 'https://www.daraz.com.bd/xbox-games/?spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO&page={}'.format(i)
            yield scrapy.Request(
                url=page_url,
                meta={
                    "playwright": True,
                    # "playwright_include_page" is left out because the page
                    # object is never used or closed here
                    "playwright_page_methods": [
                        PageMethod("wait_for_selector", '[class="box--ujueT"]'),
                    ]
                }
            )
    # Every response, page 1 included, yields its own URL
    yield {
        "link": response.url
    }
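Checking the URL with `endswith('page=1')` only works while `page` stays the last query parameter. As a sketch of a less position-dependent check, the hypothetical helper below (`page_number` is not part of Scrapy) reads the parameter from the parsed query string with the standard library:

```python
from urllib.parse import parse_qs, urlsplit


def page_number(url: str, default: int = 1) -> int:
    """Return the value of the `page` query parameter, or `default` if it is
    missing or not an integer. Hypothetical helper for illustration."""
    values = parse_qs(urlsplit(url).query).get("page")
    if not values:
        return default
    try:
        return int(values[0])
    except ValueError:
        return default


first = "https://www.daraz.com.bd/xbox-games/?page=1&spm=example"
third = "https://www.daraz.com.bd/xbox-games/?spm=example&page=3"

print(first.endswith("page=1"), page_number(first))
print(page_number(third))
```

With a helper like this, the branch in parse could test `page_number(response.url) == 1` instead of relying on where the parameter sits in the URL.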

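Additional suggestions

  • Avoid absolute XPath expressions such as /html/body/div[3]/...; they are fragile and break when the page structure changes even slightly. A path anchored on a stable attribute is more robust, for example //ul[contains(@class, "pagination")]/li[last()-1]/a/text() (adjust it to the actual page structure).
  • Add real product extraction logic in parse_page rather than only returning the page URL.
  • Check that the Playwright wait logic is correct: the selector passed to wait_for_selector should match an element that appears only once the page has finished loading.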
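The idea behind the attribute-anchored path can be tried offline. The sketch below uses the standard library's `xml.etree.ElementTree` on invented markup; the `pagination` class and the list layout are assumptions, not Daraz's real HTML. ElementTree supports only a subset of XPath (no `contains()` or `text()`), so it matches the class exactly; in a spider the expression would go through `response.xpath` instead.

```python
import xml.etree.ElementTree as ET

# Invented sample markup: Prev, pages 1-4, Next.
SAMPLE = """
<html><body><div><div>
  <ul class="pagination">
    <li><a>Prev</a></li>
    <li><a>1</a></li>
    <li><a>2</a></li>
    <li><a>3</a></li>
    <li><a>4</a></li>
    <li><a>Next</a></li>
  </ul>
</div></div></body></html>
"""


def total_pages(markup: str, default: int = 1) -> int:
    """Read the last page number from the second-to-last pagination item."""
    root = ET.fromstring(markup)
    # Anchored on the class attribute instead of a chain of div[n] steps.
    link = root.find(".//ul[@class='pagination']/li[last()-1]/a")
    if link is None or not (link.text or "").strip().isdigit():
        return default  # fall back rather than raise on an unexpected layout
    return int(link.text.strip())


print(total_pages(SAMPLE))
```

The fallback also avoids the IndexError that `extract()[0]` raises when the expression matches nothing.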

Note: content sourced from Stack Exchange; question asked by Sadik.
