Scrapy-Playwright keeps returning first-page data when crawling multiple pages: help needed
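I'm new to web scraping and recently tried to scrape a local e-commerce site (the Xbox games category on Daraz Bangladesh). Because the site is dynamic, I used Scrapy with Playwright (Chromium), plus a proxy.
Everything ran fine at first, but crawling multiple pages went wrong: I used URLs with different page numbers, yet the spider scraped the first page's data repeatedly instead of the different pages. I suspect Playwright, but I can't tell whether my code is wrong or there is a bug. I tried multiprocessing, and running with and without the proxy and User-Agent, and the result is always the same. I really can't figure out where it goes wrong...
My code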
    import logging
    import scrapy
    from scrapy_playwright.page import PageMethod
    from helper import should_abort_request

    class ABCSpider(scrapy.Spider):
        name = "ABC"
        custom_settings = {
            'PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT': '100000',
            'PLAYWRIGHT_ABORT_REQUEST': should_abort_request
        }

        def start_requests(self):
            yield scrapy.Request(
                url='https://www.daraz.com.bd/xbox-games/?spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO&page=1',
                meta={
                    "playwright": True,
                    "playwright_include_page": True,
                    "playwright_page_methods": [
                        PageMethod("wait_for_selector", '[class="box--LNmE6"]'),
                    ],
                },
            )

        async def parse(self, response):
            total= response.xpath('/html/body/div[3]/div/div[2]/div/div/div[1]/div[3]/div/ul/li[last()-1]/a/text()').extract()[0]
            total_pages = int(total)
            #total_pages = 4
            links = []
            for i in range(1, total_pages+1):
                a = 'https://www.daraz.com.bd/xbox-games/?spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO&page={}'.format(i)
                links.append(a)
            for link in links:
                res = scrapy.Request(url=link, meta={
                    "playwright": True,
                    "playwright_include_page": True,
                    "playwright_page_methods": [
                        PageMethod("wait_for_selector", '[class="box--ujueT"]'),
                    ]})
                yield res and { "link" : response.url }
Output
    [
      {"link": "https://www.daraz.com.bd/xbox-games/?spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO&page=1"},
      {"link": "https://www.daraz.com.bd/xbox-games/?spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO&page=1"},
      {"link": "https://www.daraz.com.bd/xbox-games/?spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO&page=1"},
      {"link": "https://www.daraz.com.bd/xbox-games/?spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO&page=1"}
    ]
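Root cause
The core problem is the last statement of your parse method:
    yield res and { "link" : response.url }
The response here is the response to the initial request (page 1), not the response to the request you build for each link. The expression res and {...} also does not do what it appears to: Python's and returns its second operand when the first is truthy, and a scrapy.Request object is always truthy, so the statement yields only the dict and the request is discarded. Every dict carries response.url from page 1, which is why the output repeats the first page's link once per loop iteration.
Because the requests are never yielded, Scrapy never schedules them, and pages 2 onward are never fetched at all. Yield each request on its own, and read the URL (and the data) from the response that arrives in its callback. You can either give the pagination requests a separate callback or distinguish the initial response from the follow-up responses inside parse.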
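The behaviour of the and operator is easy to check without Scrapy. The sketch below is only an illustration: FakeRequest is a hypothetical stand-in for scrapy.Request (a plain object is truthy, just as a Request is), and the example.com URLs are invented.

```python
class FakeRequest:
    """Hypothetical stand-in for scrapy.Request; truthy like any plain object."""

    def __init__(self, url):
        self.url = url


def buggy(first_page_url, links):
    # Mirrors the original loop: `res and {...}` evaluates to the dict,
    # so the request object never leaves the generator.
    for link in links:
        res = FakeRequest(link)
        yield res and {"link": first_page_url}


def fixed(links):
    # Yield the request itself; the URL is read later from its own response.
    for link in links:
        yield FakeRequest(link)


links = [f"https://example.com/list?page={i}" for i in range(1, 4)]
buggy_out = list(buggy(links[0], links))
fixed_out = list(fixed(links))

print(buggy_out)                   # three dicts, all pointing at page=1
print([r.url for r in fixed_out])  # one request per page
```

In the buggy variant nothing request-like ever reaches the caller, which matches the symptom in the question: one identical item per loop iteration and no further pages fetched.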
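Fixes
Solution 1: give the pagination requests their own callback
Change the code so that each pagination response is handled in its own callback, which gives you the URL of the page that was actually fetched: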
    import scrapy
    from scrapy_playwright.page import PageMethod
    from helper import should_abort_request

    BASE_URL = 'https://www.daraz.com.bd/xbox-games/?spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO&page={}'

    class ABCSpider(scrapy.Spider):
        name = "ABC"
        custom_settings = {
            'PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT': 100000,  # milliseconds
            'PLAYWRIGHT_ABORT_REQUEST': should_abort_request
        }

        def start_requests(self):
            # "playwright_include_page" is left out: the callbacks never use the
            # page object, and an included page stays open until it is closed explicitly.
            yield scrapy.Request(
                url=BASE_URL.format(1),
                meta={
                    "playwright": True,
                    "playwright_page_methods": [
                        PageMethod("wait_for_selector", '[class="box--LNmE6"]'),
                    ],
                },
                callback=self.parse_initial  # callback for the initial request
            )

        async def parse_initial(self, response):
            total = response.xpath('/html/body/div[3]/div/div[2]/div/div/div[1]/div[3]/div/ul/li[last()-1]/a/text()').extract()[0]
            total_pages = int(total)
            #total_pages = 4

            # This response already is page 1. Requesting the same URL again
            # would be dropped by Scrapy's duplicate filter, so emit its item here.
            yield { "link": response.url }

            for i in range(2, total_pages+1):
                yield scrapy.Request(
                    url=BASE_URL.format(i),
                    meta={
                        "playwright": True,
                        "playwright_page_methods": [
                            PageMethod("wait_for_selector", '[class="box--ujueT"]'),
                        ]
                    },
                    callback=self.parse_page  # callback for the pagination requests
                )

        async def parse_page(self, response):
            # Handle each page's response here, e.g. extract product data;
            # for now only the URL of the current page is returned.
            yield { "link": response.url }
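Solution 2: distinguish the request type inside parse (optional)
If you would rather not split the callbacks, parse can instead check whether the current response belongs to the initial request, though separate callbacks are clearer: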
    async def parse(self, response):
        # Every response, page 1 included, produces an item
        yield { "link": response.url }

        # Only the initial response schedules the remaining pages
        # (detected here via page=1 in the URL; a flag in meta would also work)
        if response.url.endswith('page=1'):
            total = response.xpath('/html/body/div[3]/div/div[2]/div/div/div[1]/div[3]/div/ul/li[last()-1]/a/text()').extract()[0]
            total_pages = int(total)
            # Start at 2: the page=1 URL was already requested and would be
            # dropped by Scrapy's duplicate filter
            for i in range(2, total_pages+1):
                page_url = 'https://www.daraz.com.bd/xbox-games/?spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO&page={}'.format(i)
                yield scrapy.Request(
                    url=page_url,
                    meta={
                        "playwright": True,
                        "playwright_page_methods": [
                            PageMethod("wait_for_selector", '[class="box--ujueT"]'),
                        ]
                    }
                )
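Checking the end of the URL string only works while page is the last query parameter. A slightly sturdier variant, sketched here with the standard library only, reads the parameter explicitly. The helper names page_number and with_page are hypothetical, not part of Scrapy or scrapy-playwright, and the sort parameter in the usage lines is invented for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def page_number(url, default=1):
    """Return the value of the 'page' query parameter as an int."""
    values = parse_qs(urlsplit(url).query).get("page")
    return int(values[0]) if values else default


def with_page(url, page):
    """Return a copy of url whose 'page' query parameter is set to page."""
    parts = urlsplit(url)
    query = parse_qs(parts.query, keep_blank_values=True)
    query["page"] = [str(page)]
    return urlunsplit(parts._replace(query=urlencode(query, doseq=True)))


base = ("https://www.daraz.com.bd/xbox-games/"
        "?spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO&page=1")

print(page_number(base))                    # page parsed from the query string
print(with_page(base, 3))                   # same URL with page=3
print(page_number(base + "&sort=priceasc"))  # still page 1 with a trailing parameter
```

Inside parse, a test such as page_number(response.url) == 1 could then replace the endswith check, and with_page could build the follow-up URLs from response.url instead of a hard-coded template.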
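Additional suggestions
- Avoid absolute XPath expressions such as /html/body/div[3]/...; they are brittle and break as soon as the page structure changes slightly. Prefer a relative path anchored on a stable attribute, for example //ul[contains(@class, "pagination")]/li[last()-1]/a/text() (adjust it to the actual page structure).
- Add the actual product extraction logic to parse_page instead of returning only the page URL.
- Make sure the Playwright wait logic is correct: the selector passed to wait_for_selector should match an element that appears only once the page has finished loading.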
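The difference between a positional path and an attribute-anchored one can be seen on a toy document. This is a sketch under loud assumptions: the markup and the pagination class name are invented rather than taken from Daraz, and the standard library's ElementTree is used only so the example runs without Scrapy. ElementTree supports a subset of XPath without contains(), so the class is matched exactly here; in the spider the equivalent expression would go through response.xpath.

```python
import xml.etree.ElementTree as ET

# Invented pagination markup: Prev, 1, 2, 7, Next
PAGINATION = (
    '<ul class="pagination">'
    "<li><a>Prev</a></li><li><a>1</a></li><li><a>2</a></li>"
    "<li><a>7</a></li><li><a>Next</a></li>"
    "</ul>"
)
# The same widget before and after a hypothetical redesign adds wrappers
LAYOUT_A = f"<body><div>{PAGINATION}</div></body>"
LAYOUT_B = f"<body><div><section><div>{PAGINATION}</div></section></div></body>"


def last_page_anchored(markup):
    """Locate the list by its class, wherever it sits in the tree."""
    root = ET.fromstring(markup)
    link = root.find(".//ul[@class='pagination']/li[last()-1]/a")
    return int(link.text)


def last_page_positional(markup):
    """Walk a fixed path from the root; returns None when the path breaks."""
    root = ET.fromstring(markup)
    link = root.find("./div/ul/li[last()-1]/a")
    return int(link.text) if link is not None else None


print(last_page_anchored(LAYOUT_A), last_page_anchored(LAYOUT_B))
print(last_page_positional(LAYOUT_A), last_page_positional(LAYOUT_B))
```

The anchored query finds the second-to-last list item in both layouts, while the positional path stops matching once an extra wrapper appears.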
Note: content sourced from Stack Exchange; question asked by Sadik.




