Why use .get() with Scrapy CSS selectors instead of using the selector directly or calling .extract()?

阿华AIGC实验室

2026-5-8

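Scrapy: confused about the difference between .get(), using the SelectorList directly, and .extract() when fetching the next-page link

Hi everyone! First of all, thanks to the community for the support. I'm new to Python and working through a Scrapy course, and I want to understand every detail of the code. I've read the official Scrapy documentation and still have one question: when writing a Scrapy Spider that fetches the next-page link, why do we use next_page = response.css('li.next a::attr(href)').get() instead of using response.css('li.next a::attr(href)') directly or calling .extract()?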

Full Spider code

import scrapy
from ..items import QuotetutorialItem

class QuoteSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        all_div_quotes = response.css('div.quote')
        for quotes in all_div_quotes:
            # Create a fresh item per quote so yielded items do not share state
            items = QuotetutorialItem()
            title = quotes.css('span.text::text').extract()
            author = quotes.css('.author::text').extract()
            tag = quotes.css('.tag::text').extract()
            items['title'] = title
            items['author'] = author
            items['tag'] = tag
            yield items

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

Happy to help clear this up! Let's take the pieces one at a time:

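What do you get from response.css('li.next a::attr(href)') on its own?
The call returns a SelectorList, which is Scrapy's list of Selector objects, not a ready-to-use URL string. Passing it straight to response.follow() raises a ValueError: follow() accepts a URL string, a Link object, or a single Selector, but not a SelectorList.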
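What about .extract()?
.extract() pulls out every match in the selector list and always returns a list; here that would be ['/page/2/']. A list is not something response.follow() can use as a URL, since it expects a single link, so you would have to index into the list yourself. The None check breaks too: on a page with no next-page link, .extract() returns an empty list, next_page is [], and if next_page is not None: is always true, so the guard never skips the request. A list is also simply the wrong shape when all you need is one link.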
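What is the advantage of .get()?
.get() (equivalent to the older .extract_first(); .get() is the newer, shorter spelling) returns the first match in the selector list, or None if nothing matched. That is exactly what we need here:
- When there is a next page, it returns the single string '/page/2/', which can be passed directly to response.follow();
- When there is no next page, it returns None, so if next_page is not None: skips the follow-up request and the crawl stops cleanly.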

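Also, the .extract() calls that pull out title, author, and tag in your code can be replaced by .getall(), which is equivalent, or by .get() when you want a single value. Each quote has exactly one title, for example, so .get() gives you a string rather than a one-element list, which is probably closer to what the Item field should hold.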

The question originally comes from Stack Exchange; asked by Andre Nevares.