How do I set a proxy for specific URLs/domains in a Scrapy spider?

Setting a proxy for specific domains in a Scrapy spider

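First, the core problem: your CoreSpider inherits from scrapy.Spider, but Rule and its process_request argument are specific to CrawlSpider. A plain Spider never reads the rules attribute, so the method is never called. That is why you see no print output and the proxy setting has no effect.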

Here are three workable solutions; pick whichever fits your needs:


Solution 1: Switch to CrawlSpider (keeps your original Rule setup)

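If you want to keep managing the crawl with Rule and process_request, change the spider to inherit from CrawlSpider. Note that CrawlSpider uses its own parse method to dispatch rule callbacks, so you must not override it; move your original parsing logic into a separate callback instead: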

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
import glob
import os
import re
# Replace with your actual module import
from your_module import Extractor, ContentHandler_copy  

class CoreSpider(CrawlSpider):
    name = "final"
    
    def __init__(self, *args, **kwargs):
        # Forward Scrapy's arguments (e.g. -a options) to the base class
        super().__init__(*args, **kwargs)
        self.start_urls = self.read_url()
    
    # Crawl rules; process_request is now actually called
    rules = (
        Rule(
            LinkExtractor(unique=True), 
            callback='parse_item', 
            follow=True, 
            process_request='process_request'
        ),
    )
    
    def read_url(self):
        """Read seed URLs from every *.list file in the seed directory."""
        url_list = []
        for filename in glob.glob(os.path.join("/root/Public/company_profiler/seed_list", '*.list')):
            with open(filename, "r") as f:
                for line in f:
                    url = line.strip()
                    if not url:
                        continue  # skip blank lines
                    if not url.startswith(("http://", "https://")):
                        url = "http://" + url
                    url_list.append(url)
        return url_list
    
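    # Since Scrapy 2.0 a Rule's process_request receives the request and the
    # response it was extracted from (not the spider). Only requests extracted
    # by the rule pass through here; the start_urls requests do not.
    def process_request(self, request, response):
        print("Request is : ", request.url)  # now printed as expected
        # Replace with the domain/URL pattern you want to match
        if 'xxx.com' in request.url:
            request.meta['proxy'] = 'https://159.8.18.178:8080'
        return request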
    
    # Original parse logic moved here so CrawlSpider's own parse method is not overridden
    def parse_item(self, response):
        print("URL is: ", response.url)
        print("User agent is : ", response.request.headers['User-Agent'])
        filename = '/root/Public/company_profiler/crawled_page/%s.html' % response.url
        article = Extractor(extractor='LargestContentExtractor', html=response.body).getText()
        print("Article is :", article)
        if len(article.split("\n")) < 5:
            print("Skipping to next url : ", article.split("\n"))
        else:
            print("Continue parsing: ", article.split("\n"))
        ContentHandler_copy.ContentHandler_copy.start(article, response.url)
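
The check 'xxx.com' in request.url is a substring test, so it also fires when xxx.com merely appears in a path or query string, or inside another hostname such as notxxx.com. Below is a minimal sketch of a stricter check that compares the hostname instead. host_matches is a hypothetical helper name, not part of Scrapy, and it relies only on the standard library:

```python
from urllib.parse import urlsplit


def host_matches(url, domain):
    """Return True if the URL's host is `domain` or one of its subdomains.

    Hypothetical helper for illustration; not a Scrapy API.
    """
    host = (urlsplit(url).hostname or "").lower()
    domain = domain.lower().lstrip(".")
    # Exact match, or a subdomain such as www.xxx.com
    return host == domain or host.endswith("." + domain)
```

Inside process_request the condition would then read: if host_matches(request.url, 'xxx.com'):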

Solution 2: Build requests manually in a plain Spider (no need to change the spider type)

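If you would rather not use CrawlSpider, wrap request creation in a helper method that applies the proxy logic in one place. Both the initial requests and the links extracted later go through it: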

import scrapy
import glob
import os
import re
from your_module import Extractor, ContentHandler_copy

class CoreSpider(scrapy.Spider):
    name = "final"
    
    def start_requests(self):
        url_list = self.read_url()
        for url in url_list:
            yield self._build_request(url)
    
    def read_url(self):
        """Read seed URLs from every *.list file in the seed directory."""
        url_list = []
        for filename in glob.glob(os.path.join("/root/Public/company_profiler/seed_list", '*.list')):
            with open(filename, "r") as f:
                for line in f:
                    url = line.strip()
                    if not url:
                        continue  # skip blank lines
                    if not url.startswith(("http://", "https://")):
                        url = "http://" + url
                    url_list.append(url)
        return url_list
    
    # Build every request here so the proxy logic lives in one place
    def _build_request(self, url):
        request = scrapy.Request(url, callback=self.parse)
        # Set the proxy for the target domain
        if 'xxx.com' in url:
            request.meta['proxy'] = 'https://159.8.18.178:8080'
        return request
    
    def parse(self, response):
        # Original parsing logic, unchanged
        print("URL is: ", response.url)
        print("User agent is : ", response.request.headers['User-Agent'])
        filename = '/root/Public/company_profiler/crawled_page/%s.html' % response.url
        article = Extractor(extractor='LargestContentExtractor', html=response.body).getText()
        print("Article is :", article)
        if len(article.split("\n")) < 5:
            print("Skipping to next url : ", article.split("\n"))
        else:
            print("Continue parsing: ", article.split("\n"))
        ContentHandler_copy.ContentHandler_copy.start(article, response.url)
        
        # Extract the page's links manually and reuse _build_request so the proxy is applied
        for href in response.css('a::attr(href)').getall():
            absolute_url = response.urljoin(href)
            yield self._build_request(absolute_url)
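
Extracting every href by hand also picks up links such as mailto: or javascript: that cannot be fetched over HTTP. One possible way to filter them before building requests is sketched below. crawlable_links is a hypothetical helper name, not a Scrapy API, and the sketch assumes only http and https links are wanted:

```python
from urllib.parse import urljoin, urlsplit


def crawlable_links(base_url, hrefs):
    """Yield absolute http(s) URLs built from raw href values.

    Hypothetical helper for illustration; links with any other scheme
    (mailto:, javascript:, tel:, ...) are skipped.
    """
    for href in hrefs:
        absolute = urljoin(base_url, href.strip())
        if urlsplit(absolute).scheme in ("http", "https"):
            yield absolute
```

In parse, the loop could then iterate over crawlable_links(response.url, response.css('a::attr(href)').getall()) and pass each result to self._build_request.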

Solution 3: Use a custom downloader middleware (the cleanest option, recommended)

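To decouple the proxy logic from the spider, write a custom downloader middleware. It intercepts every request globally and sets the proxy for the chosen domains, so the spider code does not need to change: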

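  1. Create middlewares.py in your project package (the directory that contains settings.py) if it does not exist yet, and add the following code: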
class ProxyMiddleware:
    def process_request(self, request, spider):
        # Match the domains that need a proxy
        if 'xxx.com' in request.url:
            request.meta['proxy'] = 'https://159.8.18.178:8080'
  2. Enable the middleware in settings.py:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'random_useragent.RandomUserAgentMiddleware': 320,
    # Replace with your project name, e.g. myproject.middlewares.ProxyMiddleware
    'your_project_name.middlewares.ProxyMiddleware': 350,
}

With this in place, your CoreSpider can stay exactly as it was; the middleware sets the proxy on every matching request automatically.
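
If more than one domain needs a proxy, the middleware can read a domain-to-proxy mapping from the project settings instead of hard-coding a single check. The following is a sketch under stated assumptions: PROXY_BY_DOMAIN is a hypothetical custom setting (not built into Scrapy), DomainProxyMiddleware is an illustrative name, and matching is done on the hostname and its parent domains rather than on the full URL string:

```python
from urllib.parse import urlsplit


class DomainProxyMiddleware:
    """Downloader middleware sketch: choose a proxy per domain from a dict."""

    def __init__(self, proxy_by_domain):
        # Normalise keys so lookups are case-insensitive
        self.proxy_by_domain = {d.lower(): p for d, p in proxy_by_domain.items()}

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_BY_DOMAIN is an assumed custom setting, e.g. in settings.py:
        # PROXY_BY_DOMAIN = {"xxx.com": "https://159.8.18.178:8080"}
        return cls(crawler.settings.getdict("PROXY_BY_DOMAIN"))

    def proxy_for(self, url):
        """Return the proxy for the URL's host or nearest parent domain, else None."""
        host = (urlsplit(url).hostname or "").lower()
        parts = host.split(".")
        # Try a.b.xxx.com, then b.xxx.com, then xxx.com, and so on
        for i in range(len(parts)):
            candidate = ".".join(parts[i:])
            if candidate in self.proxy_by_domain:
                return self.proxy_by_domain[candidate]
        return None

    def process_request(self, request, spider):
        proxy = self.proxy_for(request.url)
        # Leave requests alone if the spider already chose a proxy
        if proxy and "proxy" not in request.meta:
            request.meta["proxy"] = proxy
        return None  # let Scrapy continue handling the request
```

It would be registered in DOWNLOADER_MIDDLEWARES the same way as ProxyMiddleware above, with the class path adjusted to your project.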


This question comes from Stack Exchange; it was asked by Om Prakash.
