How do I set a proxy for specific URLs/domains in a Scrapy spider?
Setting a proxy for specific domains in a Scrapy spider
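First, the core problem: your CoreSpider inherits from scrapy.Spider, but rules and a Rule's process_request argument are CrawlSpider-only features. A plain Spider ignores them, so the method is never called, which is why you see no print output and the proxy is never set.
Here are three workable solutions; pick whichever fits your needs: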
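Solution 1: Switch to CrawlSpider (keeps your original Rule-based approach)
If you want to keep managing the crawl with Rule and process_request, change the spider to inherit from CrawlSpider. Note that CrawlSpider uses its own parse method to dispatch the rule callbacks, so you must not override it; move your parsing logic into a separate callback instead: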
    import glob
    import os

    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule

    # Replace with your actual module imports
    from your_module import Extractor, ContentHandler_copy


    class CoreSpider(CrawlSpider):
        name = "final"

        # Crawl rules; process_request is now actually called.
        # Rules only apply to links extracted from responses, not to start_urls.
        rules = (
            Rule(
                LinkExtractor(unique=True),
                callback='parse_item',
                follow=True,
                process_request='process_request',
            ),
        )

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.start_urls = self.read_url()

        def read_url(self):
            url_list = []
            for filename in glob.glob(os.path.join("/root/Public/company_profiler/seed_list", '*.list')):
                with open(filename, "r") as f:
                    for line in f:
                        url = line.strip()
                        if not url:
                            continue  # skip blank lines
                        if not url.startswith("http"):
                            url = "http://" + url
                        url_list.append(url)
            return url_list

        # Since Scrapy 2.0 a Rule's process_request receives (request, response)
        def process_request(self, request, response):
            print("Request is : ", request.url)  # this is printed now
            # Replace with the domain/URL pattern you want to match
            if 'xxx.com' in request.url:
                request.meta['proxy'] = 'https://159.8.18.178:8080'
            return request

        # The former parse logic lives here so CrawlSpider's own parse stays intact
        def parse_item(self, response):
            print("URL is: ", response.url)
            print("User agent is : ", response.request.headers['User-Agent'])
            filename = '/root/Public/company_profiler/crawled_page/%s.html' % response.url
            article = Extractor(extractor='LargestContentExtractor', html=response.body).getText()
            print("Article is :", article)
            if len(article.split("\n")) < 5:
                print("Skipping to next url : ", article.split("\n"))
            else:
                print("Continue parsing: ", article.split("\n"))
                ContentHandler_copy.ContentHandler_copy.start(article, response.url)
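Solution 2: Build requests by hand in a plain Spider (no need to change the spider type)
If you would rather not use CrawlSpider, wrap request creation in a helper method that applies the proxy logic in one place. Both the initial requests and the links you extract later can reuse it: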
    import glob
    import os

    import scrapy

    # Replace with your actual module imports
    from your_module import Extractor, ContentHandler_copy


    class CoreSpider(scrapy.Spider):
        name = "final"

        def start_requests(self):
            for url in self.read_url():
                yield self._build_request(url)

        def read_url(self):
            url_list = []
            for filename in glob.glob(os.path.join("/root/Public/company_profiler/seed_list", '*.list')):
                with open(filename, "r") as f:
                    for line in f:
                        url = line.strip()
                        if not url:
                            continue  # skip blank lines
                        if not url.startswith("http"):
                            url = "http://" + url
                        url_list.append(url)
            return url_list

        # Single place where requests are created, so the proxy logic is shared
        def _build_request(self, url):
            request = scrapy.Request(url, callback=self.parse)
            # Set the proxy only for the matching domain
            if 'xxx.com' in url:
                request.meta['proxy'] = 'https://159.8.18.178:8080'
            return request

        def parse(self, response):
            # The original parsing logic is unchanged
            print("URL is: ", response.url)
            print("User agent is : ", response.request.headers['User-Agent'])
            filename = '/root/Public/company_profiler/crawled_page/%s.html' % response.url
            article = Extractor(extractor='LargestContentExtractor', html=response.body).getText()
            print("Article is :", article)
            if len(article.split("\n")) < 5:
                print("Skipping to next url : ", article.split("\n"))
            else:
                print("Continue parsing: ", article.split("\n"))
                ContentHandler_copy.ContentHandler_copy.start(article, response.url)

            # Extract links by hand and reuse _build_request so the proxy is applied
            for href in response.css('a::attr(href)').getall():
                absolute_url = response.urljoin(href)
                yield self._build_request(absolute_url)
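One caveat about the substring check `'xxx.com' in url` used in these examples: it also matches hosts such as `notxxx.com`, and any URL that merely mentions `xxx.com` in its path or query string. A minimal sketch of a stricter check, comparing the parsed hostname instead, could look like this. The names `needs_proxy` and `PROXIED_DOMAINS` are illustrative assumptions, not part of Scrapy or of the original code.

```python
from urllib.parse import urlparse

# Hypothetical placeholder: replace with the domains you actually want to proxy.
PROXIED_DOMAINS = ("xxx.com",)


def needs_proxy(url, domains=PROXIED_DOMAINS):
    """Return True if the URL's host equals one of `domains` or is a subdomain of one."""
    # hostname is None for URLs without a scheme/netloc, so fall back to ""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in domains)
```

In the spider this would replace the substring test, for example `if needs_proxy(url): request.meta['proxy'] = ...`.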
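Solution 3: Use a custom downloader middleware (the cleanest option, recommended)
To decouple the proxy logic from the spider's business logic, write a custom downloader middleware. It intercepts every request globally and sets a proxy for specific domains, with no change to the spider code: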
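- Create middlewares.py in your project package (the directory that contains settings.py) if it does not exist yet, and add the following code: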
    class ProxyMiddleware:
        def process_request(self, request, spider):
            # Match the domain that needs a proxy
            if 'xxx.com' in request.url:
                request.meta['proxy'] = 'https://159.8.18.178:8080'
- Enable the middleware in settings.py:
    DOWNLOADER_MIDDLEWARES = {
        'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
        'random_useragent.RandomUserAgentMiddleware': 320,
        # Replace with your project name, e.g. myproject.middlewares.ProxyMiddleware
        'your_project_name.middlewares.ProxyMiddleware': 350,
    }
With this in place, your CoreSpider can stay exactly as it was; the middleware sets the proxy on every matching request automatically.
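If more than one domain needs a proxy, the hard-coded check in the middleware can be moved into the settings. The following is only a sketch under stated assumptions: `PROXY_BY_DOMAIN` is a made-up setting name (Scrapy does not define it), and `DomainProxyMiddleware` is a hypothetical class name. It relies on the standard `from_crawler` hook and on `settings.getdict`, and matches on the parsed hostname rather than on a substring of the URL.

```python
from urllib.parse import urlparse


class DomainProxyMiddleware:
    """Downloader middleware that picks a proxy per domain from a settings dict."""

    def __init__(self, proxy_by_domain):
        # Mapping of domain -> proxy URL
        self.proxy_by_domain = proxy_by_domain

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_BY_DOMAIN is an assumed, project-specific setting, e.g.
        # PROXY_BY_DOMAIN = {"xxx.com": "https://159.8.18.178:8080"}
        return cls(crawler.settings.getdict("PROXY_BY_DOMAIN"))

    def process_request(self, request, spider):
        host = (urlparse(request.url).hostname or "").lower()
        for domain, proxy in self.proxy_by_domain.items():
            # Match the domain itself and any of its subdomains
            if host == domain or host.endswith("." + domain):
                request.meta["proxy"] = proxy
                break
        # Returning None lets Scrapy continue processing the request normally
        return None
```

It would be registered in DOWNLOADER_MIDDLEWARES in the same way as ProxyMiddleware, with the domain-to-proxy mapping kept in settings.py.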
The question originated on Stack Exchange; it was asked by Om Prakash.




