Scrapy Generic Crawler Development and Zyte Deployment Question: Batch-Scraping External Links from Every Site in a List of Websites
Solution for Your Scrapy Multi-Site External Link Crawler
Let's fix and enhance your code to meet all your requirements. Below is the complete implementation with explanations for each feature, plus deployment tips for Zyte.
Complete Working Code
First, install required dependencies:
    pip install scrapy tldextract scrapy-zyte-smartproxy
Here's the revised spider:
    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.crawler import CrawlerProcess
    import tldextract


    class ExternalLinkSpider(scrapy.Spider):
        name = 'external_link_crawler'

        # Configuration
        SITES_FILE = 'sites_list.txt'  # One start URL per line (e.g., https://example.com)
        BLOCKED_DOMAINS = {'facebook.com', 'instagram.com'}  # Add domains to filter
        GLOBAL_DUPE_ENABLED = False  # Set to True for crawl-wide deduplication

        custom_settings = {
            'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36',
            'CONCURRENT_REQUESTS': 2,
            'AUTOTHROTTLE_ENABLED': True,  # Note: the setting name has no underscore after AUTO
            'FEEDS': {
                'external_links.csv': {
                    'format': 'csv',
                    'fields': ['source_domain', 'page_url', 'external_link'],
                    'overwrite': True
                }
            },
            # For Zyte: Enable Smart Proxy Manager if needed
            'DOWNLOADER_MIDDLEWARES': {
                'scrapy_zyte_smartproxy.ZyteSmartProxyMiddleware': 610,
            },
            'ZYTE_SMARTPROXY_ENABLED': True,  # The middleware stays inactive without this
            'ZYTE_SMARTPROXY_APIKEY': 'YOUR_ZYTE_API_KEY',  # Replace with your key
        }

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.link_extractor = LinkExtractor(unique=True)
            self.start_urls = self._load_sites_from_file()
            # For crawl-wide deduplication
            self.global_seen_externals = set()

        def _load_sites_from_file(self):
            """Load start URLs from the specified text file"""
            try:
                with open(self.SITES_FILE, 'r') as f:
                    return [line.strip() for line in f if line.strip()]
            except FileNotFoundError:
                self.logger.error(f"File {self.SITES_FILE} not found!")
                return []

        def _get_registered_domain(self, url):
            """Extract the registered domain (ignores subdomains)"""
            extracted = tldextract.extract(url)
            return f"{extracted.domain}.{extracted.suffix}"

        def parse(self, response):
            # Get the source domain for the current site being crawled
            source_domain = self._get_registered_domain(response.url)
            page_url = response.url

            # Track seen external links for this page (page-level deduplication)
            page_seen_externals = set()

            # Extract all links from the page
            for link in self.link_extractor.extract_links(response):
                link_domain = self._get_registered_domain(link.url)

                # Skip blocked domains
                if link_domain in self.BLOCKED_DOMAINS:
                    continue

                # Check if it's an external link (different registered domain)
                if link_domain != source_domain:
                    # Page-level deduplication
                    if link.url not in page_seen_externals:
                        page_seen_externals.add(link.url)

                        # Optional crawl-wide deduplication
                        if not self.GLOBAL_DUPE_ENABLED or link.url not in self.global_seen_externals:
                            if self.GLOBAL_DUPE_ENABLED:
                                self.global_seen_externals.add(link.url)

                            # Yield the item for CSV export
                            yield {
                                'source_domain': source_domain,
                                'page_url': page_url,
                                'external_link': link.url
                            }
                else:
                    # Follow internal links to crawl deeper
                    yield response.follow(link.url, callback=self.parse)


    if __name__ == "__main__":
        process = CrawlerProcess()
        process.crawl(ExternalLinkSpider)
        process.start()
Key Feature Explanations
Let's map this code to your requirements:
Batch Crawl from Site List
- The `_load_sites_from_file` method reads URLs from `sites_list.txt` (one per line) to populate `start_urls`. No more hardcoding!
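Real site lists tend to contain blank lines, duplicates, and bare domains without a scheme. As a rough sketch of how the loading step could be hardened, the hypothetical helper below (`clean_start_urls` is not part of the spider above) takes the raw lines and returns usable start URLs. Treating a bare domain as HTTPS and `#` as a comment marker are assumptions, not something the original code does.

```python
from urllib.parse import urlparse


def clean_start_urls(lines):
    """Return cleaned, de-duplicated start URLs from raw text lines."""
    urls, seen = [], set()
    for raw in lines:
        line = raw.strip()
        # Skip blank lines and comment lines (comment support is an assumption).
        if not line or line.startswith("#"):
            continue
        # Bare domain such as "example.org": assume HTTPS.
        if "://" not in line:
            line = "https://" + line
        # Drop anything that still has no host part.
        if not urlparse(line).netloc:
            continue
        if line not in seen:
            seen.add(line)
            urls.append(line)
    return urls
```

Inside the spider this could be called as `clean_start_urls(f)` on the open file object in place of the list comprehension.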
Internal vs External Link Detection
- Uses `tldextract` to get the registered domain (e.g., `blog.example.com` and `example.com` both map to `example.com`). Links with a different registered domain are marked as external.
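To see why a suffix-aware library is needed rather than "take the last two labels", here is a deliberately simplified, standard-library-only illustration. `naive_registered_domain`, `is_external`, and the three-entry `MULTI_LABEL_SUFFIXES` set are hypothetical stand-ins; `tldextract` does the same job against the full Public Suffix List and should be preferred in the actual spider.

```python
from urllib.parse import urlparse

# Tiny illustrative sample; the real Public Suffix List has thousands of entries.
MULTI_LABEL_SUFFIXES = {"co.uk", "com.au", "co.jp"}


def naive_registered_domain(url):
    """Approximate the registered domain without tldextract (illustration only)."""
    host = (urlparse(url).hostname or "").lower()
    labels = host.split(".")
    # Keep three labels when the last two form a known multi-label suffix.
    keep = 3 if ".".join(labels[-2:]) in MULTI_LABEL_SUFFIXES else 2
    return ".".join(labels[-keep:])


def is_external(page_url, link_url):
    """A link is external when its registered domain differs from the page's."""
    return naive_registered_domain(page_url) != naive_registered_domain(link_url)
```

Without the suffix check, `shop.example.co.uk` would collapse to `co.uk`, and every `.co.uk` site would look internal to every other one.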
CSV Export
- Uses Scrapy's built-in `FEEDS` setting to automatically export items to `external_links.csv` with the exact fields you need. Scrapy's feed exporter handles the writing for every item the spider yields, which is more robust under concurrent requests than manual file writing.
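If you would rather keep one CSV per run than overwrite the same file, Scrapy's feed URI placeholders `%(name)s` (spider name) and `%(time)s` (run timestamp) can go in the output path. The fragment below is only a sketch: the `exports/` directory and the `FEED_SETTINGS` variable name are assumptions, and the inner `FEEDS` dict is what would replace the one in `custom_settings`.

```python
# Sketch of an alternative FEEDS configuration: one timestamped CSV per run.
FEED_SETTINGS = {
    "FEEDS": {
        # Scrapy substitutes %(name)s and %(time)s when the feed is opened.
        "exports/%(name)s_%(time)s.csv": {
            "format": "csv",
            "encoding": "utf8",
            "fields": ["source_domain", "page_url", "external_link"],
        },
    },
}
```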
Deduplication
- Page-level: Uses a `page_seen_externals` set to avoid duplicate links on the same page.
- Site-wide: Toggle `GLOBAL_DUPE_ENABLED` to `True` to use `global_seen_externals` and avoid duplicates across the entire crawl.
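The two deduplication levels are easier to reason about in isolation. The following standalone sketch mirrors the logic in `parse`; the function name `new_external_links` and the optional `global_seen` parameter are illustrative, not part of the spider.

```python
def new_external_links(links, global_seen=None):
    """Yield each link once per page, and once per crawl if global_seen is given."""
    page_seen = set()
    for url in links:
        # Page-level: drop repeats within this one page.
        if url in page_seen:
            continue
        page_seen.add(url)
        # Crawl-level: only applied when a shared set is passed in.
        if global_seen is not None:
            if url in global_seen:
                continue
            global_seen.add(url)
        yield url
```

Passing the same set for every page gives crawl-wide deduplication; passing nothing gives page-level deduplication only. Note that a crawl-wide set grows with the number of unique links, so memory use scales with crawl size.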
Filter Specific Domains
- Add domains to the `BLOCKED_DOMAINS` set (e.g., `facebook.com`) to skip those external links entirely.
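Because the spider compares registered domains, entries in the blocklist should be lowercase registered domains without a `www.` prefix. If you want to supply the list at run time, for example through a Scrapy spider argument such as `-a blocked=...` (the argument name is hypothetical), a small normalizer along these lines might help:

```python
def parse_blocked_domains(raw):
    """Turn a comma-separated string into a normalized set of domains."""
    domains = set()
    for part in raw.split(","):
        domain = part.strip().lower()
        # Entries are compared against registered domains, so drop "www.".
        if domain.startswith("www."):
            domain = domain[4:]
        if domain:
            domains.add(domain)
    return domains
```

The result can be merged into the class-level set, e.g. `self.BLOCKED_DOMAINS | parse_blocked_domains(blocked)` inside `__init__`.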
Zyte Platform Deployment
Here are the key points for running this on Zyte:
- Use Zyte Smart Proxy Manager: The code includes the middleware for this; replace `YOUR_ZYTE_API_KEY` with your actual key to handle anti-bot protections.
- Project Structure: Zyte Scrapy Cloud expects a standard Scrapy project structure. To convert this script to a project:
  - Run `scrapy startproject external_link_crawler`.
  - Save the spider code above as `external_link_crawler/spiders/external_link_spider.py` (the `if __name__ == "__main__":` block is not needed inside a project).
  - Add `tldextract` and `scrapy-zyte-smartproxy` to `requirements.txt`, and reference that file from `scrapinghub.yml`.
  - Make sure `sites_list.txt` is shipped with the project (for example as package data), since a relative path that works locally may not resolve on Scrapy Cloud.
- Environment Variables: Store your Zyte API key as an environment variable in Zyte instead of hardcoding it.
- Concurrency Settings: Adjust `CONCURRENT_REQUESTS` based on your Zyte plan limits; start low and increase if allowed.
- Scheduling: Use Zyte's scheduling features to run crawls periodically or on-demand.
- Logging: Enable Scrapy logging to monitor crawls in Zyte's dashboard.
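One possible way to keep the API key out of the source is to read it from the environment and enable the proxy only when a key is present. This is a sketch, not an official recipe: `ZYTE_SMARTPROXY_ENABLED` and `ZYTE_SMARTPROXY_APIKEY` are the setting names used by `scrapy-zyte-smartproxy`, while the helper `smartproxy_settings` and the choice to reuse `ZYTE_SMARTPROXY_APIKEY` as the environment variable name are assumptions.

```python
import os


def smartproxy_settings(environ=os.environ):
    """Build Smart Proxy settings, enabling the proxy only when a key is present."""
    # Environment variable name is an assumption; pick whatever you configure.
    api_key = environ.get("ZYTE_SMARTPROXY_APIKEY", "")
    return {
        "ZYTE_SMARTPROXY_ENABLED": bool(api_key),
        "ZYTE_SMARTPROXY_APIKEY": api_key,
    }
```

In a project's `settings.py` the returned values could be applied with `globals().update(smartproxy_settings())`, so local runs without a key simply bypass the proxy.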
Fixes to Your Original Code
- Removed manual file writing (unreliable for concurrent crawls) in favor of Scrapy's `FEEDS`.
- Fixed hardcoded domain logic to handle subdomains correctly.
- Added site list loading from a file.
- Implemented proper deduplication and domain filtering.
- Added Zyte-specific middleware for proxy support.
Content of this question originates from Stack Exchange, asked by Alban




