
Scrapy generic crawler development and Zyte deployment question: batch-crawling the external links of every site in a website list

Let's fix and enhance your code to meet all your requirements. Below is the complete implementation with explanations for each feature, plus deployment tips for Zyte.

Complete Working Code

First, install required dependencies:

pip install scrapy tldextract scrapy-zyte-smartproxy

Here's the revised spider:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.crawler import CrawlerProcess
import tldextract

class ExternalLinkSpider(scrapy.Spider):
    name = 'external_link_crawler'
    
    # Configuration
    SITES_FILE = 'sites_list.txt'  # One full URL per line (e.g., https://example.com)
    BLOCKED_DOMAINS = {'facebook.com', 'instagram.com'}  # Registered domains to filter out
    GLOBAL_DUPE_ENABLED = False  # Set to True to deduplicate across the whole crawl

    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36',
        'CONCURRENT_REQUESTS': 2,
        'AUTOTHROTTLE_ENABLED': True,
        'FEEDS': {
            'external_links.csv': {
                'format': 'csv',
                'fields': ['source_domain', 'page_url', 'external_link'],
                'overwrite': True
            }
        },
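        # For Zyte: Smart Proxy Manager (requires the scrapy-zyte-smartproxy package)
        'DOWNLOADER_MIDDLEWARES': {
            'scrapy_zyte_smartproxy.ZyteSmartProxyMiddleware': 610,
        },
        'ZYTE_SMARTPROXY_ENABLED': True,  # The middleware stays inactive unless this is set
        'ZYTE_SMARTPROXY_APIKEY': 'YOUR_ZYTE_API_KEY',  # Replace with your key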
    }

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.link_extractor = LinkExtractor(unique=True)
        self.start_urls = self._load_sites_from_file()
        # For crawl-wide deduplication (only used when GLOBAL_DUPE_ENABLED is True)
        self.global_seen_externals = set()

    def _load_sites_from_file(self):
        """Load start URLs from the specified text file"""
        try:
            with open(self.SITES_FILE, 'r') as f:
                return [line.strip() for line in f if line.strip()]
        except FileNotFoundError:
            self.logger.error(f"File {self.SITES_FILE} not found!")
            return []

    def _get_registered_domain(self, url):
        """Extract the registered domain (ignores subdomains)"""
        extracted = tldextract.extract(url)
        return f"{extracted.domain}.{extracted.suffix}"

    def parse(self, response):
        # Get the source domain for the current site being crawled
        source_domain = self._get_registered_domain(response.url)
        page_url = response.url

        # Track seen external links for this page (page-level deduplication)
        page_seen_externals = set()

        # Extract all links from the page
        for link in self.link_extractor.extract_links(response):
            link_domain = self._get_registered_domain(link.url)
            
            # Skip blocked domains
            if link_domain in self.BLOCKED_DOMAINS:
                continue

            # Check if it's an external link (different registered domain)
            if link_domain != source_domain:
                # Page-level deduplication
                if link.url not in page_seen_externals:
                    page_seen_externals.add(link.url)
                    # Optional crawl-wide deduplication
                    if not self.GLOBAL_DUPE_ENABLED or link.url not in self.global_seen_externals:
                        if self.GLOBAL_DUPE_ENABLED:
                            self.global_seen_externals.add(link.url)
                        # Yield the item for CSV export
                        yield {
                            'source_domain': source_domain,
                            'page_url': page_url,
                            'external_link': link.url
                        }
            else:
                # Follow internal links to crawl deeper
                yield response.follow(link.url, callback=self.parse)

if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(ExternalLinkSpider)
    process.start()
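
The _load_sites_from_file helper above assumes every line of sites_list.txt is already a full URL. As a minimal sketch of a more forgiving loader, assuming you also want to accept bare domains, skip lines starting with #, and drop repeated entries, something like the following could take its place. The name load_site_urls and the https default are illustrative choices, not part of the original code.

```python
from pathlib import Path


def load_site_urls(path, default_scheme="https"):
    """Return a de-duplicated list of start URLs read from a text file.

    Hypothetical helper: accepts full URLs or bare domains, one per line.
    """
    urls = []
    seen = set()
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines starting with '#'
        if not line or line.startswith("#"):
            continue
        # Scrapy rejects request URLs without a scheme, so add one to bare domains
        if "://" not in line:
            line = f"{default_scheme}://{line}"
        # Keep the first occurrence only, preserving file order
        if line not in seen:
            seen.add(line)
            urls.append(line)
    return urls
```

In the spider, self.start_urls = load_site_urls(self.SITES_FILE) would stand in for the original call. Note that this version raises FileNotFoundError for a missing file instead of logging it and returning an empty list.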

Key Feature Explanations

Let's map this code to your requirements:

  1. Batch Crawl from Site List

    • The _load_sites_from_file method reads URLs from sites_list.txt (one per line) to populate start_urls. No more hardcoding!
  2. Internal vs External Link Detection

    • Uses tldextract to get the registered domain (e.g., blog.example.com and example.com both map to example.com). Links with a different registered domain are marked as external.
  3. CSV Export

    • Uses Scrapy's built-in FEEDS setting to automatically export items to external_links.csv with the exact fields you need. Scrapy's feed exporter does the writing for every yielded item, which is more reliable than opening and writing a file by hand from callbacks that run concurrently.
  4. Deduplication

    • Page-level: Uses a page_seen_externals set to avoid duplicate links on the same page.
    • Crawl-wide: Set GLOBAL_DUPE_ENABLED to True to use global_seen_externals and avoid duplicates across the entire crawl, including across different sites in the list.
  5. Filter Specific Domains

    • Add domains to the BLOCKED_DOMAINS set (e.g., facebook.com) to skip those external links entirely.
  6. Zyte Platform Deployment
    Here are the key points for running this on Zyte:

    • Use Zyte Smart Proxy Manager: The code includes the middleware for this. Replace YOUR_ZYTE_API_KEY with your actual key to handle anti-bot protections.
    • Project Structure: Zyte Scrapy Cloud expects a standard Scrapy project structure. To convert this script to a project:
      1. Run scrapy startproject external_link_crawler
      2. Save the code above as spiders/external_link_spider.py inside the generated project package (startproject does not create a spider file for you).
      3. Add scrapy-zyte-smartproxy and tldextract to requirements.txt.
    • Environment Variables: Avoid hardcoding your Zyte API key in the spider. Supply it from an environment variable or a project-level setting instead.
    • Concurrency Settings: Adjust CONCURRENT_REQUESTS based on your Zyte plan limits. Start low and increase if allowed.
    • Scheduling: Use Zyte's scheduling features to run crawls periodically or on-demand.
    • Logging: Enable Scrapy logging to monitor crawls in Zyte's dashboard.
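
To illustrate the environment-variable point, here is a small sketch that builds the proxy-related settings only when a key is present. The environment variable name ZYTE_SMARTPROXY_APIKEY and the helper name zyte_proxy_settings are assumptions made for this example. The middleware path and the API key setting are the ones used in the spider above, and ZYTE_SMARTPROXY_ENABLED is the scrapy-zyte-smartproxy setting that switches the middleware on.

```python
import os

PROXY_MIDDLEWARE = "scrapy_zyte_smartproxy.ZyteSmartProxyMiddleware"


def zyte_proxy_settings(environ=os.environ):
    """Return Scrapy settings for the proxy, or {} if no key is configured."""
    # Assumed variable name; use whatever name your deployment defines
    api_key = environ.get("ZYTE_SMARTPROXY_APIKEY", "").strip()
    if not api_key:
        # No key available: leave the middleware out so local runs still work
        return {}
    return {
        "DOWNLOADER_MIDDLEWARES": {PROXY_MIDDLEWARE: 610},
        "ZYTE_SMARTPROXY_ENABLED": True,
        "ZYTE_SMARTPROXY_APIKEY": api_key,
    }
```

The returned dictionary could be merged into the spider's settings, for example custom_settings = {**base_settings, **zyte_proxy_settings()}, where base_settings is a hypothetical dict holding the other entries. This keeps the key out of version control.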

Fixes to Your Original Code

  • Removed manual file writing (unreliable for concurrent crawls) in favor of Scrapy's FEEDS.
  • Fixed hardcoded domain logic to handle subdomains correctly.
  • Added site list loading from a file.
  • Implemented proper deduplication and domain filtering.
  • Added Zyte-specific middleware for proxy support.
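
The deduplication and filtering rules can also be exercised outside Scrapy, which makes them easier to check. The sketch below restates them as a plain class. ExternalLinkFilter and naive_registered_domain are hypothetical names, and the domain helper is a deliberately crude stand-in for tldextract: it keeps the last two host labels, so it gets suffixes such as co.uk wrong and should not replace tldextract in the real spider.

```python
from urllib.parse import urlsplit


def naive_registered_domain(url):
    """Crude stand-in for tldextract: keep the last two labels of the host."""
    host = (urlsplit(url).hostname or "").lower()
    return ".".join(host.split(".")[-2:])


class ExternalLinkFilter:
    """Decide which links on a page should be reported as external."""

    def __init__(self, blocked_domains=(), crawl_wide_dedupe=False,
                 get_domain=naive_registered_domain):
        self.blocked_domains = set(blocked_domains)
        self.crawl_wide_dedupe = crawl_wide_dedupe
        self.get_domain = get_domain
        self.seen_in_crawl = set()  # shared across pages and sites

    def external_links(self, page_url, link_urls):
        """Return the external links for one page, in order, without repeats."""
        source_domain = self.get_domain(page_url)
        seen_on_page = set()
        kept = []
        for url in link_urls:
            domain = self.get_domain(url)
            if domain == source_domain or domain in self.blocked_domains:
                continue  # internal link or filtered domain
            if url in seen_on_page:
                continue  # page-level duplicate
            seen_on_page.add(url)
            if self.crawl_wide_dedupe:
                if url in self.seen_in_crawl:
                    continue  # already reported from an earlier page
                self.seen_in_crawl.add(url)
            kept.append(url)
        return kept
```

Passing get_domain=... lets you plug in a tldextract-based function, such as the spider's _get_registered_domain, while keeping the same filtering logic.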

Content of this question originates from Stack Exchange, asked by Alban