How do you configure URL deny rules in a Scrapy CrawlSpider?

阿华AIGC实验室

2026-4-30

Fixing Scrapy LinkExtractor to Exclude /products, /collections Paths

Got it, let's get your Scrapy spider to stop following those unwanted /products and /collections URLs. The issue with your current setup is that nothing tells the LinkExtractor to reject those paths. Right now it extracts every link on your allowed domains, which is why those URLs are still being processed.

Here's how to fix it, step by step:

1. Add the `deny` Parameter to LinkExtractor

The LinkExtractor has a built-in deny parameter that lets you specify regex patterns for URLs you want to exclude. Update your rules like this:

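from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

# Class attribute on your CrawlSpider subclass
rules = (
    Rule(
        LinkExtractor(
            allow_domains=allowed_domains,
            # Regexes searched against each absolute URL; any match rejects the link
            deny=[r'/products/', r'/collections/'],
        ),
        callback='parse_page',
        process_links='process_links',  # Spider method called with the extracted links
        follow=True,
    ),
)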

A quick note on the regex:

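r'/products/' is searched for anywhere in the absolute URL, so it catches things like https://example.com/products/123 and https://example.com/shop/products/456. It does not catch a URL that ends in /products with no trailing slash; if you need that too, a pattern such as r'/products(?:[/?#]|$)' covers it.
If you need stricter matching (e.g., only paths that start with /products), keep in mind that the pattern is tested against the full URL, scheme and host included, so r'^/products/' would never match. Anchor on the scheme and host instead, for example r'^https?://[^/]+/products/'. Adjust based on your specific needs.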
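Before running a full crawl, it can help to check the patterns against a few sample URLs. The sketch below only imitates the matching with Python's re module, on the basis that Scrapy's documentation describes allow/deny as regexes applied to absolute URLs. The helper name is_denied and the example.com URLs are made up for illustration and are not part of Scrapy.

```python
import re

# Same patterns as in the deny list above.
DENY_PATTERNS = [re.compile(p) for p in (r'/products/', r'/collections/')]


def is_denied(url, patterns=DENY_PATTERNS):
    """Return True if any pattern is found anywhere in the absolute URL."""
    return any(p.search(url) for p in patterns)


# Hypothetical URLs, for illustration only.
print(is_denied('https://example.com/products/123'))       # True
print(is_denied('https://example.com/shop/products/456'))  # True
print(is_denied('https://example.com/products'))           # False: no trailing slash
print(is_denied('https://example.com/about'))              # False

# A path-anchored pattern never matches, because the URL starts with the scheme.
print(bool(re.search(r'^/products/', 'https://example.com/products/123')))  # False

# Anchoring on scheme and host limits the match to top-level /products/ paths.
top_level = re.compile(r'^https?://[^/]+/products/')
print(bool(top_level.search('https://example.com/products/123')))       # True
print(bool(top_level.search('https://example.com/shop/products/456')))  # False
```

If a URL you expected to be rejected prints False here, tighten or loosen the pattern before touching the spider.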

2. Check Your `process_links` Function

If you still see those unwanted URLs being processed after adding the deny rules, double-check your process_links function. Scrapy calls it with the list of links the LinkExtractor has already extracted from each response, and whatever list it returns is what gets followed. That means it can modify links or add new ones and reintroduce the URLs you tried to exclude.

Make sure process_links isn't adding back links that match the /products or /collections patterns. If you're filtering links in that function, it should complement the deny rules instead of overriding them.
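As a rough sketch of a process_links that backs up the deny rules rather than undoing them, the filter below drops any link whose URL hits a denied path. The names drop_denied_links, DENIED and FakeLink are hypothetical; FakeLink is only a stand-in for Scrapy's Link objects (which expose a url attribute) so the example runs without Scrapy installed. The pattern here is a slightly broader assumption than the deny list above, since it also rejects a bare /products or /collections with no trailing slash.

```python
import re
from collections import namedtuple

# Rejects /products or /collections when followed by '/', '?', '#', or end of URL.
DENIED = re.compile(r'/(?:products|collections)(?:[/?#]|$)')


def drop_denied_links(links):
    """Keep only links whose URL does not hit a denied path."""
    return [link for link in links if not DENIED.search(link.url)]


# In the spider this would be wired up roughly as:
#     def process_links(self, links):
#         return drop_denied_links(links)

# Stand-in for Scrapy's Link objects; only the .url attribute is used.
FakeLink = namedtuple('FakeLink', 'url')

sample = [
    FakeLink('https://example.com/about'),
    FakeLink('https://example.com/products'),
    FakeLink('https://example.com/collections/summer?page=2'),
    FakeLink('https://example.com/blog/products-we-love'),
]

kept = drop_denied_links(sample)
print([link.url for link in kept])
```

Only the /about and /blog/products-we-love links survive in this sample. Filtering in both places is redundant when the deny list is already correct, but it makes the intent explicit if process_links also rewrites URLs.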