How to fix 403 responses on sub-URLs when scraping bom.gov.au with Scrapy, after the main URL's 403 has already been resolved
Nice job fixing the main URL's 403 by adding a User-Agent header! The problem with your sub-URLs is that those follow-up requests aren't carrying that same header—let's fix that with a couple of straightforward solutions:
Option 1: Reuse Headers Explicitly Across All Requests
The issue is that response.follow doesn't automatically inherit headers from the parent request unless you explicitly pass them. Storing your headers as a class attribute makes it easy to reuse them everywhere:
import scrapy
from scrapy import Request
from scrapy.crawler import CrawlerProcess


class ClimateUpdateSpider(scrapy.Spider):
    name = 'climateupdate'
    start_urls = ['http://www.bom.gov.au/climate/updates/']

    # Store headers as a class attribute for easy reuse
    custom_headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0'
    }

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url, headers=self.custom_headers)

    def parse(self, response):
        # Fixed XPath to scrape ALL article links (not just the first one)
        for link in response.xpath('//*[@id="content"]/ul/li/a/@href'):
            # Pass headers to follow() so sub-requests use them
            yield response.follow(url=link.get(), callback=self.parse_item,
                                  headers=self.custom_headers)

    def parse_item(self, response):
        yield {
            'date': response.xpath('//*[@id="updates"]/p[1]/time/text()').get(),
            'title': response.xpath('//*[@id="updates"]/div[1]/h1/text()').get(),
            'text': ''.join([
                x.get().strip()
                for x in response.xpath('//*[@class="key-points box-notice bg-grey"]//p//text()')
            ]),
        }


if __name__ == '__main__':
    process = CrawlerProcess()
    # Fixed typo: reference your spider class correctly
    process.crawl(ClimateUpdateSpider)
    process.start()
A quick note: I adjusted your XPath in parse to grab all article links instead of just the first list item—you probably want to scrape more than one climate update! I also fixed a typo in your process.crawl() call (you had weeklymining instead of your spider class name).
Option 2: Set a Default User-Agent in Scrapy Settings (More Elegant)
For a cleaner approach, configure Scrapy to use your custom User-Agent for every request automatically. This way you don't have to pass headers manually for each request:
import scrapy
from scrapy.crawler import CrawlerProcess


class ClimateUpdateSpider(scrapy.Spider):
    name = 'climateupdate'
    start_urls = ['http://www.bom.gov.au/climate/updates/']

    def parse(self, response):
        for link in response.xpath('//*[@id="content"]/ul/li/a/@href'):
            yield response.follow(url=link.get(), callback=self.parse_item)

    def parse_item(self, response):
        yield {
            'date': response.xpath('//*[@id="updates"]/p[1]/time/text()').get(),
            'title': response.xpath('//*[@id="updates"]/div[1]/h1/text()').get(),
            'text': ''.join([
                x.get().strip()
                for x in response.xpath('//*[@class="key-points box-notice bg-grey"]//p//text()')
            ]),
        }


if __name__ == '__main__':
    process = CrawlerProcess(settings={
        'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0',
        # Optional: enable cookies if the site requires them for validation
        'COOKIES_ENABLED': True,
    })
    process.crawl(ClimateUpdateSpider)
    process.start()
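A related variant: the same setting can also live on the spider itself via the custom_settings class attribute, which is handy if you later move this into a Scrapy project and run it with scrapy crawl. A minimal sketch of that version (same class and URLs as above; parse_item would be unchanged):

import scrapy


class ClimateUpdateSpider(scrapy.Spider):
    name = 'climateupdate'
    start_urls = ['http://www.bom.gov.au/climate/updates/']

    # Per-spider override: this User-Agent applies to every request this spider makes,
    # without touching project-wide settings or the CrawlerProcess call
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0',
    }

    def parse(self, response):
        for link in response.xpath('//*[@id="content"]/ul/li/a/@href'):
            yield response.follow(url=link.get(), callback=self.parse_item)

    # parse_item stays the same as in the examples above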
Why This Works
Web servers often block requests with missing or suspicious User-Agent headers (flagging them as bots). Your initial fix added the valid header to the main request, but sub-requests used Scrapy's default User-Agent, which got blocked. By either reusing headers explicitly or setting a default in settings, all requests now carry a valid User-Agent that the server accepts.
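If you want to double-check which User-Agent a sub-request actually went out with, you can log it from the callback: response.request gives you the Request object that produced the response, and its headers show what was sent. A quick debugging sketch you could drop into parse_item (and remove once you're satisfied):

def parse_item(self, response):
    # response.request is the Request that produced this response;
    # its headers reveal the User-Agent that was actually sent over the wire
    self.logger.info('User-Agent sent: %s',
                     response.request.headers.get('User-Agent'))
    # ...then extract date/title/text as in the examples above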
If you still run into issues, you could try adding additional headers like Referer (set to the main URL) to mimic a real user's navigation more closely, but the User-Agent fix should resolve the 403 for this site.
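For what that could look like, here is a sketch of the Option 1 parse method with a Referer merged into the reused headers; treat it as an optional tweak rather than something this site is confirmed to require:

def parse(self, response):
    for link in response.xpath('//*[@id="content"]/ul/li/a/@href'):
        # Merge a Referer (the listing page we navigated from) into the reused headers
        headers = dict(self.custom_headers, Referer=response.url)
        yield response.follow(url=link.get(), callback=self.parse_item, headers=headers)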
The question comes from Stack Exchange; original asker: nomnomyang.




