Fix 403 Error When Accessing Sub URLs in Scrapy for BOM Climate Updates

Nice job fixing the main URL's 403 by adding a User-Agent header! The problem with your sub-URLs is that the follow-up requests aren't carrying that same header. Here are two straightforward ways to fix that:

Option 1: Reuse Headers Explicitly Across All Requests

The issue is that response.follow doesn't automatically inherit headers from the parent request unless you explicitly pass them. Storing your headers as a class attribute makes it easy to reuse them everywhere:

import scrapy
from scrapy import Request
from scrapy.crawler import CrawlerProcess

class ClimateUpdateSpider(scrapy.Spider):
    name = 'climateupdate'
    start_urls = ['http://www.bom.gov.au/climate/updates/']
    # Store headers as a class attribute for easy reuse
    custom_headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0'}

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url, headers=self.custom_headers)

    def parse(self, response):
        # Fixed XPath to scrape ALL article links (not just the first one)
        for link in response.xpath('//*[@id="content"]/ul/li/a/@href'):
            # Pass headers to follow() so sub-requests use them
            yield response.follow(url=link.get(), callback=self.parse_item, headers=self.custom_headers)

    def parse_item(self, response):
        yield {
            'date': response.xpath('//*[@id="updates"]/p[1]/time/text()').get(),
            'title': response.xpath('//*[@id="updates"]/div[1]/h1/text()').get(),
            # Join with spaces so text from adjacent nodes doesn't run together
            'text': ' '.join(x.get().strip() for x in response.xpath('//*[@class="key-points box-notice bg-grey"]//p//text()'))
        }

if __name__ == '__main__':
    process = CrawlerProcess()
    # Fixed typo: reference your spider class correctly
    process.crawl(ClimateUpdateSpider)
    process.start()

A quick note: I adjusted your XPath in parse to grab all article links instead of just the first list item, since you probably want to scrape more than one climate update. I also fixed a typo in your process.crawl() call (you had weeklymining instead of your spider class name).
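If you also want to persist the scraped items, Scrapy's feed exports can write them to a file for you. A minimal sketch, assuming Scrapy 2.1 or newer (where the FEEDS setting replaced the older FEED_URI/FEED_FORMAT pair); the filename climate_updates.json is just an example:

if __name__ == '__main__':
    process = CrawlerProcess(settings={
        # Write every item the spider yields to a JSON file
        'FEEDS': {'climate_updates.json': {'format': 'json'}}
    })
    process.crawl(ClimateUpdateSpider)
    process.start()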

Option 2: Set a Default User-Agent in Scrapy Settings (More Elegant)

For a cleaner approach, configure Scrapy to use your custom User-Agent for every request automatically. This way you don't have to pass headers manually for each request:

import scrapy
from scrapy.crawler import CrawlerProcess

class ClimateUpdateSpider(scrapy.Spider):
    name = 'climateupdate'
    start_urls = ['http://www.bom.gov.au/climate/updates/']

    def parse(self, response):
        for link in response.xpath('//*[@id="content"]/ul/li/a/@href'):
            yield response.follow(url=link.get(), callback=self.parse_item)

    def parse_item(self, response):
        yield {
            'date': response.xpath('//*[@id="updates"]/p[1]/time/text()').get(),
            'title': response.xpath('//*[@id="updates"]/div[1]/h1/text()').get(),
            # Join with spaces so text from adjacent nodes doesn't run together
            'text': ' '.join(x.get().strip() for x in response.xpath('//*[@class="key-points box-notice bg-grey"]//p//text()'))
        }

if __name__ == '__main__':
    process = CrawlerProcess(settings={
        'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0',
        # Optional: Enable cookies if the site requires them for validation
        'COOKIES_ENABLED': True
    })
    process.crawl(ClimateUpdateSpider)
    process.start()
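
If you'd rather launch the spider with the scrapy crawl command than from a script, the same override can live on the spider itself through the custom_settings class attribute. A small sketch showing just that change (the callbacks stay exactly as above):

class ClimateUpdateSpider(scrapy.Spider):
    name = 'climateupdate'
    start_urls = ['http://www.bom.gov.au/climate/updates/']
    # Per-spider settings that override the project defaults for this spider only
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0'
    }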

Why This Works

Web servers often block requests with missing or suspicious User-Agent headers (flagging them as bots). Your initial fix added the valid header to the main request, but sub-requests used Scrapy's default User-Agent, which got blocked. By either reusing headers explicitly or setting a default in settings, all requests now carry a valid User-Agent that the server accepts.
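
If you want to verify which User-Agent actually went out with each request, you can log it from the response; response.request is the Request object that produced that response. A quick sketch you could drop into parse_item:

    def parse_item(self, response):
        # Headers on response.request record what was actually sent to the server
        self.logger.info('Sent User-Agent: %s', response.request.headers.get('User-Agent'))
        # ... rest of parse_item unchanged ...

Note that Scrapy stores header values as bytes, so the log line will show something like b'Mozilla/5.0 ...'.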

If you still run into issues, you could try adding additional headers like Referer (set to the main URL) to mimic a real user's navigation more closely, but the User-Agent fix should resolve the 403 for this site.
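
Building on Option 1, that Referer could be merged into the per-request headers like so (a sketch reusing the custom_headers attribute from above):

    def parse(self, response):
        for link in response.xpath('//*[@id="content"]/ul/li/a/@href'):
            yield response.follow(
                url=link.get(),
                callback=self.parse_item,
                # Merge the shared headers with a Referer pointing at the listing page
                headers={**self.custom_headers, 'Referer': response.url}
            )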

This question originally comes from Stack Exchange and was asked by user nomnomyang.
