How to fix 403 responses on sub-URLs when scraping bom.gov.au with Scrapy, after the main URL's 403 has already been resolved
Nice job fixing the main URL's 403 by adding a User-Agent header! The problem with your sub-URLs is that those follow-up requests aren't carrying that same header—let's fix that with a couple of straightforward solutions:
Option 1: Reuse Headers Explicitly Across All Requests
The issue is that response.follow doesn't automatically inherit headers from the parent request unless you explicitly pass them. Storing your headers as a class attribute makes it easy to reuse them everywhere:
import scrapy
from scrapy import Request
from scrapy.crawler import CrawlerProcess


class ClimateUpdateSpider(scrapy.Spider):
    name = 'climateupdate'
    start_urls = ['http://www.bom.gov.au/climate/updates/']

    # Store headers as a class attribute for easy reuse
    custom_headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0'
    }

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url, headers=self.custom_headers)

    def parse(self, response):
        # Fixed XPath to scrape ALL article links (not just the first one)
        for link in response.xpath('//*[@id="content"]/ul/li/a/@href'):
            # Pass headers to follow() so sub-requests use them
            yield response.follow(url=link.get(), callback=self.parse_item,
                                  headers=self.custom_headers)

    def parse_item(self, response):
        yield {
            'date': response.xpath('//*[@id="updates"]/p[1]/time/text()').get(),
            'title': response.xpath('//*[@id="updates"]/div[1]/h1/text()').get(),
            'text': ''.join([
                x.get().strip()
                for x in response.xpath('//*[@class="key-points box-notice bg-grey"]//p//text()')
            ]),
        }


if __name__ == '__main__':
    process = CrawlerProcess()
    # Fixed typo: reference your spider class correctly
    process.crawl(ClimateUpdateSpider)
    process.start()
A quick note: I adjusted your XPath in parse to grab all article links instead of just the first list item—you probably want to scrape more than one climate update! I also fixed a typo in your process.crawl() call (you had weeklymining instead of your spider class name).
Option 2: Set a Default User-Agent in Scrapy Settings (More Elegant)
For a cleaner approach, configure Scrapy to use your custom User-Agent for every request automatically. This way you don't have to pass headers manually for each request:
import scrapy
from scrapy.crawler import CrawlerProcess


class ClimateUpdateSpider(scrapy.Spider):
    name = 'climateupdate'
    start_urls = ['http://www.bom.gov.au/climate/updates/']

    def parse(self, response):
        for link in response.xpath('//*[@id="content"]/ul/li/a/@href'):
            yield response.follow(url=link.get(), callback=self.parse_item)

    def parse_item(self, response):
        yield {
            'date': response.xpath('//*[@id="updates"]/p[1]/time/text()').get(),
            'title': response.xpath('//*[@id="updates"]/div[1]/h1/text()').get(),
            'text': ''.join([
                x.get().strip()
                for x in response.xpath('//*[@class="key-points box-notice bg-grey"]//p//text()')
            ]),
        }


if __name__ == '__main__':
    process = CrawlerProcess(settings={
        'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0',
        # Optional: enable cookies if the site requires them for validation
        'COOKIES_ENABLED': True,
    })
    process.crawl(ClimateUpdateSpider)
    process.start()
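A related variant: the same setting can also live on the spider itself via the custom_settings class attribute, which is handy if you later move this into a Scrapy project and run it with scrapy crawl. A minimal sketch of that version (same class and URLs as above; parse_item would be unchanged):

import scrapy


class ClimateUpdateSpider(scrapy.Spider):
    name = 'climateupdate'
    start_urls = ['http://www.bom.gov.au/climate/updates/']

    # Per-spider override: this User-Agent applies to every request this spider makes,
    # without touching project-wide settings or the CrawlerProcess call
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0',
    }

    def parse(self, response):
        for link in response.xpath('//*[@id="content"]/ul/li/a/@href'):
            yield response.follow(url=link.get(), callback=self.parse_item)

    # parse_item stays the same as in the examples above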
Why This Works
Web servers often block requests with missing or suspicious User-Agent headers (flagging them as bots). Your initial fix added the valid header to the main request, but sub-requests used Scrapy's default User-Agent, which got blocked. By either reusing headers explicitly or setting a default in settings, all requests now carry a valid User-Agent that the server accepts.
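If you want to double-check which User-Agent a sub-request actually went out with, you can log it from the callback: response.request gives you the Request object that produced the response, and its headers show what was sent. A quick debugging sketch you could drop into parse_item (and remove once you're satisfied):

def parse_item(self, response):
    # response.request is the Request that produced this response;
    # its headers reveal the User-Agent that was actually sent over the wire
    self.logger.info('User-Agent sent: %s',
                     response.request.headers.get('User-Agent'))
    # ...then extract date/title/text as in the examples above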
If you still run into issues, you could try adding additional headers like Referer (set to the main URL) to mimic a real user's navigation more closely, but the User-Agent fix should resolve the 403 for this site.
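For what that could look like, here is a sketch of the Option 1 parse method with a Referer merged into the reused headers; treat it as an optional tweak rather than something this site is confirmed to require:

def parse(self, response):
    for link in response.xpath('//*[@id="content"]/ul/li/a/@href'):
        # Merge a Referer (the listing page we navigated from) into the reused headers
        headers = dict(self.custom_headers, Referer=response.url)
        yield response.follow(url=link.get(), callback=self.parse_item, headers=headers)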
The question comes from Stack Exchange; original asker: nomnomyang.




