How do I fix the Crawled (403) error in Scrapy Shell and get a 200 response?
Let's tackle the 403 Forbidden response you're hitting in Scrapy Shell. It is very common when a site flags scrapers with basic anti-bot checks. Here are some fixes you can test right away:
1. Send a fuller set of request headers to mimic a real browser
You already added a User-Agent, but most sites check more header fields to verify real traffic. Try expanding your headers with common browser values:
    url = "https://www.urban.com.au/projects/melbourne-square-93-119-kavanagh-street-southbank"
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        "Referer": "https://www.urban.com.au/",
        "DNT": "1",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1"
    }
    fet = scrapy.Request(url, headers=headers)
    fetch(fet)
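If you end up testing several header combinations, a small helper keeps the shell session tidy. This is only a sketch: build_browser_headers is a hypothetical name, not part of Scrapy, and the header values are typical browser defaults rather than anything this site is known to require.

```python
def build_browser_headers(user_agent, referer=None):
    """Return a dict of headers resembling what a desktop browser sends."""
    headers = {
        "User-Agent": user_agent,
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        "Upgrade-Insecure-Requests": "1",
    }
    # Only send a Referer when the caller supplies one.
    if referer:
        headers["Referer"] = referer
    return headers


CHROME_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36"
)
headers = build_browser_headers(CHROME_UA, referer="https://www.urban.com.au/")

# In Scrapy Shell you would then run:
#   fetch(scrapy.Request(url, headers=headers))
#   response.status
```

Checking response.status after each fetch tells you which header set, if any, turns the 403 into a 200.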
2. Apply Scrapy settings to the whole shell session
Instead of adding headers per request, you can override settings for the entire session. Note that a Settings() object or a CrawlerProcess created inside the shell is not attached to the crawler the shell is already running, so it has no effect on fetch(). Pass the overrides with -s NAME=VALUE when you launch the shell instead:
    # Run this in a terminal, not inside the shell
    scrapy shell \
        -s USER_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36" \
        -s DOWNLOAD_DELAY=2 \
        -s COOKIES_ENABLED=True \
        "https://www.urban.com.au/projects/melbourne-square-93-119-kavanagh-street-southbank"
    # DOWNLOAD_DELAY=2 adds a 2-second delay to avoid tripping rate limits
    # COOKIES_ENABLED=True (the default) keeps session cookies, which many sites require
The shell fetches the URL on startup, so you can check response.status right away.
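Scrapy's command line accepts -s NAME=VALUE overrides, and that is one way to make settings actually take effect in a shell session. The sketch below assembles such a command from a dict so the quoting is handled for you. build_shell_command is a hypothetical helper, not a Scrapy function.

```python
import shlex


def build_shell_command(url, overrides):
    """Build a `scrapy shell` command line with -s NAME=VALUE overrides."""
    args = ["scrapy", "shell"]
    for name, value in overrides.items():
        args += ["-s", f"{name}={value}"]
    args.append(url)
    # shlex.join quotes any argument that contains spaces or shell metacharacters.
    return shlex.join(args)


command = build_shell_command(
    "https://www.urban.com.au/",
    {"DOWNLOAD_DELAY": 2, "COOKIES_ENABLED": True},
)
print(command)
# prints: scrapy shell -s DOWNLOAD_DELAY=2 -s COOKIES_ENABLED=True https://www.urban.com.au/
```

Once the shell is running, settings.get("USER_AGENT") or settings.getint("DOWNLOAD_DELAY") shows the values the session is really using.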
3. Fetch the site's cookies first, then request the target page
Some sites require valid session cookies to grant access. First request the homepage to get cookies, then hit your target URL:
    # Request the homepage first to pick up session cookies
    fetch("https://www.urban.com.au/")

    # Then request the target page; the shell sends the cookies it already received
    url = "https://www.urban.com.au/projects/melbourne-square-93-119-kavanagh-street-southbank"
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36"
    }
    fetch(scrapy.Request(url, headers=headers))
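If you want to see which cookies the homepage actually set, or pass them explicitly through the cookies argument of scrapy.Request, you can parse the Set-Cookie headers yourself. A minimal sketch using only the standard library follows. parse_set_cookie is a hypothetical helper, and the sample header values are made up for illustration.

```python
from http.cookies import SimpleCookie


def parse_set_cookie(header_values):
    """Turn a list of Set-Cookie header values into a {name: value} dict."""
    cookies = {}
    for raw in header_values:
        # Scrapy exposes header values as bytes; plain strings are accepted too.
        text = raw.decode("latin-1") if isinstance(raw, bytes) else raw
        jar = SimpleCookie()
        jar.load(text)
        for name, morsel in jar.items():
            cookies[name] = morsel.value
    return cookies


# Made-up sample values standing in for response.headers.getlist("Set-Cookie")
sample = [b"sessionid=abc123; Path=/; HttpOnly", b"csrftoken=xyz789; Path=/"]
session_cookies = parse_set_cookie(sample)
print(session_cookies)
```

In the shell, after fetching the homepage, you could call parse_set_cookie(response.headers.getlist("Set-Cookie")) and then fetch(scrapy.Request(url, cookies=session_cookies)).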
4. Rotate the User-Agent to get around blocks on specific UAs
If the site blocks static User-Agents, use the scrapy-fake-useragent library to generate realistic, random browser UAs:
    # Install the dependency in a terminal first (if it is not installed yet)
    pip install scrapy-fake-useragent
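Then enable the middleware in your project's settings.py and start scrapy shell from inside the project directory so the settings are loaded. Building a Settings() object inside the shell does not change the running session.
    # settings.py: swap the built-in User-Agent middleware for the random one
    DOWNLOADER_MIDDLEWARES = {
        'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
        'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
    }
Then in Scrapy Shell:
    # Send the request; the middleware picks the User-Agent
    url = "https://www.urban.com.au/projects/melbourne-square-93-119-kavanagh-street-southbank"
    fetch(scrapy.Request(url))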
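If you would rather not add a dependency just to test the idea, you can rotate User-Agents by hand. This sketch swaps in a plain random.choice over a fixed pool in place of the library; the pool entries and the pick_user_agent name are illustrative assumptions, and a real pool should use current browser strings.

```python
import random

# Illustrative pool; replace with up-to-date browser strings for real use.
USER_AGENT_POOL = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/12.1.2 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0",
]


def pick_user_agent(rng=random):
    """Return one User-Agent string chosen at random from the pool."""
    return rng.choice(USER_AGENT_POOL)


ua = pick_user_agent()

# In Scrapy Shell you would then run:
#   fetch(scrapy.Request(url, headers={"User-Agent": ua}))
```

If a particular string consistently returns 200 while others return 403, the site is probably filtering on the User-Agent alone.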
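5. Handle Cloudflare and other advanced anti-bot systems (render with a browser)
If all of the above fail, the site may be using Cloudflare or a similar advanced anti-bot system. Use scrapy-playwright to load the page through a real browser:
    # Install the dependencies
    pip install scrapy-playwright playwright
    # scrapy-playwright launches Chromium by default, so install that build
    playwright install chromium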
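Then add the following to your project's settings.py and start scrapy shell from the project directory. scrapy-playwright plugs in as a download handler and needs the asyncio reactor; it does not provide a downloader middleware.
    # settings.py: route HTTP(S) downloads through Playwright
    DOWNLOAD_HANDLERS = {
        'http': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
        'https': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
    }
    TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'
    PLAYWRIGHT_LAUNCH_OPTIONS = {'headless': True}
    PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT = 10000  # milliseconds
Then in Scrapy Shell:
    # Render the request with Playwright
    url = "https://www.urban.com.au/projects/melbourne-square-93-119-kavanagh-street-southbank"
    fetch(scrapy.Request(url, meta={'playwright': True}))
If the shell has trouble with the asyncio reactor in your Scrapy version, run the same request from a small spider instead. A headless browser is not guaranteed to pass every challenge either.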
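Before reaching for a browser, it can be worth confirming that the 403 really comes from Cloudflare. The sketch below is a rough heuristic, not a definitive test: it relies on the Server: cloudflare and CF-RAY headers that Cloudflare adds to responses. looks_like_cloudflare_block is a hypothetical name.

```python
def looks_like_cloudflare_block(status, headers):
    """Heuristic: a 403/503 response that carries Cloudflare headers."""
    # Header names are case-insensitive, so normalize before comparing.
    normalized = {name.lower(): str(value).lower() for name, value in headers.items()}
    served_by_cloudflare = (
        normalized.get("server") == "cloudflare" or "cf-ray" in normalized
    )
    return status in (403, 503) and served_by_cloudflare


print(looks_like_cloudflare_block(403, {"Server": "cloudflare", "CF-RAY": "abc123-SYD"}))
print(looks_like_cloudflare_block(403, {"Server": "nginx"}))
```

In the shell, the inputs would be response.status and a str-keyed view of the headers, for example response.headers.to_unicode_dict(). If the check is False, the block is more likely a plain header or cookie rule, and steps 1 to 4 are the better place to keep looking.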
The question comes from Stack Exchange and was asked by MD Palash Babu.




