
How do I fix the Crawled (403) error in Scrapy Shell and get a 200 response?

Let's tackle the 403 Forbidden you're hitting in Scrapy Shell. It is very common when a site flags scrapers with basic anti-bot checks. Here are fixes you can test right away, roughly in order of effort:

Approaches to resolving a 403 response in Scrapy Shell

1. Fill in the request headers to mimic a real browser

You already added a User-Agent, but most sites check more header fields to verify real traffic. Try expanding your headers with common browser values:

url = "https://www.urban.com.au/projects/melbourne-square-93-119-kavanagh-street-southbank"
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Referer": "https://www.urban.com.au/",
    "DNT": "1",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1"
}
fet = scrapy.Request(url, headers=headers)
fetch(fet)
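
Before sending the request, it can help to check which browser-style headers are still missing from your dict. The sketch below is only a diagnostic aid: the name `missing_browser_headers` and the `COMMON_BROWSER_HEADERS` list are assumptions made for illustration, taken from the headers used in this step, and not a rule any particular site is known to enforce.

```python
# Headers used in this step's example. The list is an illustrative
# assumption, not an exhaustive or site-specific requirement.
COMMON_BROWSER_HEADERS = (
    "User-Agent",
    "Accept",
    "Accept-Language",
    "Referer",
    "Upgrade-Insecure-Requests",
)


def missing_browser_headers(headers):
    """Return the common browser headers that are absent from `headers`.

    HTTP header names are case-insensitive, so the comparison is too.
    """
    present = {name.lower() for name in headers}
    return [name for name in COMMON_BROWSER_HEADERS if name.lower() not in present]


headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
}
print(missing_browser_headers(headers))
```

If the list comes back non-empty, add those headers before calling `scrapy.Request(url, headers=headers)`.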

2. Configure global Scrapy settings to strengthen the simulation

Instead of adding headers per request, you can set session-wide options. A Settings object or CrawlerProcess created inside the Shell is not attached to the crawler that fetch() uses, so it has no effect there. Pass the options with -s when you launch the Shell instead:

# Run in a terminal; each -s overrides one setting for this Shell session
# DOWNLOAD_DELAY=2 adds a 2-second delay to avoid tripping rate limits
# COOKIES_ENABLED=True (the default) keeps session cookies, which many sites require
scrapy shell \
    -s USER_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36" \
    -s DOWNLOAD_DELAY=2 \
    -s COOKIES_ENABLED=True \
    "https://www.urban.com.au/projects/melbourne-square-93-119-kavanagh-street-southbank"

Then, inside the Shell, confirm that the settings were applied and check the response:

settings.get('USER_AGENT')
response.status
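
If you work inside a Scrapy project, the same values could live in the project's settings.py so that every request picks them up. This is a minimal sketch that assumes you start `scrapy shell` from the project directory so the file is loaded. `DEFAULT_REQUEST_HEADERS` is a standard Scrapy setting, and the header values are copied from step 1.

```python
# settings.py (project-wide defaults; values mirror the ones used in this answer)
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36"
)
DOWNLOAD_DELAY = 2        # seconds to wait between requests to the same site
COOKIES_ENABLED = True    # Scrapy's default; shown here for clarity

# Sent with every request unless a request sets the same header itself
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
}
```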

3. Fetch the site's cookies first, then request the target page

Some sites require valid session cookies to grant access. First request the homepage to get cookies, then hit your target URL:

# Request the homepage first to obtain session cookies
fetch("https://www.urban.com.au/")

# Then request the target page; the Shell sends the cookies it already received
url = "https://www.urban.com.au/projects/melbourne-square-93-119-kavanagh-street-southbank"
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36"
}
fetch(scrapy.Request(url, headers=headers))
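
If you would rather see and pass the cookies explicitly, the Set-Cookie values from the first response can be turned into a plain dict. The helper below is a sketch using only the standard library. The function name `cookies_from_set_cookie` and the sample cookie names are hypothetical, not something this site is known to send.

```python
from http.cookies import SimpleCookie


def cookies_from_set_cookie(raw_headers):
    """Collect name/value pairs from a list of Set-Cookie header values.

    Items may be bytes or str, since Scrapy exposes header values as bytes.
    Attributes such as Path or HttpOnly are ignored.
    """
    cookies = {}
    for raw in raw_headers:
        text = raw.decode("latin-1") if isinstance(raw, bytes) else raw
        jar = SimpleCookie()
        jar.load(text)
        for name, morsel in jar.items():
            cookies[name] = morsel.value
    return cookies


# Hypothetical Set-Cookie values, for illustration only
sample = [b"sessionid=abc123; Path=/; HttpOnly", b"csrftoken=xyz; Path=/"]
print(cookies_from_set_cookie(sample))
```

In the Shell, after fetching the homepage, you could call `cookies_from_set_cookie(response.headers.getlist("Set-Cookie"))` and pass the result as `scrapy.Request(url, headers=headers, cookies=...)`.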

4. Rotate the User-Agent randomly to avoid UA-specific blocking

If the site blocks static User-Agents, use the scrapy-fake-useragent library to generate realistic, random browser UAs:

# Install the dependency in a terminal first (if not already installed)
pip install scrapy-fake-useragent

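Then enable the middleware in your project's settings.py. A Settings object created inside the Shell is not attached to the running crawler, so setting it there has no effect:

# settings.py: enable the random User-Agent middleware
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
}

Start scrapy shell from the project directory so these settings are loaded, then send the request:

url = "https://www.urban.com.au/projects/melbourne-square-93-119-kavanagh-street-southbank"
fetch(scrapy.Request(url))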
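
For a quick test without installing anything, a hand-rolled pool can stand in for the library. This is a swapped-in, simpler technique, not how scrapy-fake-useragent works internally. The `USER_AGENT_POOL` entries and the `pick_user_agent` name are assumptions for illustration, and you would want current browser strings in practice.

```python
import random

# Illustrative pool; replace with current, real browser User-Agent strings
USER_AGENT_POOL = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:69.0) Gecko/20100101 Firefox/69.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/12.1.2 Safari/605.1.15",
]


def pick_user_agent(rng=random):
    """Return one User-Agent chosen at random from the pool.

    `rng` can be a seeded random.Random instance for repeatable runs.
    """
    return rng.choice(USER_AGENT_POOL)


print(pick_user_agent())
```

In the Shell, each attempt could then use a fresh value: `fetch(scrapy.Request(url, headers={"User-Agent": pick_user_agent()}))`.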

5. Handle advanced anti-bot systems such as Cloudflare (render with a browser)

If all of the above fail, the site likely uses Cloudflare or a similar advanced anti-bot system. Use scrapy-playwright to render the page in a real browser:

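# Install the dependencies (run in a terminal)
pip install scrapy-playwright playwright
# scrapy-playwright launches Chromium by default, so install that browser
playwright install chromium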

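Then add the following to your project's settings.py. scrapy-playwright plugs in as a download handler rather than a downloader middleware, and it requires the asyncio-based Twisted reactor:

# settings.py: route HTTP(S) downloads through Playwright
DOWNLOAD_HANDLERS = {
    'http': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
    'https': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
}
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'
PLAYWRIGHT_LAUNCH_OPTIONS = {'headless': True}
PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT = 10000  # milliseconds

Start scrapy shell from the project directory, then render the request with Playwright:

url = "https://www.urban.com.au/projects/melbourne-square-93-119-kavanagh-street-southbank"
fetch(scrapy.Request(url, meta={'playwright': True}))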
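
Before reaching for a full browser, you may want a rough check of whether the 403 really comes from Cloudflare. The sketch below is a heuristic only: it looks for the `Server: cloudflare` and `CF-RAY` response headers that Cloudflare adds, and the function name `looks_like_cloudflare_block` and the chosen status codes are assumptions for illustration.

```python
def looks_like_cloudflare_block(status, headers):
    """Heuristically report whether a blocked response came via Cloudflare.

    `headers` is a plain dict of str header names to str values.
    Returns True only for a 403/503 response that carries Cloudflare markers.
    """
    normalized = {name.lower(): value.lower() for name, value in headers.items()}
    behind_cloudflare = (
        "cf-ray" in normalized or "cloudflare" in normalized.get("server", "")
    )
    return status in (403, 503) and behind_cloudflare


# Hypothetical response data, for illustration only
print(looks_like_cloudflare_block(403, {"Server": "cloudflare", "CF-RAY": "abc-SYD"}))
print(looks_like_cloudflare_block(403, {"Server": "nginx"}))
```

In the Shell, something like `looks_like_cloudflare_block(response.status, dict(response.headers.to_unicode_dict()))` should supply the inputs. A False result does not rule out other anti-bot vendors.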

The question comes from Stack Exchange; asked by MD Palash Babu.
