How to Fix the 504 HTTP Gateway Timeout in Scrapy-Splash
First off, let's spot the most critical issue in your code: you've mixed up the parameter order for SplashRequest. The first positional argument for SplashRequest should always be the target url, but you've passed the args dictionary first. That's causing Scrapy-Splash to send a request to an invalid "URL" (the args dict itself), which explains the 504 timeout.
Let's walk through the fixes step by step:
1. Correct the SplashRequest Parameter Order
Here's how your request should be structured, with url as the first positional argument, followed by keyword arguments:
```python
yield SplashRequest(
    url="http://www.google.ru/",
    callback=self.parse_splash,
    method="GET",
    endpoint='render.html',
    args={
        'wait': 0.5,
        'timeout': 30,  # Increased timeout since proxy access might be slower
        # Remove the 'proxy' arg here; it's already configured in Splash's proxy-profiles
    },
    splash_url="http://172.17.0.1:8050",
    headers=headers,
    dont_filter=True,
)
```
2. Avoid Proxy Configuration Conflicts
You've already set up your enterprise proxy in /etc/splash/proxy-profiles/default.ini, so there's no need to pass the proxy parameter in SplashRequest's args. Having both can lead to routing conflicts or unexpected proxy chaining, which might also contribute to timeouts.
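For comparison, a Splash proxy profile uses Splash's documented INI format with a `[proxy]` section; the host, port, and credentials below are placeholders you'd substitute with your enterprise proxy's actual details:

```ini
; /etc/splash/proxy-profiles/default.ini
; Placeholder values -- substitute your enterprise proxy's details.
[proxy]
host=proxy.example.com
port=8080
username=user
password=pass

; HTTP is Splash's default proxy type
type=HTTP
```

If this profile exists, Splash applies it to every request automatically, which is why passing `proxy` again in `args` is redundant and potentially conflicting.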
3. Adjust Timeout Values
Your current timeout in args is set to 10 seconds, which might be too short for accessing a site like Google via an enterprise proxy. Bumping it to 30 seconds (as shown above) gives the request more time to go through the proxy and load the page.
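Note that the `timeout` you pass in `args` must also stay under Splash's own server-side cap (`--max-timeout`, 60 seconds by default). If you ever need values above that, raise the cap when starting the container; a sketch, assuming the standard `scrapinghub/splash` Docker image:

```shell
# Start Splash with a higher server-side timeout cap (default is 60 seconds).
# Adjust the port mapping and image tag to match your environment.
docker run -it -p 8050:8050 scrapinghub/splash --max-timeout 90
```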
4. Verify Headers and Request Context
Make sure your headers include a valid User-Agent and any other required headers that your enterprise proxy or the target site expects. Some proxies block requests with missing or suspicious headers, which could lead to timeouts.
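As a starting point, a minimal browser-like headers dict might look like this (the exact values are illustrative, not required by Scrapy-Splash itself):

```python
# Illustrative headers; your proxy or target site may require different values.
headers = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
}
```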
5. Double-Check Scrapy-Splash Middleware Setup
Your custom_settings look correct for Scrapy-Splash, but just to confirm:
- You've disabled the default `CookiesMiddleware` and replaced it with `SplashCookiesMiddleware`
- `SplashMiddleware` is in the right position (725)
- `SplashDeduplicateArgsMiddleware` is enabled in spider middlewares
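For reference, a typical Scrapy-Splash configuration with the middleware positions recommended by the scrapy-splash README looks like this (the `SPLASH_URL` here matches the one from your request):

```python
# Typical scrapy-splash settings, following the library's documented setup.
custom_settings = {
    "SPLASH_URL": "http://172.17.0.1:8050",
    "DOWNLOADER_MIDDLEWARES": {
        # Disable Scrapy's default cookie handling...
        "scrapy.downloadermiddlewares.cookies.CookiesMiddleware": None,
        # ...and let scrapy-splash manage cookies instead.
        "scrapy_splash.SplashCookiesMiddleware": 723,
        "scrapy_splash.SplashMiddleware": 725,
        "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
    },
    "SPIDER_MIDDLEWARES": {
        "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
    },
    "DUPEFILTER_CLASS": "scrapy_splash.SplashAwareDupeFilter",
}
```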
If you've confirmed all these steps and still see timeouts, you can add more verbose logging to debug the issue:
- Enable `SPLASH_LOG_HTTP` in your settings to see the raw requests/responses between Scrapy and Splash
- Check the Splash Docker logs again while running the spider to see if there are any errors on the Splash side when handling the request
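To watch the Splash side live while the spider runs, you can tail the container's logs (the container name `splash` below is an assumption; substitute your actual container name or ID):

```shell
# Follow Splash's log output in real time; errors while rendering
# your request will show up here as the spider runs.
docker logs -f splash
```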
This question was originally asked on Stack Exchange by Eugene Moss.