
Fixing Scrapy-Splash 504 Gateway Timeout When Using a Proxy

First off, let's spot the most critical issue in your code: you've mixed up the parameter order for SplashRequest. The first positional argument for SplashRequest should always be the target url, but you've passed the args dictionary first. That's causing Scrapy-Splash to send a request to an invalid "URL" (the args dict itself), which explains the 504 timeout.

Let's walk through the fixes step by step:

1. Correct the SplashRequest Parameter Order

Here's how your request should be structured, with url as the first positional argument, followed by keyword arguments:

yield SplashRequest(
    url="http://www.google.ru/",
    callback=self.parse_splash,
    method="GET",
    endpoint='render.html',
    args={
        'wait': 0.5,
        'timeout': 30,  # Increased timeout since proxy access might be slower
        # Remove the 'proxy' arg here—you already configured it in Splash's proxy-profiles
    },
    splash_url="http://172.17.0.1:8050",
    headers=headers,
    dont_filter=True,
)

2. Avoid Proxy Configuration Conflicts

You've already set up your enterprise proxy in /etc/splash/proxy-profiles/default.ini, so there's no need to pass the proxy parameter in SplashRequest's args. Having both can lead to routing conflicts or unexpected proxy chaining, which might also contribute to timeouts.
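For reference, a Splash proxy profile is a plain INI file with a `[proxy]` section. A minimal sketch with placeholder host, port, and credentials (substitute your own enterprise proxy details):

```ini
[proxy]
; required: the proxy to route Splash's outgoing requests through
host=proxy.mycompany.example
port=8990

; optional: credentials, if your enterprise proxy requires auth
username=proxyuser
password=proxypass

; optional: HTTP (default) or SOCKS5
type=HTTP
```

Because the file is named `default.ini`, Splash applies this profile to every request automatically, which is exactly why passing `proxy` again in `args` is redundant.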

3. Adjust Timeout Values

Your current timeout in args is set to 10 seconds, which might be too short for accessing a site like Google via an enterprise proxy. Bumping it to 30 seconds (as shown above) gives the request more time to go through the proxy and load the page.
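One related detail: Splash itself caps the `timeout` argument at its server-side `--max-timeout` option (90 seconds by default), so values above that are rejected regardless of what you pass in `args`. If you ever need a higher ceiling, start the container with the flag. A sketch assuming the standard `scrapinghub/splash` image and your existing proxy-profiles directory:

```shell
docker run -it -p 8050:8050 \
    -v /etc/splash/proxy-profiles:/etc/splash/proxy-profiles \
    scrapinghub/splash \
    --proxy-profiles-path /etc/splash/proxy-profiles \
    --max-timeout 180
```

For your current 30-second timeout, the default cap is already sufficient.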

4. Verify Headers and Request Context

Make sure your headers include a valid User-Agent and any other required headers that your enterprise proxy or the target site expects. Some proxies block requests with missing or suspicious headers, which could lead to timeouts.
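As a concrete starting point, here is a hypothetical `headers` dict with a realistic desktop User-Agent; the exact headers your proxy or target site expects may differ, so treat the values below as placeholders:

```python
# Hypothetical headers for the SplashRequest above; some proxies reject
# Scrapy's default User-Agent, so a browser-like one is safer.
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/120.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}
```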

5. Double-Check Scrapy-Splash Middleware Setup

Your custom_settings look correct for Scrapy-Splash, but just to confirm:

  • You've disabled the default CookiesMiddleware and replaced it with SplashCookiesMiddleware
  • SplashMiddleware is in the right position (725)
  • SplashDeduplicateArgsMiddleware is enabled in spider middlewares
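For comparison, this is the standard wiring from the scrapy-splash README; check your `custom_settings` against it, adjusting `SPLASH_URL` to your own instance:

```python
# settings.py (or custom_settings) — reference scrapy-splash configuration
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
SPLASH_URL = 'http://172.17.0.1:8050'
```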

If you've confirmed all these steps and still see timeouts, you can add more verbose logging to debug the issue:

  • Raise Scrapy's LOG_LEVEL to DEBUG and keep scrapy-splash's SPLASH_LOG_400 setting enabled (it is by default) so error responses from Splash are logged with details
  • Check the Splash Docker logs again while running the spider to see if there are any errors on the Splash side when handling the request
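The debug settings above can be sketched as two lines in `settings.py`; `SPLASH_LOG_400` is a scrapy-splash setting (enabled by default) that logs 4xx responses returned by Splash:

```python
# Hypothetical debugging additions to settings.py
LOG_LEVEL = 'DEBUG'   # show every request/response Scrapy handles
SPLASH_LOG_400 = True  # log 4xx responses coming back from Splash
```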

This question originally comes from Stack Exchange, asked by Eugene Moss.
