How do I scrape the target HTML after JavaScript rendering with Scrapy + Splash?

Troubleshooting Scrapy + Splash Not Returning Dynamically Rendered HTML

Hey there, let’s break down why your current setup isn’t capturing that dynamic HTML you can see in your browser’s dev tools, and walk through fixes to get it working.

The most likely cause is that, although Splash does execute JavaScript, your request returns before the dynamic content has loaded, either because it doesn’t wait long enough or because it isn’t configured to wait at all. Here’s what to check and adjust:

1. Add a Wait Time for AJAX/JS Content

Most dynamic content loads asynchronously after the initial page render. Your current SplashRequest doesn’t include a wait parameter, so Splash might be returning the page before the target HTML gets injected.

Fix this by using a custom Lua script to have precise control over the wait period:

# Update your start_requests method with this
def start_requests(self):
    # Lua script to navigate, wait for JS, and return full rendered HTML
    lua_script = """
    function main(splash, args)
        -- assert makes a failed page load raise an error instead of passing silently
        assert(splash:go(args.url))
        -- Wait 5 seconds for AJAX calls to finish and content to render
        splash:wait(5)
        return splash:html()
    end
    """
    yield SplashRequest(
        url='https://www.gaslicht.com/stroom-vergelijken?partial=true&aanbieders=eneco&skip=0&take=10&_=1559207102962',
        callback=self.parse,
        endpoint='execute',
        args={'lua_source': lua_script}
    )
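
A fixed 5-second wait can be too long for fast pages and too short for slow ones. As a hedged alternative, the sketch below polls for a CSS selector and returns as soon as a matching element appears, giving up after a maximum wait. The names `build_wait_args`, `selector`, `max_wait`, and `poll_interval` are illustrative and not part of the original answer, `.result-row` is a placeholder selector, and the script assumes a Splash version that provides `splash:select`.

```python
# Lua script: poll for a CSS selector instead of sleeping for a fixed time.
WAIT_FOR_SELECTOR_LUA = """
function main(splash, args)
    assert(splash:go(args.url))
    local waited = 0
    -- splash:select returns nil until a matching element is in the DOM
    while not splash:select(args.selector) and waited < args.max_wait do
        splash:wait(args.poll_interval)
        waited = waited + args.poll_interval
    end
    return splash:html()
end
"""


def build_wait_args(selector, max_wait=10.0, poll_interval=0.5):
    """Build the `args` dict for SplashRequest(endpoint='execute', ...)."""
    if max_wait <= 0 or poll_interval <= 0:
        raise ValueError("max_wait and poll_interval must be positive")
    return {
        'lua_source': WAIT_FOR_SELECTOR_LUA,
        'selector': selector,
        'max_wait': max_wait,
        'poll_interval': poll_interval,
    }


# Hypothetical selector; replace it with one that matches your target content.
splash_args = build_wait_args('.result-row', max_wait=15)
```

Pass the result as `args=splash_args` together with `endpoint='execute'`. Keep `max_wait` below the Splash request timeout, or the request fails before the loop gives up.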

2. Spoof a Realistic User Agent

Many sites block or serve different content to non-browser user agents. Splash uses a default UA that’s easy to detect, so override it to match a real browser:

# Update the Lua script to include a valid user agent
lua_script = """
function main(splash, args)
    splash:set_user_agent(args.ua)
    -- assert makes a failed page load raise an error instead of passing silently
    assert(splash:go(args.url))
    splash:wait(5)
    return splash:html()
end
"""

yield SplashRequest(
    # ... other parameters ...
    args={
        'lua_source': lua_script,
        'ua': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36'
    }
)

3. Verify Splash is Running Correctly

Double-check that your Splash instance at http://localhost:8050 is working:

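  • Visit the Splash web UI directly, paste your target URL, and see if the rendered page includes your desired HTML.
  • If it doesn’t, the problem lies with Splash rather than with your spider. Splash renders pages with its own Qt-based browser engine, not Firefox, so check that the service itself is healthy and up to date; the official Docker image (scrapinghub/splash) bundles everything it needs.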
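
The same check can be scripted against the Splash HTTP API, which exposes a `render.html` endpoint that takes `url` and `wait` query parameters. The sketch below uses only the standard library. `splash_render_url` and `fetch_rendered_html` are illustrative helper names, and the base URL assumes the local instance mentioned above.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def splash_render_url(target_url, splash_base='http://localhost:8050', wait=5):
    """Return the render.html URL that asks Splash to render target_url."""
    query = urlencode({'url': target_url, 'wait': wait})
    return f"{splash_base.rstrip('/')}/render.html?{query}"


def fetch_rendered_html(target_url, marker, **kwargs):
    """Fetch the rendered page from Splash and report whether `marker` occurs in it.

    Needs a running Splash instance, so it is defined here but not called.
    """
    with urlopen(splash_render_url(target_url, **kwargs), timeout=60) as resp:
        html = resp.read().decode('utf-8', errors='replace')
    return marker in html


check_url = splash_render_url('https://www.gaslicht.com/stroom-vergelijken')
print(check_url)
```

If `fetch_rendered_html(url, 'some text you expect')` returns `False` here as well, the spider is not at fault and the wait time or the Splash setup needs attention.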

4. Add Common Browser Headers

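Some sites use headers like Referer or Accept to validate requests. Add these to make your request look more like a real browser’s. Note that with the execute endpoint, scrapy-splash only forwards the request headers to Splash as args.headers. The Lua script has to pass them to splash:go itself, otherwise they never reach the site. Scrapy’s default User-Agent is forwarded along with them, so set User-Agent in the same dict:

# Add headers to your SplashRequest
headers = {
    # Set here too, so Scrapy's default User-Agent is not forwarded to the site
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Referer': 'https://www.gaslicht.com/'
}

# In the Lua script, apply the forwarded headers to the initial request:
#     assert(splash:go{args.url, headers=args.headers})

yield SplashRequest(
    # ... other parameters ...
    headers=headers
)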

Modified Full Code Example

Here’s your updated spider with all these fixes applied:

import scrapy
from scrapy_splash import SplashRequest
from bs4 import BeautifulSoup

class NetherSplashSpider(scrapy.Spider):
    name = 'nether_splash'
    download_delay = 10
    custom_settings = {
        'SPLASH_URL': 'http://localhost:8050',
        'DOWNLOADER_MIDDLEWARES': {
            'scrapy_splash.SplashCookiesMiddleware': 723,
            'scrapy_splash.SplashMiddleware': 725,
            'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
        },
        'SPIDER_MIDDLEWARES': {
            'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
        },
        'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
    }

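    def start_requests(self):
        # Lua script: set the UA, load the page with the forwarded headers,
        # wait for the JS to finish, and return the rendered HTML
        lua_script = """
        function main(splash, args)
            splash:set_user_agent(args.ua)
            -- With the execute endpoint, headers must be passed to splash:go explicitly
            assert(splash:go{args.url, headers=args.headers})
            splash:wait(5)
            return splash:html()
        end
        """
        user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36'
        headers = {
            # Set here too, so Scrapy's default User-Agent is not forwarded to the site
            'User-Agent': user_agent,
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.5',
            'Referer': 'https://www.gaslicht.com/'
        }
        yield SplashRequest(
            url='https://www.gaslicht.com/stroom-vergelijken?partial=true&aanbieders=eneco&skip=0&take=10&_=1559207102962',
            callback=self.parse,
            endpoint='execute',
            args={
                'lua_source': lua_script,
                'ua': user_agent
            },
            headers=headers
        )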

    def parse(self, response):
        filename = 'splash.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        
        # Quick check to verify target content exists
        soup = BeautifulSoup(response.body, 'html.parser')
        # Replace 'your-target-selector' with the actual CSS selector for your content
        # (select_one takes CSS selectors only, not XPath)
        target = soup.select_one('your-target-selector')
        if target:
            self.logger.info(f"Found target content: {target.get_text(strip=True)}")
        else:
            self.logger.warning("Target content still missing - might need longer wait time or additional headers")

Final Notes

If you still can’t get the content, inspect your browser’s network tab to see if the target data is loaded via a separate AJAX API call. Sometimes it’s more efficient to call that API directly with Scrapy instead of rendering the entire page with Splash.
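
If the data does come from a separate endpoint, paging through it directly is often simpler than rendering. The URL in the question already carries `skip` and `take` query parameters, which look like offset pagination, and a `_` parameter that looks like a cache-busting timestamp. Both readings are assumptions, as is the helper name `page_urls`. A minimal standard-library sketch that generates the page URLs:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def page_urls(url, pages, take=10):
    """Yield `pages` copies of `url` with skip/take set for successive pages.

    Assumes `skip` is a row offset and `take` a page size; verify this in the
    browser's network tab before relying on it.
    """
    parts = urlsplit(url)
    # Drop the assumed cache-busting `_` value and any existing skip/take values
    base = [(key, value) for key, value in parse_qsl(parts.query)
            if key not in ('skip', 'take', '_')]
    for page in range(pages):
        query = urlencode(base + [('skip', page * take), ('take', take)])
        yield urlunsplit(parts._replace(query=query))


urls = list(page_urls(
    'https://www.gaslicht.com/stroom-vergelijken'
    '?partial=true&aanbieders=eneco&skip=0&take=10&_=1559207102962',
    pages=3,
))
```

Each URL can then be yielded as a plain `scrapy.Request`, with no Splash involved.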

The question comes from Stack Exchange and was asked by pap.
