How do you keep a page alive in Scrapy-Splash so a JS page can be scraped multiple times?

Keep Your Page Alive in Scrapy-Splash for Repeated JavaScript Execution

Great question! The core issue is that Splash treats each request as a fresh browser context by default. Passing cookies alone preserves your HTTP session, but not the page state itself (DOM elements, in-memory JS variables, loaded resources). To avoid reloading the 7-second page every time, you need to reuse the same browser context, and this answer relies on the session_id parameter to do that. One caveat: scrapy-splash documents session_id as the key its cookie middleware uses to track cookie sessions, so confirm on your Splash version that page state really does survive between requests before you build on it.

Why Your Original Script Failed

First, let's fix the immediate errors in your Lua code:

  • The last_response variable is only defined inside the if args.start_url == true block. In the else branch it is therefore nil, and indexing it (for example last_response.headers) raises an error.
  • Without a persistent session, the else branch runs in a brand-new, empty browser context. No page has been loaded for evaljs to run against, which explains the 400 Bad Request.

Fixed Lua Script with Session Support

Here's a revised script that uses session_id to maintain the same browser context across multiple requests, plus fixes the variable scope issue:

function main(splash, args)
    -- Initialize cookies if provided (for initial session setup)
    if args.cookies then
        splash:init_cookies(args.cookies)
    end

    -- Query-string arguments arrive as strings, and any string is truthy in
    -- Lua, so accept both the boolean true and the string "true"
    local is_first_load = args.is_first_load == true or args.is_first_load == "true"
    local js_result = nil

    if is_first_load then
        -- First request: load the page and wait for it to fully initialize
        assert(splash:go(args.url))
        assert(splash:wait(7)) -- Match your page's load time
    else
        -- Subsequent requests: run JS on the already-loaded page
        js_result = splash:evaljs(args.jscript)
    end

    -- Grab the latest response from the history; the history is empty when
    -- no page has been loaded, so guard against indexing nil
    local entries = splash:history()
    local last_response = nil
    if #entries > 0 then
        last_response = entries[#entries].response
    end

    return {
        url = splash:url(),
        headers = last_response and last_response.headers,
        http_status = last_response and last_response.status,
        cookies = splash:get_cookies(),
        html = splash:html(),
        js_execution_result = js_result -- Return your JS output directly
    }
end
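On the Scrapy side, the table returned by the script arrives as a JSON object. The following is a minimal sketch of reading it, assuming the field names used in the script above; the `SplashResult` class and `parse_splash_result` helper are hypothetical names introduced for illustration, and the sample body is made up. Fields whose Lua value is nil are omitted from the table, so the parser treats them as optional.

```python
import json
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class SplashResult:
    """Subset of the fields returned by the Lua script (hypothetical container)."""
    url: str
    http_status: Optional[int]
    html: str
    js_execution_result: Any


def parse_splash_result(body: str) -> SplashResult:
    """Parse the JSON body produced by the script's return table."""
    data = json.loads(body)
    return SplashResult(
        url=data.get("url", ""),
        # Absent when the script had no response to report
        http_status=data.get("http_status"),
        html=data.get("html", ""),
        # Absent on the first load, because the script leaves it as nil
        js_execution_result=data.get("js_execution_result"),
    )


if __name__ == "__main__":
    # Made-up sample body shaped like a first-load response
    sample = '{"url": "https://example.com/", "http_status": 200, "html": "<html></html>"}'
    result = parse_splash_result(sample)
    print(result.url, result.http_status, result.js_execution_result)
```

Using `.get` rather than direct indexing keeps the first-load and follow-up responses on the same code path, since only the follow-up carries a JS result.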

How to Use This Script

  1. First Request (Load the Page)

    • Pass these parameters to Splash:
      • is_first_load=true
      • url=https://your-target-page.com
      • session_id=my_unique_session_123 (pick any unique string, keep it for all follow-ups)
      • Optional: cookies (if you need to pre-authenticate)
    • This will load the page, wait 7 seconds for it to fully load, and return the initial page state + cookies.
  2. Subsequent Requests (Run JS)

    • Reuse the same session_id and pass:
      • is_first_load=false
      • jscript=your_javascript_code_here (e.g., document.querySelector('.data-element').textContent)
    • Splash will reuse the existing browser context, so no page reload is needed. The JS result comes back in the js_execution_result field, along with the updated page state.
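The two request shapes above can be sketched in Python as plain argument dictionaries for Splash's execute endpoint. This is a minimal sketch rather than a definitive client: the helper names `new_session_id`, `first_load_args`, and `followup_args` are hypothetical, and `session_id` is passed exactly as this answer describes, so check against your Splash version that it has the intended effect. Sending the arguments as a JSON body keeps `is_first_load` a real boolean instead of the string "false", which Lua would treat as truthy.

```python
import json
import uuid


def new_session_id() -> str:
    """Return a unique string to reuse for every request against one page."""
    return uuid.uuid4().hex


def first_load_args(lua_source, url, session_id, cookies=None):
    """Arguments for the first request, which loads the page."""
    args = {
        "lua_source": lua_source,
        "url": url,
        "session_id": session_id,  # as described in this answer (assumption)
        "is_first_load": True,
    }
    if cookies:
        args["cookies"] = cookies  # optional pre-authentication cookies
    return args


def followup_args(lua_source, session_id, jscript):
    """Arguments for a later request, which only runs JavaScript."""
    return {
        "lua_source": lua_source,
        "session_id": session_id,  # must match the first request
        "is_first_load": False,
        "jscript": jscript,
    }


if __name__ == "__main__":
    lua = "function main(splash, args) return {} end"  # placeholder script
    sid = new_session_id()
    first = first_load_args(lua, "https://your-target-page.com", sid)
    later = followup_args(lua, sid, "document.title")
    # json.dumps emits true/false, so Lua receives real booleans
    print(json.dumps(first, indent=2))
    print(json.dumps(later, indent=2))
```

With scrapy-splash, the same dictionaries could be passed as the `args` of a `SplashRequest` that uses `endpoint='execute'`.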

Key Notes

  • Session Timeout: An idle session will not be kept forever. Check the documentation for your Splash version to see how long state is retained and whether that is configurable, rather than assuming a specific setting name.
  • Resource Management: Each session uses a browser instance, so don't leave unused sessions hanging for too long to avoid resource bloat.
  • Wait Time: The splash:wait(7) in the first request ensures your page's JS finishes initializing—tweak this if your page loads faster/slower.

The question comes from Stack Exchange; asked by Matts.