How do I keep a page alive in Scrapy-Splash to scrape a JS page multiple times?
Great question! The core issue is that Splash treats each request as a fresh browser context by default. Passing cookies alone preserves your HTTP session, but not the actual page state (DOM elements, in-memory JS variables, or loaded resources). To avoid reloading the 7-second page every time, you need to reuse the same browser context, and this answer relies on the `session_id` parameter for that.
Why Your Original Script Failed
First, let's fix the immediate errors in your Lua code:
- The `last_response` variable is only defined inside the `if args.start_url == true` block. When the `else` branch runs, the name is out of scope and evaluates to `nil`, so indexing it raises an error.
- Without a persistent session, the `else` branch runs in a brand-new, empty browser context. There is no loaded page on which to run `evaljson`, which explains the 400 bad request.
Fixed Lua Script with Session Support
Here's a revised script that uses `session_id` to maintain the same browser context across multiple requests, plus fixes the variable scope issue:

```lua
function main(splash, args)
  -- Initialize cookies if provided (for initial session setup)
  if args.cookies then
    splash:init_cookies(args.cookies)
  end

  local js_result = nil

  if args.is_first_load then
    -- First request: load the page and wait for it to fully initialize
    assert(splash:go(args.url))
    assert(splash:wait(7))  -- Match your page's load time
  else
    -- Subsequent requests: run JS on the already-loaded page
    js_result = splash:evaljs(args.jscript)
  end

  -- Grab the latest response from the history. The history is empty when
  -- no page has been loaded in this context, so fall back to an empty table
  -- instead of indexing nil.
  local entries = splash:history()
  local last_entry = entries[#entries]
  local last_response = last_entry and last_entry.response or {}

  return {
    url = splash:url(),
    headers = last_response.headers,
    http_status = last_response.status,
    cookies = splash:get_cookies(),
    html = splash:html(),
    js_execution_result = js_result,  -- Return your JS output directly
  }
end
```
How to Use This Script
First Request (Load the Page)
- Pass these parameters to Splash:
  - `is_first_load=true`
  - `url=https://your-target-page.com`
  - `session_id=my_unique_session_123` (pick any unique string, keep it for all follow-ups)
  - Optional: `cookies` (if you need to pre-authenticate)
- This will load the page, wait 7 seconds for it to fully load, and return the initial page state + cookies.
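As a rough sketch of the first request, the parameters above can be assembled into a JSON body for Splash's `/execute` endpoint, where `lua_source` carries the script and every other key becomes available as `args.<key>`. The helper name `build_first_load_payload` and the `LUA_SOURCE` placeholder are made up for illustration. The sketch also assumes, as this answer does, that Splash keeps page state per `session_id`. The scrapy-splash README describes `session_id` as a cookie-session key and Splash itself as stateless, so verify that behavior against your own setup.

```python
import json

# Placeholder: paste the full Lua script from the previous section here.
LUA_SOURCE = "function main(splash, args) ... end"


def build_first_load_payload(url, session_id, cookies=None):
    """Build the JSON body for the first request, which loads the page."""
    payload = {
        "lua_source": LUA_SOURCE,
        "url": url,
        "is_first_load": True,  # serialized as JSON true, seen as true in Lua
        "session_id": session_id,  # assumption: honored as described in this answer
    }
    if cookies is not None:
        payload["cookies"] = cookies  # optional pre-authentication cookies
    return payload


payload = build_first_load_payload(
    "https://your-target-page.com", "my_unique_session_123"
)
# This string would be POSTed to <splash-url>/execute
# with the header Content-Type: application/json.
print(json.dumps(payload, indent=2))
```

Building the body separately from sending it keeps the sketch independent of any particular HTTP client.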
Subsequent Requests (Run JS)
- Reuse the same `session_id` and pass:
  - `is_first_load=false`
  - `jscript=your_javascript_code_here` (e.g., `document.querySelector('.data-element').textContent`)
- Splash will reuse the existing browser context, so no page reload is needed. You'll get the JS execution result directly in the `js_execution_result` field, along with updated page state.
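A minimal sketch of the follow-up request, assuming the arguments are sent as a JSON body to Splash's `/execute` endpoint. One detail worth noting: in Lua only `nil` and `false` are falsy, so `is_first_load` should travel as a JSON boolean. Sent as the query-string text `false`, it would arrive as a non-empty string and select the first-load branch. The helper names and the sample response are hypothetical, and `session_id` is passed through exactly as this answer describes without the sketch verifying what Splash does with it.

```python
import json


def build_followup_payload(lua_source, session_id, jscript):
    """Build the JSON body for a follow-up request that only runs JS."""
    return {
        "lua_source": lua_source,
        "is_first_load": False,  # JSON false, so the Lua script takes the else branch
        "session_id": session_id,  # same value as in the first request
        "jscript": jscript,
    }


def extract_js_result(response_text):
    """Read js_execution_result from the script's JSON response.

    Lua drops table fields whose value is nil, so the key may be missing.
    """
    return json.loads(response_text).get("js_execution_result")


payload = build_followup_payload(
    "function main(splash, args) ... end",  # placeholder for the real script
    "my_unique_session_123",
    "document.querySelector('.data-element').textContent",
)
print(json.dumps(payload, indent=2))

# Made-up response body, used only to show the parsing step.
sample_response = '{"url": "https://your-target-page.com", "js_execution_result": "42"}'
print(extract_js_result(sample_response))
```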
Key Notes
- Session Timeout: Splash automatically cleans up inactive sessions after a default timeout (adjustable via the `session_timeout` config if needed).
- Resource Management: Each session uses a browser instance, so don't leave unused sessions hanging for too long to avoid resource bloat.
- Wait Time: The `splash:wait(7)` in the first request ensures your page's JS finishes initializing. Tweak this if your page loads faster or slower.
This question comes from Stack Exchange; it was asked by Matts.




