How do I scrape the JS-rendered target HTML with Scrapy + Splash?
Hey there, let’s break down why your current setup isn’t capturing that dynamic HTML you can see in your browser’s dev tools, and walk through fixes to get it working.
The most likely cause is that, although Splash does execute JavaScript, your request returns before the dynamic content has loaded, or is not configured to look like a real browser. Here’s what to check and adjust:
1. Add a Wait Time for AJAX/JS Content
Most dynamic content loads asynchronously after the initial page render. Your current SplashRequest doesn’t include a wait parameter, so Splash might be returning the page before the target HTML gets injected.
Fix this by using a custom Lua script to have precise control over the wait period:
    # Update your start_requests method with this
    def start_requests(self):
        # Lua script to navigate, wait for JS, and return full rendered HTML
        lua_script = """
        function main(splash, args)
            splash:go(args.url)
            -- Wait 5 seconds for AJAX calls to finish and content to render
            splash:wait(5)
            return splash:html()
        end
        """
        yield SplashRequest(
            url='https://www.gaslicht.com/stroom-vergelijken?partial=true&aanbieders=eneco&skip=0&take=10&_=1559207102962',
            callback=self.parse,
            endpoint='execute',
            args={'lua_source': lua_script}
        )
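A fixed 5-second wait is either too long or too short on some runs. As a rough sketch of an alternative, the script below polls for an element and returns as soon as it appears. The names `WAIT_FOR_SELECTOR_LUA`, `build_wait_args` and `max_wait` are made up for this example, and it assumes a Splash version that provides `splash:select` (3.0 or later):

```python
# Sketch: poll for a CSS selector instead of sleeping for a fixed time.
# Helper and argument names are illustrative, not part of scrapy-splash.

WAIT_FOR_SELECTOR_LUA = """
function main(splash, args)
  splash:go(args.url)
  local waited = 0
  -- Poll every 0.5 s until the element exists or max_wait is reached
  while waited < args.max_wait do
    if splash:select(args.selector) then
      break
    end
    splash:wait(0.5)
    waited = waited + 0.5
  end
  return splash:html()
end
"""


def build_wait_args(selector, max_wait=10.0):
    """Return the `args` dict to pass to SplashRequest(endpoint='execute')."""
    if max_wait <= 0:
        raise ValueError("max_wait must be positive")
    return {
        "lua_source": WAIT_FOR_SELECTOR_LUA,
        "selector": selector,
        # Not named `timeout`: Splash reserves that for its render timeout
        "max_wait": max_wait,
    }
```

You would then pass `args=build_wait_args('your-target-selector')` to the `SplashRequest`, replacing the placeholder with a selector that only matches once the dynamic content is in place.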
2. Spoof a Realistic User Agent
Many sites block or serve different content to non-browser user agents. Splash uses a default UA that’s easy to detect, so override it to match a real browser:
    # Update the Lua script to include a valid user agent
    lua_script = """
    function main(splash, args)
        splash:set_user_agent(args.ua)
        splash:go(args.url)
        splash:wait(5)
        return splash:html()
    end
    """
    yield SplashRequest(
        # ... other parameters ...
        args={
            'lua_source': lua_script,
            'ua': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36'
        }
    )
3. Verify Splash is Running Correctly
Double-check that your Splash instance at http://localhost:8050 is working:
- Visit the Splash web UI directly, paste your target URL, and see if the rendered page includes your desired HTML.
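- If it doesn’t, the problem is on the Splash side rather than in your spider. Splash renders with its own WebKit-based engine, not Firefox, so check the Splash logs and restart or update the instance (for example, by pulling the latest Docker image).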
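If you’d rather script that check than click through the UI, here is a minimal sketch that asks Splash’s `render.html` endpoint for the page and looks for a marker string. It assumes Splash is at `http://localhost:8050` as above; the function names and the `marker` argument are made up for illustration:

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_render_url(target_url, splash_url="http://localhost:8050", wait=5):
    """Build a Splash render.html URL for target_url with a wait in seconds."""
    query = urlencode({"url": target_url, "wait": wait})
    return f"{splash_url.rstrip('/')}/render.html?{query}"


def rendered_page_contains(target_url, marker, **kwargs):
    """Fetch the rendered HTML from Splash and report whether marker is in it.

    Needs a running Splash instance; raises URLError if it is unreachable.
    """
    with urlopen(build_render_url(target_url, **kwargs), timeout=60) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    return marker in html
```

If `rendered_page_contains(...)` returns False for text you can see in your browser, the issue is in how Splash renders the page, not in your Scrapy settings.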
4. Add Common Browser Headers
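Some sites use headers like Referer or Accept to validate requests. Add these to make your request look more like a real browser’s. Note that with endpoint='execute', scrapy-splash only hands these headers to the script as args.headers, so the Lua script has to forward them in its splash:go call:

    # Add headers to your SplashRequest
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.5',
        'Referer': 'https://www.gaslicht.com/'
    }
    yield SplashRequest(
        # ... other parameters ...
        # In the Lua script, forward them with:
        #   splash:go{args.url, headers=args.headers}
        headers=headers
    )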
Modified Full Code Example
Here’s your updated spider with all these fixes applied:
    import scrapy
    from scrapy_splash import SplashRequest
    from bs4 import BeautifulSoup


    class NetherSplashSpider(scrapy.Spider):
        name = 'nether_splash'
        download_delay = 10
        custom_settings = {
            'SPLASH_URL': 'http://localhost:8050',
            'DOWNLOADER_MIDDLEWARES': {
                'scrapy_splash.SplashCookiesMiddleware': 723,
                'scrapy_splash.SplashMiddleware': 725,
                'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
            },
            'SPIDER_MIDDLEWARES': {
                'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
            },
            'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
        }

        def start_requests(self):
            lua_script = """
            function main(splash, args)
                splash:set_user_agent(args.ua)
                -- Forward the Scrapy request headers to the target site
                splash:go{args.url, headers=args.headers}
                splash:wait(5)
                return splash:html()
            end
            """
            user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36'
            headers = {
                # Repeated here so the forwarded headers do not carry
                # Scrapy's default User-Agent
                'User-Agent': user_agent,
                'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
                'Accept-Language': 'en-US,en;q=0.5',
                'Referer': 'https://www.gaslicht.com/'
            }
            yield SplashRequest(
                url='https://www.gaslicht.com/stroom-vergelijken?partial=true&aanbieders=eneco&skip=0&take=10&_=1559207102962',
                callback=self.parse,
                endpoint='execute',
                args={'lua_source': lua_script, 'ua': user_agent},
                headers=headers
            )

        def parse(self, response):
            # Save the rendered HTML so it can be inspected by hand
            filename = 'splash.html'
            with open(filename, 'wb') as f:
                f.write(response.body)

            # Quick check to verify target content exists
            soup = BeautifulSoup(response.body, 'html.parser')
            # Replace 'your-target-selector' with the actual CSS selector for your content
            target = soup.select_one('your-target-selector')
            if target:
                self.logger.info(f"Found target content: {target.get_text(strip=True)}")
            else:
                self.logger.warning("Target content still missing - might need longer wait time or additional headers")
Final Notes
If you still can’t get the content, inspect your browser’s network tab to see if the target data is loaded via a separate AJAX API call. Sometimes it’s more efficient to call that API directly with Scrapy instead of rendering the entire page with Splash.
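Your URL already looks like such an endpoint (`partial=true` plus `skip`/`take`), so here is a small sketch of building page URLs for it directly. It assumes `skip` and `take` behave as offset and page size, and that the trailing `_` parameter is only a cache-busting timestamp that can be dropped; neither is confirmed for this site, and `build_page_url` / `page_urls` are made-up helper names:

```python
from urllib.parse import urlencode

# Endpoint and parameter names are taken from the URL in the question
BASE_URL = "https://www.gaslicht.com/stroom-vergelijken"


def build_page_url(skip, take=10, provider="eneco"):
    """Build the partial-page URL for one slice of results."""
    params = {
        "partial": "true",
        "aanbieders": provider,
        "skip": skip,   # assumed: offset of the first result
        "take": take,   # assumed: number of results per page
    }
    return f"{BASE_URL}?{urlencode(params)}"


def page_urls(pages, take=10):
    """Return URLs for the first `pages` slices of results."""
    return [build_page_url(i * take, take) for i in range(pages)]
```

Each URL can then go into a plain `scrapy.Request`, which skips Splash entirely if the endpoint returns the HTML fragment you need.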
The question comes from Stack Exchange; asked by pap.




