How do I scrape the real media URL instead of a Blob URL in JavaScript? Troubleshooting a NightmareJS scraping problem
Hey there, let's sort out this problem where you're getting Blob URLs instead of the actual video source with NightmareJS. The root cause is that the site uses HLS streaming (.m3u8 playlists) and plays it through the MediaSource API. The page's player script attaches a MediaSource object to the video element through a Blob URL, so reading the src attribute only returns that Blob URL, not the real stream URL. Here's how to adjust your code to capture the actual .m3u8 link:
Step 1: Capture .m3u8 Requests via Nightmare's Network Listener
Instead of extracting the video element's src, we'll listen to the network traffic Nightmare generates and pick out the .m3u8 stream URL. This works because the player has to fetch the .m3u8 playlist over the network before it can feed any media into the Blob-backed video element.
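The URL check can live in a small function of its own, which makes it easy to try against sample URLs before wiring it into a listener. The sketch below is illustrative only: isHlsPlaylistUrl and collectPlaylists are hypothetical names, and it assumes stream URLs are absolute and may carry query-string tokens after the .m3u8 extension.

```javascript
// Hypothetical helper (not part of Nightmare): does this URL point at an HLS playlist?
function isHlsPlaylistUrl(rawUrl) {
  if (typeof rawUrl !== "string") return false;
  let pathname;
  try {
    // The WHATWG URL class is a global in Node.js and in browsers.
    pathname = new URL(rawUrl).pathname;
  } catch (e) {
    return false; // not an absolute URL
  }
  // Looking at the path only means "?token=..." suffixes do not hide the extension.
  return pathname.toLowerCase().endsWith(".m3u8");
}

// Keep only the playlist URLs from a list of observed request URLs, without duplicates.
function collectPlaylists(urls) {
  return [...new Set(urls.filter(isHlsPlaylistUrl))];
}
```

Checking the parsed path rather than the raw string is a design choice: a plain /\.m3u8$/ test misses URLs that end in a signed query string.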
Modified Code Example
Here's your updated route handler with the necessary changes:
app.get("/video/:VideoName", function(req, res) {
  var VideoName = req.params.VideoName;
  request("https://sample.com/videos/" + VideoName, function(err, response, html) {
    if (!err && response.statusCode == 200) {
      const $ = cheerio.load(html);
      const videoInfo = $(".video-info");
      const includesVideo = VidLinks.find(e => e.name == VideoName);
      if (includesVideo) {
        // Already captured on an earlier request: reuse the stored entry
        res.render("videoPlayer", { episode: includesVideo });
      } else {
        var iframeLink = videoInfo.find("iframe").attr("src");
        if (!iframeLink) {
          // Bail out before starting a browser if there is nothing to load
          return res.send("No iframe found on the video page");
        }
        // Only a protocol-relative prefix ("//host/...") needs a scheme added
        iframeLink = iframeLink.replace(/^\/\//, "https://");

        const Nightmare = require('nightmare');
        const nightmare = Nightmare({ show: true });
        let m3u8Url = null; // Store the captured .m3u8 URL

        // Listen for network responses to capture .m3u8 links. Register this
        // before .goto(). Nightmare forwards Electron webContents events;
        // 'did-get-response-details' is assumed here (older Electron versions),
        // since 'request' is not among the events Nightmare documents.
        // The URL's position in the arguments is not relied on: every string
        // argument is checked.
        nightmare.on('did-get-response-details', (...args) => {
          // Match .m3u8 with an optional query string (adjust for variant streams)
          const hit = args.find(a => typeof a === 'string' && /\.m3u8(\?|$)/i.test(a));
          if (hit) {
            console.log("Found HLS stream:", hit);
            m3u8Url = hit;
          }
        });

        nightmare
          .goto(iframeLink)
          .click("#myVideo") // Trigger video loading to initiate stream requests
          .wait(3000) // Give time for the stream to load (adjust delay based on site speed)
          .end()
          .then(() => {
            if (m3u8Url) {
              // Store the actual stream URL instead of Blob
              VidLinks.push({ name: VideoName, url: m3u8Url });
              res.render("videoPlayer", { video: m3u8Url });
            } else {
              res.send("Failed to capture video stream URL");
            }
          })
          .catch((error) => {
            console.error("Nightmare error:", error);
            res.send("Error capturing video stream");
          });
      }
    } else {
      res.send("500 Error, Try Again.");
      console.log("ERROR IS HERE! - " + err);
    }
  });
});
Key Changes Explained
- Network Request Listener: A nightmare.on(...) handler, registered before .goto(), watches the traffic the page generates. We check each URL for a .m3u8 playlist to identify the HLS stream source.
- Wait for Stream Initialization: The .wait(3000) gives the browser time to fetch the .m3u8 file after clicking the video. Tweak this delay if the site loads streams faster or slower.
- Store the Real Stream URL: Instead of saving the Blob URL to VidLinks, we store the actual .m3u8 playlist address.
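If guessing the right .wait() delay is awkward, one alternative is to poll until the listener has stored a URL and give up after a timeout. This is a swapped-in technique, not something the route handler above does: waitFor is a hypothetical helper in plain JavaScript, and the commented usage assumes the nightmare instance and the m3u8Url variable from that handler.

```javascript
// Hypothetical helper: resolve with the first truthy value returned by `predicate`,
// or reject once `timeoutMs` has elapsed without one.
function waitFor(predicate, { timeoutMs = 10000, intervalMs = 100 } = {}) {
  return new Promise((resolve, reject) => {
    const startedAt = Date.now();
    const check = () => {
      let value;
      try {
        value = predicate();
      } catch (err) {
        return reject(err); // a throwing predicate fails the wait immediately
      }
      if (value) return resolve(value);
      if (Date.now() - startedAt >= timeoutMs) {
        return reject(new Error("Timed out after " + timeoutMs + " ms"));
      }
      setTimeout(check, intervalMs); // try again a little later
    };
    check();
  });
}

// Possible use in the route handler (assumes `nightmare` and `m3u8Url` exist there):
//   nightmare.goto(iframeLink).click("#myVideo")
//     .then(() => waitFor(() => m3u8Url, { timeoutMs: 15000 }))
//     .then((url) => nightmare.end().then(() => url));
```

With this approach the browser is closed as soon as the playlist shows up, but the failure path also needs a call to nightmare.end(), otherwise the Electron process stays open after a timeout.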
Additional Notes
- If the site offers multiple quality variants (e.g., 720p, 1080p), you might capture several .m3u8 URLs. Refine the regex to target specific streams (e.g., /1080p.*\.m3u8$/i).
- .m3u8 is a streaming playlist, not a direct MP4 file. If you need a downloadable MP4, use a tool like ffmpeg to convert the stream:
  ffmpeg -i "your-captured-m3u8-url" -c copy output.mp4
- Always ensure you have permission to scrape and access the video content. Respect the site's terms of service and robots.txt rules.
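When several variants are on offer, the master playlist itself lists them, so an alternative to refining the regex is to download the master .m3u8 and choose from its entries. A minimal sketch, assuming the captured URL is a master playlist whose variants are declared with #EXT-X-STREAM-INF tags. parseVariants and pickHighestBandwidth are hypothetical helpers, and fetching the playlist text is left to whatever HTTP client you already use.

```javascript
// Hypothetical helper: list the variants declared in an HLS master playlist.
// `playlistText` is the body of the master .m3u8; `baseUrl` is the URL it was fetched from.
function parseVariants(playlistText, baseUrl) {
  const TAG = "#EXT-X-STREAM-INF:";
  const lines = playlistText.split(/\r?\n/).map((line) => line.trim());
  const variants = [];
  for (let i = 0; i < lines.length; i++) {
    if (!lines[i].startsWith(TAG)) continue;
    const attrs = lines[i].slice(TAG.length);
    // Anchor on start-or-comma so AVERAGE-BANDWIDTH is not mistaken for BANDWIDTH.
    const bandwidth = /(?:^|,)BANDWIDTH=(\d+)/.exec(attrs);
    const resolution = /(?:^|,)RESOLUTION=(\d+)x(\d+)/.exec(attrs);
    // The variant URI is taken from the next line that is neither blank nor a tag.
    let j = i + 1;
    while (j < lines.length && (lines[j] === "" || lines[j].startsWith("#"))) j++;
    if (j >= lines.length) break;
    variants.push({
      bandwidth: bandwidth ? Number(bandwidth[1]) : 0,
      height: resolution ? Number(resolution[2]) : null,
      // Variant URIs are often relative to the master playlist's own URL.
      url: new URL(lines[j], baseUrl).href,
    });
  }
  return variants;
}

// Return the variant with the largest declared bandwidth, or null for an empty list.
function pickHighestBandwidth(variants) {
  return variants.reduce(
    (best, v) => (best === null || v.bandwidth > best.bandwidth ? v : best),
    null
  );
}
```

The URL returned by pickHighestBandwidth can then be stored in VidLinks or passed to ffmpeg in place of the master playlist URL.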
The question comes from Stack Exchange; it was asked by Venkat Lohith Dasari.




