How can I use Python to run the scripts in an HTML document and get the HTML as rendered in the browser DOM?

How to Execute HTML Scripts in Python and Get the Rendered DOM

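Yes. You can run the JavaScript embedded in an HTML document and retrieve the modified DOM (what you would see in your browser's DOM inspector) without driving a real browser through an automation tool such as Selenium. The options below combine Python with a JavaScript engine. Options 1 and 3 rely on a hand-written mock of the DOM, so they only reproduce the changes the mock knows about; Option 2 uses a real DOM implementation.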

Option 1: Use PyMiniRacer (Lightweight V8 Binding)

PyMiniRacer is a slim Python wrapper around Google's V8 JavaScript engine. It suits simple DOM manipulations and needs nothing beyond the Python package. V8 itself provides no DOM, however, so you have to mock every part of document that your script touches.

Steps:

  1. Install the package:
pip install py-mini-racer
  2. Simulate a basic DOM environment, execute the script, and update your HTML:
from py_mini_racer import py_mini_racer
import re

# Your raw HTML with embedded script
raw_html = '''<!DOCTYPE html> <html> <body> <h1>The script element</h1> <p id="demo"></p> <script> document.getElementById("demo").innerHTML = "Hello JavaScript!"; </script> </body> </html>'''

# Initialize the JS engine context
ctx = py_mini_racer.MiniRacer()

# Mock a minimal DOM environment (add more APIs if your script needs them)
ctx.eval("""
var document = {
  elements: {},
  getElementById: function(id) {
    if (!this.elements[id]) {
      this.elements[id] = { innerHTML: '' };
    }
    return this.elements[id];
  }
};
""")

# Extract the first inline script from the HTML; the pattern also accepts
# attributes such as type="text/javascript" (use BeautifulSoup for complex cases)
script_content = re.findall(r'<script\b[^>]*>(.*?)</script>', raw_html, re.DOTALL | re.IGNORECASE)[0].strip()
ctx.eval(script_content)

# Fetch the modified content and update the original HTML
updated_demo = ctx.eval("document.getElementById('demo').innerHTML")
final_html = raw_html.replace('<p id="demo"></p>', f'<p id="demo">{updated_demo}</p>')

print(final_html)

Note: This works best for straightforward DOM changes. If your script relies on more complex browser APIs (like querySelector, event listeners, or BOM objects), you'll need to extend the mock DOM environment manually.
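
The regex-based extraction above is fragile once a page contains several scripts or external ones. As a rough sketch of an alternative, the standard library's html.parser can collect every inline script body in document order. The names InlineScriptCollector and extract_inline_scripts are made up for this illustration and are not part of PyMiniRacer.

```python
from html.parser import HTMLParser


class InlineScriptCollector(HTMLParser):
    """Collects the text of inline <script> elements (those without a src attribute)."""

    def __init__(self):
        super().__init__()
        self.scripts = []     # finished script bodies, in document order
        self._buffer = None   # text chunks while inside an inline <script>

    def handle_starttag(self, tag, attrs):
        # External scripts (<script src=...>) have no inline body to run.
        if tag == "script" and "src" not in dict(attrs):
            self._buffer = []

    def handle_data(self, data):
        # Script text may arrive in several chunks, so accumulate it.
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            self.scripts.append("".join(self._buffer).strip())
            self._buffer = None


def extract_inline_scripts(html):
    """Return the bodies of all inline scripts found in `html`."""
    collector = InlineScriptCollector()
    collector.feed(html)
    collector.close()
    return collector.scripts
```

Each returned string could then be passed to ctx.eval in turn. External scripts are skipped here; fetching and running them would be a separate step.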

Option 2: Combine Python with Node.js & jsdom (Full DOM Support)

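For pages with complex scripts that depend on a browser-like DOM environment, the jsdom library for Node.js is the most reliable approach. jsdom is a JavaScript implementation of a large part of the DOM and HTML standards, and it can run a page's scripts, including ones that finish asynchronously. It does not perform layout or visual rendering, so scripts that depend on element sizes or positions may behave differently than in a browser.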

Steps:

  1. Install Node.js, then install jsdom via npm:
npm install jsdom
  2. Create a Node.js script (e.g., render_dom.js) to handle DOM rendering:
const { JSDOM } = require('jsdom');
const rawHtml = process.argv[2];

// Initialize JSDOM and run embedded scripts
const dom = new JSDOM(rawHtml, { runScripts: "dangerously" });

// Wait for any async scripts to finish (adjust timeout as needed)
setTimeout(() => {
  console.log(dom.serialize());
}, 100);
  3. Call this script from Python to get the rendered HTML:
import subprocess

raw_html = '''<!DOCTYPE html> <html> <body> <h1>The script element</h1> <p id="demo"></p> <script> document.getElementById("demo").innerHTML = "Hello JavaScript!"; </script> </body> </html>'''

# Execute the Node.js script and capture output
result = subprocess.run(
    ['node', 'render_dom.js', raw_html],
    capture_output=True,
    text=True
)

final_rendered_html = result.stdout.strip()
print(final_rendered_html)

Note: The runScripts: "dangerously" option is required to run embedded scripts. Only use it with HTML from trusted sources: the scripts execute inside your Node.js process, and jsdom's sandbox is not a security boundary.
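
The subprocess call above silently returns an empty string if Node.js is missing or the script crashes. A minimal sketch of a more defensive wrapper follows; run_renderer and its timeout parameter are hypothetical names chosen for this example, and the command prefix ['node', 'render_dom.js'] is assumed to match the script from step 2.

```python
import subprocess


def run_renderer(command, raw_html, timeout=10):
    """Run an external renderer, passing the HTML as its last argument.

    `command` is the argv prefix, e.g. ['node', 'render_dom.js'].
    Returns the renderer's stdout; raises RuntimeError on any failure.
    """
    try:
        result = subprocess.run(
            [*command, raw_html],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except FileNotFoundError as exc:
        # The executable (e.g. node) is not installed or not on PATH.
        raise RuntimeError(f"renderer not found: {command[0]}") from exc
    except subprocess.TimeoutExpired as exc:
        raise RuntimeError(f"renderer timed out after {timeout}s") from exc
    if result.returncode != 0:
        raise RuntimeError(f"renderer failed: {result.stderr.strip()}")
    return result.stdout.strip()
```

With this helper the call becomes final_rendered_html = run_renderer(['node', 'render_dom.js'], raw_html). Very large documents can exceed the operating system's limit on command-line length; in that case, having the Node.js script read the HTML from standard input is one possible workaround.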

Option 3: Use PyV8 (Legacy V8 Binding)

PyV8 is an older Python wrapper for the V8 engine. It is no longer actively maintained and can be difficult to install on current Python versions, so treat the example below as a reference for legacy setups.

Example Code:

import PyV8
import re

raw_html = '''<!DOCTYPE html> <html> <body> <h1>The script element</h1> <p id="demo"></p> <script> document.getElementById("demo").innerHTML = "Hello JavaScript!"; </script> </body> </html>'''

# Create and enter a V8 context
ctxt = PyV8.JSContext()
ctxt.enter()

# Mock basic DOM functionality
ctxt.eval("""
var document = {
  elements: {},
  getElementById: function(id) {
    if (!this.elements[id]) {
      this.elements[id] = { innerHTML: '' };
    }
    return this.elements[id];
  }
};
""")

# Extract and run the first inline script (the pattern also accepts attributes on the tag)
script_content = re.findall(r'<script\b[^>]*>(.*?)</script>', raw_html, re.DOTALL | re.IGNORECASE)[0].strip()
ctxt.eval(script_content)

# Update the HTML with modified content
updated_content = ctxt.eval("document.getElementById('demo').innerHTML")
final_html = raw_html.replace('<p id="demo"></p>', f'<p id="demo">{updated_content}</p>')

print(final_html)
ctxt.leave()
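
Both V8-based options end by patching one hard-coded element back into the HTML with str.replace. As an illustrative sketch only, the helper below generalizes that step to a mapping of element id to new inner HTML. Such a mapping could be built, for example, by JSON-serializing the mock's document.elements object. The function name apply_inner_html is invented for this example, and the regex assumes double-quoted id attributes and no nested element with the same tag name inside the target.

```python
import re


def apply_inner_html(html, updates):
    """Replace the content of each element whose id is a key in `updates`.

    `updates` maps element id -> new inner HTML. Only the first element
    with a given id is changed.
    """
    for element_id, new_content in updates.items():
        pattern = re.compile(
            r'(<(\w+)\b[^>]*\sid="' + re.escape(element_id) + r'"[^>]*>)'  # opening tag
            r'.*?'                                                          # old content
            r'(</\2\s*>)',                                                  # matching close
            re.DOTALL,
        )
        # A function replacement keeps backslashes in new_content literal.
        html = pattern.sub(
            lambda m: m.group(1) + new_content + m.group(3), html, count=1
        )
    return html
```

For anything beyond small, well-formed snippets, a real HTML parser or the jsdom approach from Option 2 is the safer choice.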

Quick Recommendation

  • For simple scripts: Go with PyMiniRacer (fast, lightweight)
  • For complex pages needing full browser DOM support: Use Python + Node.js + jsdom (most reliable)
  • Avoid PyV8 unless you have specific legacy needs

The question originates from Stack Exchange; it was asked by AKedzierski.