
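How to open a web page in a browser from Python and extract the loaded HTML (without bs4 or similar frameworks)

A solution for extracting the loaded HTML by driving a browser, without scraping libraries

To be clear up front: I know that bs4, scrapy, selenium, and similar tools can all do this, but for various reasons I don't want to use them. I want to use Python to open a web page in a browser such as Chrome, IE, or Firefox, and then extract the page's HTML once that browser has finished loading it. I have the following snippet and need to fill in the part that extracts the HTML: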

import webbrowser
import time
class ScreenCapture:
    url = 'https://www.google.com/'
    webbrowser.get("C:/Program Files (x86)/Google/Chrome/Application/chrome.exe %s").open(url)
    # get html from browser here

Hey there! Let's break this down. The webbrowser module you're starting with only launches browsers. Its open() call returns a success flag rather than a handle to the page, so it gives you no way to interact with the browser or read page content after it opens. That means we need a different approach that still avoids the scraping tools you want to skip. Here are two options:

Option 1: Chrome via Remote Debugging Protocol (No Selenium Needed)

Chrome has a built-in remote debugging feature. When started with --remote-debugging-port, it lists its open tabs over HTTP and accepts DevTools Protocol commands over a WebSocket for each tab. We can use this to launch Chrome, wait for the page to load, and pull the full HTML.
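To make the protocol concrete, here is a minimal sketch of the message shapes involved, assuming the JSON format of the DevTools Protocol's Runtime.evaluate method. The helper names build_evaluate_command and extract_value are made up for illustration; they only build and parse messages and never talk to a browser.

```python
import json


def build_evaluate_command(expression, message_id=1):
    """Serialize a Runtime.evaluate command (hypothetical helper)."""
    return json.dumps({
        "id": message_id,
        "method": "Runtime.evaluate",
        "params": {"expression": expression, "returnByValue": True},
    })


def extract_value(raw_reply, message_id=1):
    """Return the evaluated value from a reply, or raise if the call failed."""
    reply = json.loads(raw_reply)
    if reply.get("id") != message_id:
        raise ValueError("reply does not match the command id")
    if "error" in reply:
        # Protocol-level failure, e.g. an unknown method name
        raise RuntimeError(reply["error"].get("message", "protocol error"))
    result = reply["result"]
    if "exceptionDetails" in result:
        # The JavaScript expression itself threw an exception
        raise RuntimeError("JavaScript evaluation failed")
    return result["result"].get("value")
```

The id field is what ties a reply to its command, which matters because the browser may also send event messages on the same connection.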

Steps & Code:

  1. First install the two dependencies we'll need: pip install requests websocket-client
  2. Here's the updated class:
import json
import subprocess
import time

import requests
import websocket  # installed by the websocket-client package

class ScreenCapture:
    def get_chrome_html(self, url='https://www.google.com/'):
        # Define Chrome path and launch arguments with remote debugging enabled
        chrome_path = "C:/Program Files (x86)/Google/Chrome/Application/chrome.exe"
        chrome_args = [
            chrome_path,
            '--remote-debugging-port=9222',
            '--user-data-dir=C:/Temp/ChromeDebugProfile',  # Temp profile to avoid conflicts with your main Chrome
            url
        ]

        # Launch Chrome
        subprocess.Popen(chrome_args)
        # Wait for page to load (adjust sleep time based on your internet speed)
        time.sleep(5)

        # Fetch the list of open tabs from Chrome's debug endpoint (plain HTTP)
        resp = requests.get('http://localhost:9222/json')
        resp.raise_for_status()
        pages = resp.json()
        # Find the tab showing our URL; each entry carries its own WebSocket address
        target_page = next(
            (page for page in pages
             if page.get('type') == 'page' and url in page.get('url', '')),
            None
        )
        if target_page is None:
            raise RuntimeError('No open tab matches ' + url)

        # DevTools Protocol commands travel over the tab's WebSocket, not HTTP.
        # suppress_origin=True omits the Origin header, which recent Chrome
        # versions reject unless --remote-allow-origins is also passed.
        ws = websocket.create_connection(
            target_page['webSocketDebuggerUrl'], suppress_origin=True
        )
        try:
            # Evaluate JavaScript in the page to get the full rendered HTML
            ws.send(json.dumps({
                "id": 1,
                "method": "Runtime.evaluate",
                "params": {
                    "expression": "document.documentElement.outerHTML"
                }
            }))
            # Skip any other messages until the reply to command 1 arrives
            while True:
                message = json.loads(ws.recv())
                if message.get('id') == 1:
                    break
            # Extract the HTML content from the response
            return message['result']['result']['value']
        finally:
            ws.close()
            # Optional: Uncomment below to close Chrome after fetching HTML.
            # Careful: this kills every chrome.exe process, including your main browser.
            # subprocess.call(['taskkill', '/F', '/IM', 'chrome.exe'])

# Test it out
sc = ScreenCapture()
page_html = sc.get_chrome_html()
print(page_html[:500])  # Print first 500 chars to verify
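The fixed time.sleep(5) is a guess: too short on a slow connection, wasted time on a fast one. One alternative is to poll the page's document.readyState until it reports "complete". Here is a minimal sketch, assuming an evaluate callable (hypothetical, not part of any library) that runs a JavaScript expression in the page and returns its value, for example by sending Runtime.evaluate:

```python
import time


def wait_for_ready_state(evaluate, timeout=30.0, interval=0.25):
    """Poll until the page reports readyState 'complete'.

    `evaluate` is a hypothetical callable: it takes a JavaScript
    expression as a string and returns the evaluated value.
    Returns True when the page is loaded, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if evaluate("document.readyState") == "complete":
            return True
        time.sleep(interval)
    return False
```

Keep in mind that "complete" only means the load event has fired. Pages that keep rendering with JavaScript afterwards may still need an extra wait or a check for a specific element.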

Option 2: Internet Explorer via Windows COM (Windows Only)

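If you're on Windows, you can use the pywin32 library to control IE directly through its COM interface. This lets us launch IE, wait for the page to load, and pull the HTML without any scraping tools. Keep in mind that Microsoft retired the IE desktop application in 2022, so this only applies to machines where IE is still available.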

Steps & Code:

  1. Install the dependency: pip install pywin32
  2. Updated class code:
import win32com.client
import time

class ScreenCapture:
    def get_ie_html(self, url='https://www.google.com/'):
        # Launch IE instance
        ie = win32com.client.Dispatch("InternetExplorer.Application")
        ie.Visible = True  # Set to False if you don't want to see the browser window
        ie.Navigate(url)
        
        # Wait until the page finishes loading
        while ie.Busy or ie.ReadyState != 4:
            time.sleep(1)
        
        # Grab the full HTML of the loaded page
        full_html = ie.Document.documentElement.outerHTML
        
        # Optional: Uncomment below to close IE after fetching
        # ie.Quit()
        
        return full_html

# Test it out
sc = ScreenCapture()
page_html = sc.get_ie_html()
print(page_html[:500])
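The while loop above never gives up, so a page that stalls will hang the script. A sketch of the same wait with a deadline follows. wait_for_ie is a hypothetical helper name, and it relies only on the Busy and ReadyState properties already used above.

```python
import time

READYSTATE_COMPLETE = 4  # the value the loop above compares ReadyState against


def wait_for_ie(ie, timeout=30.0, interval=0.5):
    """Block until the IE object finishes loading, or raise TimeoutError."""
    deadline = time.monotonic() + timeout
    while ie.Busy or ie.ReadyState != READYSTATE_COMPLETE:
        if time.monotonic() >= deadline:
            raise TimeoutError(f"page not loaded after {timeout} seconds")
        time.sleep(interval)
```

In get_ie_html, a call such as wait_for_ie(ie) could stand in for the while loop, with the caller deciding whether to retry or quit IE when the timeout is raised.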

Quick Notes:

  • For Firefox, similar remote debugging exists, but the setup is a bit more involved—let me know if you need that specifically!
  • The webbrowser module can't be repurposed to get HTML, since it doesn't support any browser interaction beyond launching.

This question comes from Stack Exchange; asked by Alex D.
