Removing Unicode-Escaped HTML Tags in Python

Ahua AIGC Lab (阿华AIGC实验室)

2026-4-27

Handling Unicode-Escaped HTML Tags in Python

The core issue is that libraries like BeautifulSoup and Bleach are designed to strip actual HTML tags (delimited by < and >), not their escaped equivalents such as &#60; (decimal) or &#x3C; (hexadecimal). These numeric character references are treated as plain text by default, so the libraries never recognize them as tags to remove.

Here's a reliable, two-step solution using Python's standard library plus your existing tools:

Step 1: Decode Unicode-Escaped HTML Entities

First, convert all escaped characters back to their raw HTML equivalents. Python's built-in html module has a function for exactly this: html.unescape(). It handles both decimal and hexadecimal numeric references (as well as named entities such as &lt;), turning &#60;div&#62; into <div>.
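
As a quick check of the decoding step on its own, here is a minimal sketch using only the standard library. The sample strings are made up for illustration, and the double-escaped case is an assumption about what some inputs might look like, not something stated in the question.

```python
import html

# Illustrative inputs: the same tag written three different ways.
samples = {
    "decimal": "&#60;div&#62;",
    "hexadecimal": "&#x3C;div&#x3E;",
    "named": "&lt;div&gt;",
}

# html.unescape() resolves all three forms to the same raw markup.
decoded = {kind: html.unescape(text) for kind, text in samples.items()}
for kind, text in decoded.items():
    print(f"{kind}: {text}")

# Assumed edge case: input that was escaped twice needs two passes.
double_escaped = "&amp;#60;p&amp;#62;"
once = html.unescape(double_escaped)   # still escaped: &#60;p&#62;
twice = html.unescape(once)            # raw markup: <p>
print(once, twice)
```

If your data turns out to be escaped more than once, one pass leaves entity text behind, so it may be worth inspecting a sample before deciding how many passes to apply.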

Step 2: Extract Clean Text with BeautifulSoup (or Bleach)

Once you've decoded the string into proper HTML, you can use your familiar tools to strip tags and extract core content.

Example with BeautifulSoup

import html
from bs4 import BeautifulSoup

# Your input string with Unicode-escaped tags
raw_input = '"overview":"&#60;p&#62;WTS/VDI macOS.....&#60;/p&#62;"'

# Decode the escaped entities
decoded_html = html.unescape(raw_input)

# Parse and extract clean text
soup = BeautifulSoup(decoded_html, "html.parser")
clean_content = soup.get_text()

print(clean_content)
# Output: "overview":"WTS/VDI macOS....."

Example with Bleach

If you prefer using Bleach instead, the workflow is almost identical:

import html
import bleach

raw_input = '"overview":"&#60;p&#62;WTS/VDI macOS.....&#60;/p&#62;"'
decoded_html = html.unescape(raw_input)

# Strip all tags completely.
# The styles argument was removed in Bleach 5.0, so it is not passed here.
clean_content = bleach.clean(
    decoded_html,
    tags=[],
    attributes={},
    strip=True
)

print(clean_content)
# Same desired output as above

Why This Works

Escaped tags are just text representations of HTML characters; they aren't recognized as structural tags until you convert them back to < and >. By decoding first, you give BeautifulSoup or Bleach the properly formatted HTML they expect to process.
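
To make this concrete without any third-party dependency, the sketch below uses the standard library's html.parser as a stand-in for BeautifulSoup or Bleach. The TagRecorder class and inspect_markup function are hypothetical helpers written for this illustration. They record which start tags and which text the parser sees, before and after decoding.

```python
import html
from html.parser import HTMLParser


class TagRecorder(HTMLParser):
    """Hypothetical helper: records the start tags and text the parser sees."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.text.append(data)


def inspect_markup(markup):
    """Return (start tags seen, concatenated text) for a markup string."""
    parser = TagRecorder()
    parser.feed(markup)
    parser.close()  # flush any text the parser is still buffering
    return parser.tags, "".join(parser.text)


escaped = "&#60;p&#62;Hello&#60;/p&#62;"

# Without decoding, the parser finds no tags; the entities become literal text.
tags_before, text_before = inspect_markup(escaped)

# After decoding, the same parser sees a real <p> element.
tags_after, text_after = inspect_markup(html.unescape(escaped))

print(tags_before, repr(text_before))
print(tags_after, repr(text_after))
```

The first call reports an empty tag list and text that still contains the angle brackets, which is the symptom described in the question. The second call reports the p tag and only the inner text.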

A Note on Avoiding Regex

You might be tempted to use a regex such as <.*?> (or its escaped form, &#60;.*?&#62;) to match and remove tags directly, but this is risky. HTML can have nested tags, quoted attribute values containing a literal >, and other edge cases that a regex will fail to handle correctly. Sticking to proper HTML parsing libraries after decoding is far more robust.
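
One such edge case can be reproduced in a few lines. This is an illustrative sketch only: the tricky input string is made up, and TextOnly and strip_with_parser are hypothetical helpers built on the standard library's html.parser rather than on BeautifulSoup or Bleach.

```python
import re
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Hypothetical helper: keeps text nodes and drops every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_with_parser(markup):
    """Return the text content of markup using a real HTML parser."""
    parser = TextOnly()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


# An attribute value that legitimately contains a '>' character.
tricky = '<a title="a > b">link</a>'

# The non-greedy regex stops at the first '>', inside the attribute value.
regex_result = re.sub(r"<.*?>", "", tricky)

# The parser understands quoting, so the whole start tag is consumed.
parser_result = strip_with_parser(tricky)

print(repr(regex_result))   # ' b">link'
print(repr(parser_result))  # 'link'
```

The regex leaves part of the attribute behind in the output, while the parser returns only the link text.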

The question comes from Stack Exchange; it was asked by Vishnukk.