Removing Unicode-Encoded HTML Tags in Python
Great question! The core issue here is that libraries like BeautifulSoup and Bleach only treat literal < and > characters as tag delimiters, not their escaped equivalents such as &#60; (decimal) or &#x3C; (hexadecimal). These escaped sequences are read as plain text, so the libraries never see any tags to remove.
Here's a reliable, two-step solution using Python's standard library plus your existing tools:
Step 1: Decode Unicode-Escaped HTML Entities
First, convert all escaped characters back to their raw HTML equivalents. Python's built-in html module has a function for exactly this: html.unescape(). It handles decimal and hexadecimal escape sequences (as well as named ones like &lt;), turning things like &#60;div&#62; into <div>.
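To make Step 1 concrete, here is a minimal sketch of html.unescape() applied to the three common spellings of the same tag. The variable names are just illustrative.

```python
import html

# Three spellings of the same "<p>" tag: decimal, hexadecimal, and named references.
decimal_form = "&#60;p&#62;"
hex_form = "&#x3C;p&#x3E;"
named_form = "&lt;p&gt;"

for escaped in (decimal_form, hex_form, named_form):
    print(html.unescape(escaped))  # each iteration prints <p>

# Text without any character references passes through unchanged.
plain = html.unescape("WTS/VDI macOS")
```

Since all three forms decode to the same string, you don't need to know in advance which style of escaping your input uses.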
Step 2: Extract Clean Text with BeautifulSoup (or Bleach)
Once you've decoded the string into proper HTML, you can use your familiar tools to strip tags and extract core content.
Example with BeautifulSoup
    import html
    from bs4 import BeautifulSoup

    # Your input string with escaped tags
    raw_input = '"overview":"&#60;p&#62;WTS/VDI macOS.....&#60;/p&#62;"'

    # Decode the escaped entities into literal < and >
    decoded_html = html.unescape(raw_input)

    # Parse and extract clean text
    soup = BeautifulSoup(decoded_html, "html.parser")
    clean_content = soup.get_text()

    print(clean_content)
    # Output: "overview":"WTS/VDI macOS....."
Example with Bleach
If you prefer using Bleach instead, the workflow is almost identical:
    import html
    import bleach

    raw_input = '"overview":"&#60;p&#62;WTS/VDI macOS.....&#60;/p&#62;"'
    decoded_html = html.unescape(raw_input)

    # Strip all tags completely.
    # Note: the styles argument was removed in Bleach 5.0, so it is omitted here.
    clean_content = bleach.clean(
        decoded_html,
        tags=[],
        attributes={},
        strip=True,
    )

    print(clean_content)
    # Same desired output as above
Why This Works
Unicode-escaped tags are just text representations of HTML characters. A parser doesn't recognize them as structural tags until you convert them back to < and >. By decoding first, you're giving BeautifulSoup/Bleach the properly formatted HTML they expect to process.
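You can see this behavior without installing anything. The sketch below uses the standard library's html.parser.HTMLParser as a stand-in for BeautifulSoup (an assumption made only to keep the example self-contained). EventRecorder and parse are hypothetical helper names, not part of any library.

```python
import html
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical helper: records which tags and text the parser reports."""

    def __init__(self):
        super().__init__()  # convert_charrefs defaults to True
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.text.append(data)


def parse(markup):
    """Feed markup to a fresh recorder and return (tags seen, joined text)."""
    recorder = EventRecorder()
    recorder.feed(markup)
    recorder.close()
    return recorder.tags, "".join(recorder.text)


escaped = "&#60;p&#62;WTS/VDI macOS&#60;/p&#62;"

# Without decoding: no tags are reported, and "<p>" ends up inside the text.
tags_before, text_before = parse(escaped)

# After decoding: the parser reports a real <p> tag and clean text.
tags_after, text_after = parse(html.unescape(escaped))
```

Before decoding, the parser finds no tags at all and the markup leaks into the extracted text. After decoding, it reports the p tag and the text comes out clean.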
A Note on Avoiding Regex
You might be tempted to use regex to directly match and remove <.*?> sequences, but this is risky. HTML can have nested tags, escaped quotes, or edge cases that regex will fail to handle correctly. Sticking to proper HTML parsing libraries after decoding is far more robust.
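Here is one small case where the regex goes wrong, sketched with the standard library only. TextExtractor is a hypothetical stand-in for a real parsing library such as BeautifulSoup, and the tricky input is made up for illustration.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical stand-in for a real parsing library: keeps only text nodes."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags_with_parser(markup):
    extractor = TextExtractor()
    extractor.feed(markup)
    extractor.close()
    return "".join(extractor.parts)


def strip_tags_with_regex(markup):
    return re.sub(r"<.*?>", "", markup)


# A ">" inside a quoted attribute value ends the regex match too early.
tricky = '<a title="1 > 0">link</a>'

regex_result = strip_tags_with_regex(tricky)    # leaves attribute debris behind
parser_result = strip_tags_with_parser(tricky)  # keeps only the text node

print(repr(regex_result))
print(repr(parser_result))
```

The regex stops at the first > it finds, which sits inside the title attribute, so part of the attribute survives in the output. The parser understands quoted attribute values and returns only the link text.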
The question comes from Stack Exchange; asked by Vishnukk.




