How to parse and extract text from malformed, badly encoded XML documents with Python 3 / Python 2

阿华AIGC实验室

2026-5-22

Hey there! Let's work through your XML parsing challenges together. Messy encoding, broken structure, and trouble pulling out specific tags with ElementTree are all common pain points, so here are some practical fixes:

1. Handle the ISO-8859-1 Encoding First

ElementTree can be finicky with non-UTF-8 encodings if you don’t read the file correctly. Start by explicitly reading the file with the right encoding to avoid garbled text:

# Read the raw XML with ISO-8859-1 encoding
# (on Python 2, use io.open() instead; the built-in open() has no encoding argument there)
with open("your_input.xml", "r", encoding="ISO-8859-1") as xml_file:
    xml_raw = xml_file.read()

# Optional: Convert to UTF-8 for easier downstream handling (recommended)
with open("converted_utf8.xml", "w", encoding="utf-8") as converted_file:
    converted_file.write(xml_raw)
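
One thing to watch with the conversion above: if the file starts with an XML declaration that names ISO-8859-1, the UTF-8 copy still carries the old declaration, and a parser reading the bytes will trust it. Here is a minimal sketch of one way to handle that, assuming the declaration (if present) sits at the very start of the file. The helper name to_utf8_xml and the sample data are made up for illustration:

```python
import re
import xml.etree.ElementTree as ET

def to_utf8_xml(raw_bytes, source_encoding="ISO-8859-1"):
    """Decode raw XML bytes and return UTF-8 bytes with a matching declaration."""
    text = raw_bytes.decode(source_encoding)
    # Drop an existing XML declaration, if any, so it cannot contradict the new encoding
    text = re.sub(r"^\s*<\?xml[^>]*\?>", "", text, count=1)
    return b'<?xml version="1.0" encoding="UTF-8"?>\n' + text.lstrip().encode("utf-8")

# Hypothetical sample: an ISO-8859-1 document containing a non-ASCII character
sample = '<?xml version="1.0" encoding="ISO-8859-1"?><note>caf\u00e9</note>'.encode("ISO-8859-1")
converted = to_utf8_xml(sample)
root = ET.fromstring(converted)  # bytes input: the parser follows the declaration
print(root.text)
```

Passing bytes (not an already decoded string) to the parser lets it use the declaration, which is why the declaration and the actual byte encoding should agree.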

2. Fix Malformed XML Structure

If your XML has unclosed tags, mismatched elements, or other structural errors, ElementTree raises a ParseError at the first problem it hits. The easiest fix is lxml’s recovering parser, which can work around many common mistakes:

from lxml import etree

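# Use lxml's recovery mode to parse broken XML
parser = etree.XMLParser(recover=True, encoding="ISO-8859-1")
# fromstring() returns the root element; in recovery mode it can be None if nothing was salvageable
tree = etree.fromstring(xml_raw.encode("ISO-8859-1"), parser=parser)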

If you can’t install lxml, you can use simple regex to patch obvious issues (like unclosed tags) before feeding to ElementTree:

import re

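# Quick fix for unclosed tags (adjust regex based on your specific issues).
# re.DOTALL lets the lookahead search later lines for the closing tag; without it,
# any element whose closing tag sits on another line would be wrongly "fixed".
# Only matches attribute-free opening tags such as <name>.
xml_fixed = re.sub(r"<(\w+)>(?!.*</\1>)", r"<\1></\1>", xml_raw, flags=re.DOTALL)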
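
Before patching with regex, it can help to see where the parser actually gives up, so you know which tag to repair. A rough sketch using only the standard library; locate_parse_error is a hypothetical helper and broken is a made-up sample:

```python
import xml.etree.ElementTree as ET

def locate_parse_error(xml_text):
    """Return (line, column, message) for the first parse error, or None if the text parses."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        line, column = err.position  # position as reported by the underlying expat parser
        return line, column, str(err)
    return None

# Made-up sample: the second <item> is never closed
broken = "<items>\n  <item>one</item>\n  <item>two\n</items>"
problem = locate_parse_error(broken)
if problem:
    line, column, message = problem
    print("Parse failed at line %d, column %d: %s" % (line, column, message))
```

The reported position is where the parser noticed the problem, which is often a little after the real mistake (here, at the closing tag of the parent).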

3. Extract Specific Tags Without Fighting Noise

ElementTree’s basic find methods can feel limited when dealing with cluttered XML. Switch to XPath for precise targeting: lxml supports full XPath 1.0, while ElementTree supports only a limited subset (paths, `.//`, and simple attribute or child predicates):

With lxml (more powerful XPath support):

# Grab all text inside <your_target_tag> elements, no matter where they are
target_texts = tree.xpath("//your_target_tag/text()")
for text in target_texts:
    print(text.strip())

With standard ElementTree:

import xml.etree.ElementTree as ET

# Parse the fixed XML content
root = ET.fromstring(xml_fixed)
# Find all target tags (use .// to search all nested levels)
for tag in root.findall(".//your_target_tag"):
    if tag.text:
        print(tag.text.strip())
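
ElementTree's path subset also accepts simple predicates, which helps when the same tag name shows up in several places and you only want some of them. A small sketch; the tag names, attribute names, and sample data are placeholders:

```python
import xml.etree.ElementTree as ET

sample = """
<catalog>
  <book lang="en"><title>First</title></book>
  <book lang="de"><title>Zweites</title></book>
  <magazine><title>Ignored</title></magazine>
</catalog>
"""

root = ET.fromstring(sample)
# Only <title> elements that sit directly under a <book> with lang="en"
english_titles = [t.text for t in root.findall(".//book[@lang='en']/title")]
# iter() walks every descendant with the given tag, regardless of its parent
all_titles = [t.text for t in root.iter("title")]
print(english_titles)
print(all_titles)
```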

Alternatively, if the XML is really messy, BeautifulSoup works surprisingly well for XML extraction too (note that its "xml" parser relies on lxml being installed):

from bs4 import BeautifulSoup

soup = BeautifulSoup(xml_raw, "xml")
target_tags = soup.find_all("your_target_tag")
for tag in target_tags:
    print(tag.get_text(strip=True))

A quick note: If the XML structure is severely broken (like a missing root element), you might need to add a wrapper root tag yourself before parsing, for example by wrapping the entire content in <root>...</root>.
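
The wrapper idea can be sketched like this, assuming the only thing standing in the way is the missing root and possibly a leading XML declaration. The helper name parse_fragments and the sample input are hypothetical:

```python
import re
import xml.etree.ElementTree as ET

def parse_fragments(xml_text, wrapper="root"):
    """Wrap rootless XML fragments in a single element and parse the result."""
    # An XML declaration is only legal at the very start, so remove it before wrapping
    body = re.sub(r"^\s*<\?xml[^>]*\?>", "", xml_text, count=1)
    return ET.fromstring("<%s>%s</%s>" % (wrapper, body, wrapper))

# Made-up input: two sibling elements with no common root
fragments = '<?xml version="1.0"?>\n<entry>a</entry>\n<entry>b</entry>'
root = parse_fragments(fragments)
print([e.text for e in root.findall("entry")])
```

If the fragments themselves are malformed, this still raises ParseError, so it only addresses the missing-root case.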

The question comes from Stack Exchange, asked by user9608799.