How to parse and extract text from malformed, badly encoded XML documents in Python 3/Python 2
Hey there! Let's work through your XML parsing challenges together—messy encoding, broken structure, and struggling to pull out specific tags with ElementTree are super common pain points, so I’ve got some practical fixes for you:
ElementTree can be finicky with non-UTF-8 encodings if you don’t read the file correctly. Start by explicitly reading the file with the right encoding to avoid garbled text:
    # Read the raw XML with ISO-8859-1 encoding
    with open("your_input.xml", "r", encoding="ISO-8859-1") as xml_file:
        xml_raw = xml_file.read()

    # Optional: Convert to UTF-8 for easier downstream handling (recommended).
    # If the file starts with an XML declaration that names ISO-8859-1, update
    # that declaration too, or parsers will misread the converted file.
    with open("converted_utf8.xml", "w", encoding="utf-8") as converted_file:
        converted_file.write(xml_raw)
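Since the question covers Python 2 as well, note that the built-in open() only accepts an encoding argument on Python 3. A minimal sketch that should behave the same on both versions uses io.open instead. The helper name read_xml_text and the temporary demo file are assumptions made for illustration, not part of the original answer:

```python
import io
import os
import tempfile


def read_xml_text(path, encoding="ISO-8859-1"):
    """Return the file contents decoded to text (hypothetical helper).

    io.open accepts an encoding argument on both Python 2 and Python 3.
    """
    with io.open(path, "r", encoding=encoding) as xml_file:
        return xml_file.read()


# Demo: write a small ISO-8859-1 file, then read it back as text.
fd, demo_path = tempfile.mkstemp(suffix=".xml")
with os.fdopen(fd, "wb") as handle:
    handle.write(u"<note>caf\u00e9</note>".encode("ISO-8859-1"))

xml_text = read_xml_text(demo_path)
os.remove(demo_path)
```

If the real encoding is unknown, decoding as ISO-8859-1 never raises an error, but it can silently produce the wrong characters, so it is worth checking a few accented words by eye.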
If your XML has unclosed tags, mismatched elements, or other structural errors, ElementTree raises a ParseError at the first problem it hits. The easiest fix here is lxml's forgiving parser, which can recover from many common mistakes:
    from lxml import etree

    # Use lxml's recovery mode to parse broken XML
    parser = etree.XMLParser(recover=True, encoding="ISO-8859-1")
    tree = etree.fromstring(xml_raw.encode("ISO-8859-1"), parser=parser)
If you can’t install lxml, you can use simple regex to patch obvious issues (like unclosed tags) before feeding to ElementTree:
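    import re

    # Quick fix for unclosed tags (adjust regex based on your specific issues).
    # re.DOTALL lets ".*" span line breaks; without it, any tag whose closing
    # tag sits on a later line would be treated as unclosed.
    # Caveat: the tag is closed right away, so text that followed it ends up
    # outside the element.
    xml_fixed = re.sub(r"<(\w+)>(?!.*</\1>)", r"<\1></\1>", xml_raw, flags=re.DOTALL)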
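Regex patches like this are easy to get wrong, so it can help to check whether the patched text actually parses and, if not, where the parser gave up. The following is a rough sketch using only the standard library. The helper name try_parse and the sample strings are made up for illustration:

```python
import xml.etree.ElementTree as ET


def try_parse(xml_text):
    """Return (root, None) on success, or (None, (line, column, message)) on failure."""
    try:
        return ET.fromstring(xml_text), None
    except ET.ParseError as err:
        # ParseError.position is a (line, column) tuple pointing at the error.
        line, column = err.position
        return None, (line, column, str(err))


# The second <item> is never closed, so parsing fails.
broken = "<items><item>one</item><item>two</items>"
root, problem = try_parse(broken)
if problem is not None:
    print("parse failed at line %d, column %d: %s" % problem)

# A well-formed snippet parses normally.
good_root, good_problem = try_parse("<items><item>one</item></items>")
```

Running a check like this before and after each patch shows whether the patch fixed the error or merely moved it somewhere else.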
ElementTree's basic find methods can feel limited when dealing with cluttered XML. Switch to XPath for precise targeting: lxml implements full XPath 1.0, while the standard ElementTree supports only a limited subset through find(), findall(), and iterfind():
- With lxml (more powerful XPath support):
    # Grab all text inside <your_target_tag> elements, no matter where they are
    target_texts = tree.xpath("//your_target_tag/text()")
    for text in target_texts:
        print(text.strip())
- With standard ElementTree:
    import xml.etree.ElementTree as ET

    # Parse the fixed XML content
    root = ET.fromstring(xml_fixed)

    # Find all target tags (use .// to search all nested levels)
    for tag in root.findall(".//your_target_tag"):
        if tag.text:
            print(tag.text.strip())
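To give a feel for what the standard ElementTree path syntax can still do, here is a small self-contained sketch. The sample document and its tag and attribute names (catalog, book, title, lang) are invented for illustration:

```python
import xml.etree.ElementTree as ET

sample = """<catalog>
  <book lang="en"><title>First</title></book>
  <section><book lang="de"><title>Zweites</title></book></section>
</catalog>"""
root = ET.fromstring(sample)

# Attribute predicates are part of the subset ElementTree understands.
english_titles = [t.text for t in root.findall(".//book[@lang='en']/title")]

# findtext() returns a default value instead of None when nothing matches.
publisher = root.findtext(".//publisher", default="")

# itertext() gathers the text of an element and all of its descendants,
# which helps when the text is split across nested child tags.
all_titles = ["".join(book.itertext()).strip() for book in root.iter("book")]

print(english_titles)
print(all_titles)
```

Expressions such as //tag/text() or XPath function calls are not available here. If you need those, lxml is the more practical choice.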
Alternatively, if the XML is really messy, BeautifulSoup works surprisingly well for XML extraction too. Keep in mind that its "xml" parser relies on lxml being installed:
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(xml_raw, "xml")
    target_tags = soup.find_all("your_target_tag")
    for tag in target_tags:
        print(tag.get_text(strip=True))
A quick note: If the XML structure is severely broken (like missing root elements), you might need to manually add a wrapper root tag before parsing—something like wrapping the entire content in <root>...</root>.
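As a rough sketch of that wrapping step: an XML declaration is only allowed at the very start of a document, so it has to be stripped before the wrapper tag is added. The helper name wrap_in_root and the sample fragments are assumptions for illustration. A DOCTYPE, if present, would need similar treatment and is not handled here:

```python
import re
import xml.etree.ElementTree as ET


def wrap_in_root(xml_text, root_name="root"):
    """Wrap root-less XML fragments in a single element (hypothetical helper)."""
    # Drop a leading <?xml ...?> declaration, which may not appear inside an element.
    body = re.sub(r"^\s*<\?xml[^>]*\?>", "", xml_text)
    return "<{0}>{1}</{0}>".format(root_name, body)


# Two sibling elements with no common root: not well-formed on their own.
fragments = '<?xml version="1.0"?>\n<entry>a</entry>\n<entry>b</entry>'

root = ET.fromstring(wrap_in_root(fragments))
entries = [entry.text for entry in root.findall("entry")]
print(entries)
```

Pick a wrapper name that does not already occur in the data, so later searches cannot confuse it with a real element.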
The question comes from Stack Exchange; asked by user9608799.




