Python beginner question: how to strip HTML tags from a list of texts and fix the resulting error

阿华AIGC实验室

2026-5-26

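Fixing the feed() error when stripping HTML tags in Python

Hey, for a complete beginner, managing to extract a list of strings from blog posts is already great! Don't worry about this error; let's work through it step by step.

First, the most common cause of the feed() error is that the value passed in is not a string. You said you extracted a list of strings, but the list may also contain None, numbers, or other non-string elements that MLStripper's feed() cannot handle. It is also possible that the MLStripper class you copied has a small problem; for instance, older versions of this recipe skip super().__init__(), which typically fails on Python 3.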
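
Before changing anything else, it may help to confirm whether the list really contains non-string items. The sketch below is only an illustration: find_non_strings and sample_posts are made-up names, and the sample data stands in for your own list.

```python
def find_non_strings(items):
    """Return (index, type name) for every element that is not a str."""
    return [
        (i, type(item).__name__)
        for i, item in enumerate(items)
        if not isinstance(item, str)
    ]

# Hypothetical sample data standing in for your extracted posts
sample_posts = ["<p>Hello</p>", None, 42, "<b>Bye</b>"]
print(find_non_strings(sample_posts))  # [(1, 'NoneType'), (2, 'int')]
```

If this prints an empty list for your data, the input types are fine and the parser class itself is the more likely culprit.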

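Option 1: Fix the MLStripper implementation and validate the input type

First make sure your MLStripper class is correct, and check that each input is a string before processing it: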

from html.parser import HTMLParser

class MLStripper(HTMLParser):
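    def __init__(self):
        # super().__init__() already calls reset(); convert_charrefs=True
        # decodes entities such as &amp; before handle_data() sees the text
        super().__init__(convert_charrefs=True)
        # Text fragments collected between tags
        self.fed = []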
    
    def handle_data(self, d):
        self.fed.append(d)
    
    def get_data(self):
        return ''.join(self.fed)

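def strip_tags(html):
    # Return an empty string for None; convert other non-string values to str
    if html is None:
        return ""
    if not isinstance(html, str):
        html = str(html)
    s = MLStripper()
    s.feed(html)
    # close() flushes any text still buffered, e.g. a trailing "AT&T"
    s.close()
    return s.get_data()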

# Apply this function to your list of posts (your_post_list is your own variable)
cleaned_posts = [strip_tags(post) for post in your_post_list]
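
As a quick sanity check of this approach, here is a compact, self-contained sketch run on a made-up list. TextCollector, strip_tags_demo, and sample_posts are illustrative names chosen for this example, not part of any library, and the sample data is invented.

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collects the text found between tags (illustrative name)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_tags_demo(value):
    # None becomes an empty string; anything else is converted with str()
    if value is None:
        return ""
    parser = TextCollector()
    parser.feed(str(value))
    parser.close()  # flush any text that is still buffered
    return "".join(parser.parts)

# Hypothetical mixed input: HTML, None, a number, and an escaped entity
sample_posts = ["<p>Hello <b>world</b></p>", None, 42, "Tom &amp; Jerry"]
cleaned = [strip_tags_demo(p) for p in sample_posts]
print(cleaned)  # ['Hello world', '', '42', 'Tom & Jerry']
```

Creating a new parser for each item keeps text from one post from leaking into the next.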

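Option 2: Use BeautifulSoup, which is simpler (beginner-friendly)

If the HTMLParser approach feels convoluted, I recommend the BeautifulSoup library. It is more intuitive and more tolerant of malformed HTML:

First install the library by running pip install beautifulsoup4 in a terminal.
Then use the code below: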

from bs4 import BeautifulSoup

def strip_tags(html):
    if html is None:
        return ""
    if not isinstance(html, str):
        html = str(html)
    # Parse with the built-in html.parser backend and extract the plain text
    soup = BeautifulSoup(html, "html.parser")
    return soup.get_text()

# Process your list of posts (your_post_list is your own variable)
cleaned_posts = [strip_tags(post) for post in your_post_list]