Technical question: how to remove or separate UTF-8 encoded emoji from text

阿华AIGC实验室

2026-5-14

Hey, this comes up all the time! Here are a few practical ways to remove UTF-8 encoded emoji from text that you can use right away:

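Method 1: Remove emoji with a regular expression

Most emoji sit in a handful of Unicode code point ranges, so a regular expression can match those ranges and replace them with an empty string. This needs no extra libraries; the Python standard library is enough.

Example code: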

import re

# Decode the bytes object into a str first
raw_text = b'That new one I\xe2\x80\x99m Ikorodu is a masterpiece.Thanks for beautifying the landscape. \xf0\x9f\x91\x8d\xf0\x9f\x8f\xbdUnlike @jpoy that build banks like Prisons where human organs are harvested.'
text = raw_text.decode('utf-8')

# Character class covering common emoji code points. Keep the ranges narrow:
# a range such as \U000024C2-\U0001F251 or \U00010000-\U0010ffff would also
# strip CJK text and every other supplementary-plane character.
emoji_pattern = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs (includes skin-tone modifiers)
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # regional indicators (flags)
    "\U0001F900-\U0001F9FF"  # supplemental symbols & pictographs
    "\u2500-\u2BEF"          # box drawing through miscellaneous symbols and arrows
    "\u24C2\U0001F251"       # two individual enclosed characters
    "\u200d"                 # zero-width joiner
    "\u23cf\u23e9\u231a"     # eject symbol, fast-forward, watch
    "\ufe0f"                 # variation selector-16
    "\u3030"                 # wavy dash
    "]+"
)

clean_text = emoji_pattern.sub('', text)
print(clean_text)

Running it gives you the text with the emoji removed: That new one I’m Ikorodu is a masterpiece.Thanks for beautifying the landscape. Unlike @jpoy that build banks like Prisons where human organs are harvested.
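
The title also asks about separating emoji, not only deleting them. As a minimal sketch built on the same idea of matching code point ranges, `findall` can collect the emoji while `sub` produces the remaining text. The names `EMOJI_RUNS` and `split_emoji` and the short list of ranges are illustrative assumptions, not a complete definition of what counts as an emoji.

```python
import re

# Illustrative, deliberately narrow set of emoji ranges (an assumption, not exhaustive)
EMOJI_RUNS = re.compile(
    "["
    "\U0001F300-\U0001F5FF"  # symbols & pictographs, includes skin-tone modifiers
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F900-\U0001F9FF"  # supplemental symbols & pictographs
    "\u200d\ufe0f"           # zero-width joiner, variation selector-16
    "]+"
)

def split_emoji(text):
    """Return (text_without_emoji, list_of_emoji_runs)."""
    found = EMOJI_RUNS.findall(text)    # each item is one consecutive run of emoji
    cleaned = EMOJI_RUNS.sub("", text)  # the same text with those runs removed
    return cleaned, found

cleaned, found = split_emoji("landscape. \U0001F44D\U0001F3FDUnlike 中文")
print(repr(cleaned))
print(found)
```

Because the ranges are narrow, the CJK characters in the sample survive, and the thumbs-up plus its skin-tone modifier come back as a single run.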

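Method 2: Filter by character category with unicodedata

If you would rather not write a long regular expression, Python's built-in unicodedata module lets you filter by character category. Most emoji belong to the 'So' (Symbol, other) category, so we drop those and keep ordinary text characters (letters, digits, punctuation and so on).

Example code: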

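import unicodedata

raw_text = b'That new one I\xe2\x80\x99m Ikorodu is a masterpiece.Thanks for beautifying the landscape. \xf0\x9f\x91\x8d\xf0\x9f\x8f\xbdUnlike @jpoy that build banks like Prisons where human organs are harvested.'
text = raw_text.decode('utf-8')

# Skin-tone modifiers are category 'Sk', and the zero-width joiner and
# variation selector-16 are not 'So' either, so list them explicitly
EXTRA = set('\U0001F3FB\U0001F3FC\U0001F3FD\U0001F3FE\U0001F3FF\u200d\ufe0f')

# Drop 'So' (Symbol, other) characters plus the extras above
clean_text = ''.join(c for c in text if unicodedata.category(c) != 'So' and c not in EXTRA)
print(clean_text)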

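The code is shorter, but note two things. The 'So' category also covers non-emoji symbols such as ©, ® and °, so those are removed too; if your text contains symbols you need to keep, adjust which categories you filter. Conversely, some emoji-related code points are not in 'So' at all (skin-tone modifiers are category 'Sk', and the zero-width joiner and variation selectors are in other categories), so a check on 'So' alone does not catch them.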
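
To decide which categories to filter, it can help to look at what each character in a sample of your own data actually is. The following is a small sketch using only `unicodedata`; the helper name `describe_chars` and the sample string are made up for illustration.

```python
import unicodedata

def describe_chars(text):
    """Return (char, category, name) for every non-ASCII character in text."""
    return [
        (c, unicodedata.category(c), unicodedata.name(c, "<unnamed>"))
        for c in text
        if ord(c) > 127
    ]

# Sample: curly apostrophe, thumbs-up with a skin-tone modifier, copyright sign
sample = "I\u2019m here \U0001F44D\U0001F3FD \u00a9"
for char, category, name in describe_chars(sample):
    print(f"U+{ord(char):04X} {category} {name}")
```

For this sample the apostrophe is reported as 'Pf', the thumbs-up and the copyright sign as 'So', and the skin-tone modifier as 'Sk', which is why filtering on 'So' alone both removes © and leaves the modifier behind.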

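Method 3: Use the dedicated emoji library (least effort)

If you deal with emoji regularly, the third-party emoji library is more precise. It is built specifically for recognizing and processing emoji, so you do not have to maintain regex ranges yourself.

Install the library first:

pip install emoji

Then use it: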

import emoji

raw_text = b'That new one I\xe2\x80\x99m Ikorodu is a masterpiece.Thanks for beautifying the landscape. \xf0\x9f\x91\x8d\xf0\x9f\x8f\xbdUnlike @jpoy that build banks like Prisons where human organs are harvested.'
text = raw_text.decode('utf-8')

clean_text = emoji.replace_emoji(text, replace='')
print(clean_text)

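This is the lowest-effort option: the library recognizes emoji from the emoji data it ships with and replaces them, so you do not have to think about code point ranges. Coverage follows the data bundled with the version you have installed, so keep the package up to date if you need newly added emoji handled.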

The question comes from Stack Exchange; it was asked by Oluwatobi Shoyinka.