How to Remove Non-English Words in Python? Adapted for a Crypto Twitter Sentiment Analysis Project

阿华AIGC实验室

2026-5-21

Solving the non-English word removal problem in an NLP sentiment analysis project

Hey there! Let's tackle that non-English word removal problem for your crypto tweet sentiment analysis project. You've already got your cleaned CSV of crypto tweets and your core libraries imported, so let's dive into practical, effective ways to filter out non-English words right away:

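Method 1: Detect each word's language with `langdetect`

This method runs language detection on each individual word and keeps only the words detected as English. Keep in mind that langdetect is built for longer text, so per-word results can be noisy.

Install the required libraries first:
pip install langdetect chardet
Then write the processing function and apply it to the dataset: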

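import chardet
import pandas as pd
from langdetect import DetectorFactory, LangDetectException, detect

# langdetect is non-deterministic by default; fix the seed for repeatable results
DetectorFactory.seed = 0

# Read the CSV, using chardet to guess the file encoding
with open('your_crypto_tweets.csv', 'rb') as f:
    encoding_result = chardet.detect(f.read())
# chardet may return None for the encoding; fall back to UTF-8 in that case
encoding = encoding_result['encoding'] or 'utf-8'
df = pd.read_csv('your_crypto_tweets.csv', encoding=encoding)

# Remove every word that is not detected as English
def keep_only_english_words(text):
    if pd.isna(text):
        return ''
    word_list = text.split()
    english_words = []
    for word in word_list:
        try:
            # Keep only words detected as English
            if detect(word) == 'en':
                english_words.append(word)
        except LangDetectException:
            # Skip short or unusual tokens whose language cannot be detected
            continue
    return ' '.join(english_words)

# Apply to your text column (assumed here to be named 'tweet_content')
df['processed_tweet'] = df['tweet_content'].apply(keep_only_english_words)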

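Tip: langdetect is error-prone on very short words (1-2 letters, for example). You can add a length check, such as only running detection on words of length ≥ 3, to reduce misclassification.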
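The length check from the tip could look roughly like the sketch below. It is only an illustration under a few assumptions: the name `keep_english_words`, the `min_len` parameter, and the choice to keep short words when they are plain ASCII letters are all hypothetical, not part of the original answer. The detector is passed in as a function so the sketch runs without langdetect installed; `toy_detect` is a stand-in, not a real language detector.

```python
from typing import Callable


def keep_english_words(text: str, detect_fn: Callable[[str], str], min_len: int = 3) -> str:
    """Keep words that detect_fn labels 'en'; handle short words with a cheap check."""
    kept = []
    for word in text.split():
        if len(word) < min_len:
            # Assumption: detectors are unreliable on very short words, so fall
            # back to a plain ASCII-letter check instead of calling the detector.
            if word.isascii() and word.isalpha():
                kept.append(word)
            continue
        try:
            if detect_fn(word) == 'en':
                kept.append(word)
        except Exception:
            # Detectors may raise on tokens with no usable features; skip those.
            continue
    return ' '.join(kept)


# Stand-in detector for demonstration only: ASCII letters count as English.
def toy_detect(word: str) -> str:
    return 'en' if word.isascii() and word.isalpha() else 'other'


print(keep_english_words("btc is going to 月球 soon", toy_detect))
```

In the real pipeline you would presumably pass `langdetect.detect` as `detect_fn`, for example `df['tweet_content'].fillna('').apply(lambda t: keep_english_words(t, detect))`.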

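Method 2: Match against the NLTK English word list

If you don't need to handle mixed-language words and only want to drop words that are not in an English dictionary, this method is more efficient:

Install NLTK (pip install nltk), then download the word list: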

import nltk
nltk.download('words')
from nltk.corpus import words

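Write the filter function, and don't forget to add crypto terms:

# Load the default English word list, lowercased so it matches word.lower() below
english_vocab = {w.lower() for w in words.words()}
# Manually add crypto-specific terms so they are not removed by mistake
crypto_terms = {'bitcoin', 'btc', 'ethereum', 'eth', 'solana', 'sol', 'dogecoin', 'doge'}
english_vocab.update(crypto_terms)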

def filter_by_english_vocab(text):
    if pd.isna(text):
        return ''
    word_list = text.split()
    filtered_words = [word for word in word_list if word.lower() in english_vocab]
    return ' '.join(filtered_words)

df['processed_tweet'] = df['tweet_content'].apply(filter_by_english_vocab)
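One thing to watch with dictionary matching is that tokens produced by `split()` still carry punctuation, so something like `bitcoin,` or `#HODL` would not match a vocabulary entry. A minimal sketch of one way to handle this follows. The helper names `normalize_token` and `filter_by_vocab` are hypothetical, and `demo_vocab` is a tiny stand-in for the NLTK word list plus crypto terms so the example runs on its own.

```python
import string

# Tiny stand-in vocabulary for illustration only; in practice this would be
# the NLTK word list extended with crypto terms.
demo_vocab = {'bitcoin', 'is', 'pumping', 'today', 'hodl'}


def normalize_token(token: str) -> str:
    """Lowercase a token and strip leading/trailing ASCII punctuation."""
    return token.strip(string.punctuation).lower()


def filter_by_vocab(text: str, vocab: set) -> str:
    """Keep the original tokens whose normalized form is in the vocabulary."""
    kept = [tok for tok in text.split() if normalize_token(tok) in vocab]
    return ' '.join(kept)


print(filter_by_vocab("Bitcoin, is pumping today! #HODL 加油", demo_vocab))
```

The lookup uses the normalized form, but the original token (with its punctuation and casing) is what gets kept, which may or may not be what you want before feature extraction.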

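Method 3: Filter to ASCII characters with a regular expression

This is the fastest, most lightweight option: it keeps only words made up of English letters (plus digits or special symbols as needed), which suits quickly filtering out obviously non-English text. It cannot distinguish English from other languages written in plain ASCII letters.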

import re

def filter_ascii_only(text):
    if pd.isna(text):
        return ''
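    # Match words made of English letters and digits, optionally prefixed by @ or #
    # (usernames and hashtags). Lookarounds are used instead of \b because \b never
    # matches right before '@' or '#', which would strip those prefixes.
    pattern = re.compile(r'(?<!\w)[@#]?[a-zA-Z0-9]+(?!\w)')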
    english_words = pattern.findall(text)
    return ' '.join(english_words)

df['processed_tweet'] = df['tweet_content'].apply(filter_ascii_only)

After processing, you can randomly sample a few rows to check the result:

print(df[['tweet_content', 'processed_tweet']].sample(5))
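Beyond eyeballing a sample, a simple number can help spot tweets where the filter removed too much. The sketch below computes the share of tokens that survived filtering; the function name `retention_ratio` and the example pairs are made up for illustration.

```python
def retention_ratio(original: str, processed: str) -> float:
    """Fraction of whitespace-separated tokens that survived filtering."""
    n_original = len(original.split())
    if n_original == 0:
        return 1.0  # nothing could be lost from an empty tweet
    return len(processed.split()) / n_original


# Hypothetical (original, processed) pairs for demonstration
pairs = [
    ("btc to the moon", "btc to the moon"),
    ("比特币 今天 大涨 btc", "btc"),
]
for original, processed in pairs:
    print(f"{retention_ratio(original, processed):.2f}  {original!r}")
```

On the DataFrame, something along the lines of applying this row-wise to `tweet_content` and `processed_tweet` and sorting by the result should surface the most heavily filtered tweets for manual review.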

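Once these steps are done, your dataset should contain mostly English content, and you can move on to feature extraction (for example TF-IDF or Word2Vec) and training your classification model.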

The question addressed here comes from Stack Exchange; it was asked by Aziz Bokhari.
