How do I lemmatize the words in a TXT file and replace only the words that were lemmatized?

阿华AIGC实验室

2026-5-20

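Hey, I see you've already managed to extract the words from Lincoln's text. The lemmatization step that comes next isn't that complicated, so let's work through it one step at a time!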

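First, you've already imported WordNetLemmatizer, but you may not have noticed that it treats every word as a noun unless told otherwise, so verbs and adjectives often come back unchanged. For example, without a part-of-speech (POS) tag, "running" is returned as "running"; tell the lemmatizer it is a verb and you get the correct "run". So we need to add a POS-tagging step.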

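Step 1: Add the required dependencies and a helper function

First download the data that POS tagging needs, then write a small function that converts NLTK's POS tags into the format WordNet understands: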

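import re

import nltk
from nltk.corpus import wordnet
from nltk.stem.wordnet import WordNetLemmatizer

# Download the NLTK data used below
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')  # tokenizer models used by word_tokenize
# Newer NLTK releases (3.9 and later, as far as I know) look for these names instead
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('punkt_tab')

# Helper: convert an NLTK (Penn Treebank) POS tag into a WordNet POS constant
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN  # fall back to noun, the lemmatizer's own default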
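
If you prefer a lookup table to an if/elif chain, the same mapping can be sketched as below. This is only an illustration: the names PREFIX_TO_WORDNET and penn_to_wordnet are made up for this example, and the one-letter strings are the values that wordnet.ADJ, wordnet.VERB, wordnet.NOUN and wordnet.ADV stand for, written out so the snippet runs without NLTK installed.

```python
# Hypothetical table-driven variant of get_wordnet_pos.
# Keys are the first letter of a Penn Treebank tag; values are the
# one-letter WordNet POS codes ('a' adjective, 'v' verb, 'n' noun, 'r' adverb).
PREFIX_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}


def penn_to_wordnet(tag, default="n"):
    """Map a Penn Treebank tag such as 'VBG' to a WordNet POS code."""
    # tag[:1] is safe even for an empty string, which falls back to the default
    return PREFIX_TO_WORDNET.get(tag[:1], default)


for tag in ["JJR", "VBG", "NNS", "RB", "DT"]:
    print(tag, "->", penn_to_wordnet(tag))
```

Any tag outside the four groups (determiners, prepositions and so on) falls back to noun, which matches what the lemmatizer would do on its own.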

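Step 2: Rework the function to lemmatize and replace accurately

Next, adjust your lemfile function so that each word is replaced by its lemma, while words that have no different lemma stay exactly as they are: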

def lemfile():
    lemmatizer = WordNetLemmatizer()

    # Read the file in a with block so it is closed automatically
    with open('1865-Lincoln.txt', 'r') as f:
        text = f.read().lower()

    # Strip everything except lowercase letters, spaces and apostrophes
    # (raw string, no stray backslash escapes)
    cleaned_text = re.sub(r"[^a-z ']+", " ", text)

    # word_tokenize splits contractions Penn Treebank style
    # ("don't" -> "do", "n't"), which is what the POS tagger expects
    words = nltk.word_tokenize(cleaned_text)
    # Tag every word with its part of speech
    tagged_words = nltk.pos_tag(words)

    # Lemmatize word by word
    lemmatized_words = []
    for word, tag in tagged_words:
        # Convert the tag to a WordNet POS
        pos = get_wordnet_pos(tag)
        # lemmatize() returns the word unchanged when it finds no lemma
        lemma = lemmatizer.lemmatize(word, pos=pos)
        lemmatized_words.append(lemma)

    # Join the lemmas back into one space-separated string
    lemmatized_text = ' '.join(lemmatized_words)

    # Optional: save the result to a new file
    # with open('lemmatized_1865-Lincoln.txt', 'w') as f:
    #     f.write(lemmatized_text)

    return lemmatized_text

# Run it
result = lemfile()
print(result)
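
To check which words the lemmatizer actually changed, you can compare the token list with the lemma list. The sketch below is a rough illustration: changed_pairs is a hypothetical helper, and the two lists are written by hand here, whereas in lemfile they would be words and lemmatized_words.

```python
def changed_pairs(words, lemmas):
    """Return (original, lemma) pairs for the tokens whose lemma differs."""
    return [(word, lemma) for word, lemma in zip(words, lemmas) if word != lemma]


# Hand-written sample data, standing in for the lists built inside lemfile()
words = ["the", "fathers", "brought", "forth"]
lemmas = ["the", "father", "bring", "forth"]

for word, lemma in changed_pairs(words, lemmas):
    print(word, "->", lemma)
```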

Key details

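Use word_tokenize instead of split(): it separates contractions the way the POS tagger expects ("don't" becomes "do" and "n't"), so the tags, and therefore the lemmas, are more accurate. Keep in mind that the joined output will then read "do n't".
POS tagging is the core step: without it, verbs and adjectives are mostly left unchanged. For example, "better" treated as a noun stays "better", but tagged as an adjective it becomes "good".
What is and isn't preserved: join puts the lemmatized words back in their original order, separated by single spaces, and words with no different lemma are left as they are. The text is lowercased, though, and punctuation and line breaks are dropped by the cleaning step.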
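
If you need the original punctuation, capitalization and line breaks to survive, one possible approach is to substitute words in place with re.sub and a callback, touching a word only when its lemma differs. The sketch below is a minimal illustration under a few assumptions: replace_changed_words and WORD_RE are names invented for this example, the lemmatize argument is any function from a lowercase word to its lemma, and a toy dictionary stands in for WordNetLemmatizer so the snippet runs without NLTK data. It does no POS tagging; to combine it with the tagged version you would have to tag the tokens first and look up each word's tag inside the callback.

```python
import re

# A word is a run of letters, optionally followed by an apostrophe part ("don't")
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")


def replace_changed_words(text, lemmatize):
    """Replace only the words whose lemma differs; leave everything else alone."""

    def swap(match):
        word = match.group(0)
        lemma = lemmatize(word.lower())
        if lemma == word.lower():
            return word  # no change: keep the original spelling and case
        if word[0].isupper():
            return lemma[0].upper() + lemma[1:]  # keep the leading capital
        return lemma

    return WORD_RE.sub(swap, text)


# Toy lemma table used only for this demonstration
toy = {"fathers": "father", "brought": "bring", "years": "year"}
sample = "Fathers brought forth,\nmany years ago."
print(replace_changed_words(sample, lambda w: toy.get(w, w)))
```

With the real lemmatizer you could pass lemmatizer.lemmatize (or a small wrapper that supplies a POS) as the second argument.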

If you want a simplified version (at some cost in accuracy), you can skip POS tagging and lemmatize directly:

def simple_lemfile():
    lemmatizer = WordNetLemmatizer()
    with open('1865-Lincoln.txt', 'r') as f:
        text = f.read().lower()
    cleaned_text = re.sub(r"[^a-z ']+", " ", text)
    words = cleaned_text.split()
    lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
    return ' '.join(lemmatized_words)

Still, the POS-tagged version is the one I'd recommend, since its results fit the meaning of each word better.

The question comes from Stack Exchange; it was asked by ArchivistG.