Generating keyword titles with LDA and Gensim for 10k sentences in a CSV with no predefined topics

阿华AIGC实验室

2026-5-11

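Hey, I know this one well! Using Gensim to train an LDA model and turn the top keywords into a title for each sentence is a fairly clear process. Here is the hands-on workflow, step by step:

1. Set up the environment

First install the packages you need. Run this command in a terminal: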

pip install pandas gensim nltk

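2. Load your CSV data

Use Pandas to read in the 10,000 sentences. This assumes the column holding the sentences is named sentence; if your column name is different, remember to change it: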

import pandas as pd

# Replace with the path to your CSV file
df = pd.read_csv('your_test_file.csv')
# Make sure there are no missing values; drop (or fill) any empty rows
df = df.dropna(subset=['sentence'])

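3. Preprocess the text (the most important step!)

LDA works best on clean text, so the sentences need to be "washed" first: lowercase them, strip punctuation, tokenize, remove stop words (low-information words such as "the", "is", and "a"), and lemmatize (for example, reducing "cars" to "car"). Note that WordNetLemmatizer treats every word as a noun unless it is given a part-of-speech tag, so verb forms such as "running" are left unchanged by the code below.

Download the NLTK resources first. Run this the first time only: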

import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

Then write a preprocessing function:

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re

# Initialize the stop-word set and the lemmatizer
stop_words = set(stopwords.words('english'))  # For Chinese text, use a Chinese stop-word list instead
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    # Lowercase
    text = text.lower()
    # Remove punctuation, digits, and special characters
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Tokenize on whitespace
    words = text.split()
    # Remove stop words and lemmatize
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    # Drop single-character tokens
    words = [word for word in words if len(word) > 1]
    return words

# Preprocess every sentence
df['processed_text'] = df['sentence'].apply(preprocess_text)

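Note: if your text is Chinese, switch to Chinese tooling (for example, jieba for word segmentation plus a Chinese stop-word list) and adjust the preprocessing logic accordingly.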
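
As a rough illustration of that adjustment, here is a minimal sketch of a Chinese variant of the preprocessing function. Several things here are assumptions rather than part of the original answer: the name preprocess_chinese, the tiny CN_STOP_WORDS set (a real stop-word file would be much longer), and the choice to pass the tokenizer in as an argument so the function does not depend on any one segmenter. Lemmatization is left out because it does not apply to Chinese.

```python
import re

# Hypothetical mini stop-word set for illustration only; in practice, load a
# full Chinese stop-word file (one word per line) instead.
CN_STOP_WORDS = {"的", "是", "啊", "了", "在"}

def preprocess_chinese(text, tokenize, stop_words=CN_STOP_WORDS):
    """Segment Chinese text and drop stop words and single-character tokens.

    `tokenize` is any callable mapping a string to a list of tokens,
    for example jieba.lcut if jieba is installed (pip install jieba).
    """
    # Keep only CJK ideographs; punctuation, digits, and Latin letters become spaces
    text = re.sub(r"[^\u4e00-\u9fff]", " ", text)
    tokens = []
    for chunk in text.split():
        tokens.extend(tokenize(chunk))
    # Mirror the English version: remove stop words and one-character tokens
    return [t for t in tokens if t not in stop_words and len(t) > 1]
```

With jieba installed, this could be applied as df['sentence'].apply(lambda s: preprocess_chinese(s, jieba.lcut)). Dropping one-character tokens mirrors the English function, but in Chinese it can discard meaningful words, so that filter may be worth relaxing.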

4. Build the dictionary and corpus Gensim needs

Gensim's LDA model expects input in its own format, so do these two steps first:

from gensim.corpora import Dictionary

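# Build the dictionary: map every preprocessed token to a unique ID
dictionary = Dictionary(df['processed_text'])
# Filter extremes: drop tokens found in fewer than 5 documents (too rare)
# or in more than 50% of documents (too generic)
dictionary.filter_extremes(no_below=5, no_above=0.5)
# Build the bag-of-words corpus: each sentence becomes a list of (token ID, count) pairs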
corpus = [dictionary.doc2bow(text) for text in df['processed_text']]
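
To make those two calls less opaque, here is a small pure-Python sketch of the same idea: filter tokens by document frequency, then turn each document into (token ID, count) pairs. It is a simplified stand-in, not Gensim's implementation. The helper names build_vocab and to_bow are made up for this example, and IDs are assigned alphabetically here, which Gensim does not promise.

```python
from collections import Counter

def build_vocab(docs, no_below=5, no_above=0.5):
    """Map each kept token to an integer ID using document-frequency filters."""
    # Document frequency: the number of documents each token appears in
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    kept = sorted(t for t, n in doc_freq.items() if no_below <= n <= max_docs)
    return {token: idx for idx, token in enumerate(kept)}

def to_bow(doc, vocab):
    """Convert a token list into a sorted list of (token_id, count) pairs."""
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())

docs = [["cat", "sat", "mat"], ["cat", "cat", "dog"], ["dog", "ran"], ["bird", "sang"]]
vocab = build_vocab(docs, no_below=2, no_above=0.5)
print(vocab)                   # {'cat': 0, 'dog': 1}
print(to_bow(docs[1], vocab))  # [(0, 2), (1, 1)]
```

Note that a sentence whose tokens are all filtered out ends up as an empty list, which gives the model nothing to go on for that sentence.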

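5. Train the LDA model (choosing the right number of topics is key)

Since you have no predefined topics, you have to pick a suitable number of topics yourself. The coherence score is a useful guide: the higher the score, the more meaningful the topics tend to be.

First write a function to help choose the number of topics: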

from gensim.models import LdaModel
from gensim.models.coherencemodel import CoherenceModel

def find_best_topic_num(corpus, dictionary, texts, start=5, end=30, step=5):
    coherence_scores = []
    model_list = []
    for num_topics in range(start, end+1, step):
        model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics, passes=10, random_state=42)
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_scores.append(coherencemodel.get_coherence())

    # Pick the model with the highest coherence score
    best_idx = coherence_scores.index(max(coherence_scores))
    best_model = model_list[best_idx]
    best_num_topics = start + best_idx*step
    print(f"Best number of topics: {best_num_topics}, coherence score: {max(coherence_scores):.4f}")
    return best_model

# Run the search for the best model (adjust the range as needed, e.g. 5 to 50);
# the tokenized texts are passed as a plain list of token lists
lda_model = find_best_topic_num(corpus, dictionary, df['processed_text'].tolist(), start=10, end=50, step=5)

If this is too slow, you can lower passes (for example, to 5), at some cost in topic quality.

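6. Generate a top-keyword title for each sentence

With a trained model in hand, you can generate keywords for each sentence. The logic: find the topic that best matches the sentence, then take that topic's top N keywords (for example, the top 3) as the title. This means all sentences that share a dominant topic get the same title.

Write a function to generate the title: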

def generate_title(text_bow, model, dictionary, top_n=3):
    # Get the sentence's probability distribution over topics
    topic_probs = model[text_bow]
    # Take the topic with the highest probability
    dominant_topic = max(topic_probs, key=lambda x: x[1])[0]
    # Take that topic's top N keywords
    topic_keywords = model.show_topic(dominant_topic, topn=top_n)
    # Join the keywords into a title (comma-separated here; a space works too)
    title = ', '.join([word for word, prob in topic_keywords])
    return title

# Generate a title for every sentence
df['title_keywords'] = df['processed_text'].apply(lambda x: generate_title(dictionary.doc2bow(x), lda_model, dictionary))

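If you want the title to fit the sentence itself more closely, you can factor in the words the sentence actually contains: for example, pick from the topic keywords only those that appear in the sentence. That makes the title more precise.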
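
One way that refinement might look is sketched below. The helper pick_title_words and the sample keyword list are hypothetical. It takes a longer candidate list of (word, probability) pairs, in the shape returned by model.show_topic, puts the words that occur in the sentence first, and falls back to the remaining topic words only if there are not enough matches.

```python
def pick_title_words(topic_keywords, sentence_tokens, top_n=3):
    """Prefer topic keywords that occur in the sentence; pad with the rest.

    topic_keywords: list of (word, probability) pairs, highest probability
        first, e.g. the result of model.show_topic(topic_id, topn=20).
    sentence_tokens: the preprocessed tokens of one sentence.
    """
    present = set(sentence_tokens)
    in_sentence = [w for w, _ in topic_keywords if w in present]
    others = [w for w, _ in topic_keywords if w not in present]
    return ", ".join((in_sentence + others)[:top_n])

# Made-up topic keywords, for illustration only
keywords = [("price", 0.08), ("market", 0.06), ("stock", 0.05), ("bank", 0.04)]
print(pick_title_words(keywords, ["bank", "raised", "price"], top_n=3))
# price, bank, market
```

To use it with the model above, generate_title would need the sentence's tokens as well as its bag-of-words, and would request more candidates from show_topic (the topn=20 here is an arbitrary choice).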

7. Save the results

Finally, write the generated titles back to a CSV alongside the original sentences:

df.to_csv('sentences_with_titles.csv', index=False)

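A few tips

For Chinese text, swap the tokenizer for jieba, use a Chinese stop-word list, and adjust the preprocessing logic accordingly.
The number of topics is not fixed. Try several values and see which one yields keywords that best match your expectations.
If the LDA results are not precise enough, combine them with a TF-IDF model: extract each sentence's top TF-IDF keywords first, then merge them with the LDA topic words.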
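
The TF-IDF combination in the last tip could be sketched as follows. This is a plain-Python illustration under stated assumptions, not the original author's method: it uses a basic count times log(N / document frequency) score, whereas Gensim's own gensim.models.TfidfModel applies a different default weighting and normalization, and the helper names tfidf_top_words and merge_keywords are invented for the example.

```python
import math
from collections import Counter

def tfidf_top_words(docs, top_n=3):
    """Return the top_n highest-scoring TF-IDF tokens for each tokenized document."""
    n_docs = len(docs)
    doc_freq = Counter(token for doc in docs for token in set(doc))
    results = []
    for doc in docs:
        counts = Counter(doc)
        # Term count times log inverse document frequency
        scores = {t: c * math.log(n_docs / doc_freq[t]) for t, c in counts.items()}
        # Highest score first; ties broken alphabetically for stable output
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        results.append(ranked[:top_n])
    return results

def merge_keywords(tfidf_words, topic_words, top_n=3):
    """Put sentence-specific TF-IDF words first, then fill with topic words."""
    merged = list(dict.fromkeys(tfidf_words + topic_words))  # de-duplicate, keep order
    return ", ".join(merged[:top_n])

docs = [["bank", "raised", "rate"], ["bank", "cut", "rate"], ["team", "won", "match"]]
top = tfidf_top_words(docs, top_n=2)
print(top[0])                                           # ['raised', 'bank']
print(merge_keywords(top[0], ["rate", "bank", "loan"]))  # raised, bank, rate
```

Putting the TF-IDF words first keeps the title tied to the individual sentence, while the topic words supply context when a sentence has few distinctive terms.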

The question in this post comes from Stack Exchange; it was asked by ashu.
