How to extract the target terms (including year numbers) from an LDA topic string with probabilities

阿华AIGC实验室

2026-5-14

Question: extracting a keyword-only string (including year numbers) from LDA topic output

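I am working on topic modeling and have a dictionary that stores each topic's terms together with the movies assigned to that topic, for example: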

{'Topic 49': ['0.039*"alien" + 0.038*"science_fiction" + 0.020*"adventure" + 0.020*"action" + 0.017*"2000"', 
              array(['Avatar', 'Men in Black 3', 'Transformers: Age of Extinction', 'Green Lantern', 'Men in Black II', 'Final Fantasy: The Spirits Within', 'Treasure Planet', 'Men in Black', 'A.I. Artificial Intelligence', 'Mission to Mars', 'Independence Day', 'Titan A.E.', 'Sphere', 'Signs', 'AVP: Alien vs. Predator', 'Zathura: A Space Adventure', 'My Favorite Martian', 'I Am Number Four'], dtype=object)],
 ...}

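The terms in each topic come with their probabilities, which is the default format of the LDA output. I want to keep only the terms themselves, so that the result looks like this: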

{'Topic 49': ['alien science_fiction adventure action 2000', 
              array(['Avatar', 'Men in Black 3', 'Transformers: Age of Extinction', 'Green Lantern', 'Men in Black II', 'Final Fantasy: The Spirits Within', 'Treasure Planet', 'Men in Black', 'A.I. Artificial Intelligence', 'Mission to Mars', 'Independence Day', 'Titan A.E.', 'Sphere', 'Signs', 'AVP: Alien vs. Predator', 'Zathura: A Space Adventure', 'My Favorite Martian', 'I Am Number Four'], dtype=object)],
 ...}

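I have tried several approaches without success; for example, a character-based filter I tried drops year terms such as 2000. Is there a way to extract only the terms that follow each * and are separated by + (including year numbers)?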

Solution: use a regular expression to extract exactly the target keywords

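I ran into the same problem when processing LDA results. A regular expression fits this kind of string extraction well: it captures the quoted content after each *, so words and numeric years are both kept.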

Here is runnable Python code:

import re
import numpy as np

# The original topic dictionary
original_topic_dict = {
    'Topic 49': [
        '0.039*"alien" + 0.038*"science_fiction" + 0.020*"adventure" + 0.020*"action" + 0.017*"2000"',
        np.array(['Avatar', 'Men in Black 3', 'Transformers: Age of Extinction', 'Green Lantern', 'Men in Black II', 'Final Fantasy: The Spirits Within', 'Treasure Planet', 'Men in Black', 'A.I. Artificial Intelligence', 'Mission to Mars', 'Independence Day', 'Titan A.E.', 'Sphere', 'Signs', 'AVP: Alien vs. Predator', 'Zathura: A Space Adventure', 'My Favorite Martian', 'I Am Number Four'], dtype=object)
    ]
}

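# Extract the keywords from one topic string
def extract_topic_terms(term_string):
    # Regex: capture XXX from each *"XXX"; the class accepts letters, digits, and underscores
    # (\w alone already covers \d and _, so they are redundant but harmless)
    pattern = r'\*"([\w\d_]+)"'
    # Find every matching keyword, in order of appearance
    terms = re.findall(pattern, term_string)
    # Join them into a single space-separated string
    return ' '.join(terms)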

# Process every topic in the dictionary
processed_topic_dict = {}
for topic_key, (term_str, movie_array) in original_topic_dict.items():
    processed_topic_dict[topic_key] = [extract_topic_terms(term_str), movie_array]

# Inspect the result
print(processed_topic_dict)

Code explanation

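What the regular expression r'\*"([\w\d_]+)"' does:
- \*: matches a literal * (* is a special character in regular expressions, so it must be escaped)
- ": matches the opening and closing double quotes
- ([\w\d_]+): a capture group that matches content made of letters, digits, and underscores, which covers both the keywords and the years in this example. \w already includes digits and the underscore, so \w+ alone would match the same text. A term containing any other character, such as a hyphen, would not match and would be skipped.
re.findall() returns the list of all captured keywords, and ' '.join() joins them into the space-separated format you want, so neither the years nor the words are lost.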
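If a vocabulary contains terms with characters outside \w (a hyphen, for instance), one possible variation is to capture everything between the quotes instead of listing the allowed characters. The sketch below is not part of the original answer: the name extract_terms_permissive and the sample string are hypothetical, and it assumes every term is wrapped in double quotes and contains no embedded double quote.

```python
import re

# Assumption: each term is wrapped in double quotes and has no embedded '"'.
# [^"]+ captures any run of non-quote characters, so hyphens and dots survive.
QUOTED_TERM = re.compile(r'\*"([^"]+)"')

def extract_terms_permissive(term_string):
    """Return the quoted terms of one topic string, joined by single spaces."""
    return ' '.join(QUOTED_TERM.findall(term_string))

# Hypothetical topic string with a hyphenated term and a year
sample = '0.030*"sci-fi" + 0.020*"2000" + 0.010*"science_fiction"'

print(extract_terms_permissive(sample))
# prints: sci-fi 2000 science_fiction
```

On the same sample, the stricter pattern from the answer would return only 2000 and science_fiction, because the hyphen in sci-fi is not matched by \w. For the example data in the question both patterns give the same result.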

The question comes from Stack Exchange; it was asked by J.Doe.
