How do I count word frequencies in a list of dictionaries and produce the format needed for similarity calculation?


阿华AIGC实验室

2026-5-27

Solution: converting a list of dictionaries to a similarity-calculation format in Python

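Here is a Python script that performs the format conversion you described. The logic is broken down step by step below so you can follow and adapt it: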

Core conversion rules

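The rules can be inferred from your sample input and output:

Global word frequency: each line starts with a word followed by the number of input dictionaries (document units) that contain it. Every value in the input dictionaries is 1, meaning the word occurs once in that unit, so this count is also the word's total number of occurrences.
Co-occurring words: after the target word come all the words that appear with it in at least one document unit, each written as word: global count.
Each target word's entry goes on its own line.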
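
To make these rules concrete, here is a minimal sketch applied to two toy documents. The helper name build_lines and the toy data are illustrative assumptions, not part of the original question; the logic follows the three rules listed above.

```python
from collections import defaultdict

def build_lines(docs):
    """Return one 'word count co_word: count ...' line per word, sorted alphabetically."""
    freq = defaultdict(int)        # word -> number of documents containing it
    neighbors = defaultdict(set)   # word -> words sharing at least one document with it
    for doc in docs:
        words = set(doc)
        for w in words:
            freq[w] += 1
            neighbors[w] |= words - {w}
    lines = []
    for w in sorted(freq):
        parts = [f"{w} {freq[w]}"] + [f"{c}: {freq[c]}" for c in sorted(neighbors[w])]
        lines.append(" ".join(parts))
    return lines

# Hypothetical toy input: 'river' appears in both documents.
toy_docs = [{"river": 1, "water": 1}, {"river": 1, "vast": 1}]
for line in build_lines(toy_docs):
    print(line)
# river 2 vast: 1 water: 1
# vast 1 river: 2
# water 1 river: 2
```

Note that 'river' is reported with a count of 2 both on its own line and wherever it appears as a co-occurring word, because the count is always the global one.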

Full code

# Your input data
input_data = [
    {'mississippi': 1, 'worth': 1, 'reading': 1},
    {'commonplace': 1, 'river': 1, 'contrary': 1, 'ways': 1, 'remarkable': 1},
    {'considering': 1, 'missouri': 1, 'main': 1, 'branch': 1, 'longest': 1, 'river': 1, 'world--four': 1},
    {'seems': 1, 'safe': 1, 'crookedest': 1, 'river': 1, 'part': 1, 'journey': 1, 'uses': 1, 'cover': 1, 'ground': 1, 'crow': 1, 'fly': 1, 'six': 1, 'seventy-five': 1},
    {'discharges': 1, 'water': 1, 'st': 1},
    {'lawrence': 1, 'twenty-five': 1, 'rhine': 1, 'thirty-eight': 1, 'thames': 1},
    {'river': 1, 'vast': 1, 'drainage-basin:': 1, 'draws': 1, 'water': 1, 'supply': 1, 'twenty-eight': 1, 'states': 1, 'territories': 1, 'delaware': 1, 'atlantic': 1, 'seaboard': 1, 'country': 1, 'idaho': 1, 'pacific': 1, 'slope--a': 1, 'spread': 1, 'forty-five': 1, 'degrees': 1, 'longitude': 1},
    {'mississippi': 1, 'receives': 1, 'carries': 1, 'gulf': 1, 'water': 1, 'fifty-four': 1, 'subordinate': 1, 'rivers': 1, 'navigable': 1, 'steamboats': 1, 'hundreds': 1, 'flats': 1, 'keels': 1},
    {'area': 1, 'drainage-basin': 1, 'combined': 1, 'areas': 1, 'england': 1, 'wales': 1, 'scotland': 1, 'ireland': 1, 'france': 1, 'spain': 1, 'portugal': 1, 'germany': 1, 'austria': 1, 'italy': 1, 'turkey': 1, 'almost': 1, 'wide': 1, 'region': 1, 'fertile': 1, 'mississippi': 1, 'valley': 1, 'proper': 1, 'exceptionally': 1}
]

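# 1. Compute global word frequencies and co-occurrence relations
global_freq = {}  # key: word, value: number of document units containing it
co_occur = {}     # key: word, value: set of all words that co-occur with it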

for doc in input_data:
    doc_words = list(doc.keys())
    # Update the global word frequency
    for word in doc_words:
        global_freq[word] = global_freq.get(word, 0) + 1
    # Update co-occurrence: link each word to every other word in the current document
    for word in doc_words:
        if word not in co_occur:
            co_occur[word] = set()
        for other_word in doc_words:
            if other_word != word:
                co_occur[word].add(other_word)

# 2. Build the output in the target format
output_lines = []
# Sort target words alphabetically (swap in whatever ordering you need)
for target_word in sorted(global_freq.keys()):
    # Start of the line: target word + global count
    line_start = f"{target_word} {global_freq[target_word]}"
    # Co-occurrence part: each word formatted as "word: count"
    co_parts = [f"{co_word}: {global_freq[co_word]}" for co_word in sorted(co_occur[target_word])]
    # Join the pieces into the complete line
    full_line = ' '.join([line_start] + co_parts)
    output_lines.append(full_line)

# Print the result (or write it to a file)
print('\n'.join(output_lines))

Customization options

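Sort order: if you do not want target words or co-occurring words in alphabetical order, change the arguments to sorted(). For example, to sort by global frequency in descending order: sorted(global_freq.keys(), key=lambda x: -global_freq[x])
Special tokens: drainage-basin: and drainage-basin in the input are treated as two different words, and the code keeps them separate. To merge them, add a string-replacement step during preprocessing.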

Saving the output: to save the result to a file, replace print() with:

with open('similarity_format.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(output_lines))
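
The sort-order and special-token options above can be sketched as follows. This is one possible approach, not the only one: the helper names normalize_token and normalize_doc, the set of stripped punctuation characters, and the sample documents are all assumptions made for illustration. Ties in frequency are broken alphabetically so the order is deterministic.

```python
from collections import Counter

def normalize_token(token):
    """Strip trailing punctuation so 'drainage-basin:' matches 'drainage-basin'.

    The character set ':;,.' is an assumption; adjust it to your data.
    """
    return token.rstrip(":;,.")

def normalize_doc(doc):
    """Return a copy of doc with normalized keys.

    If two variants collapse to the same key, keep the larger value,
    which preserves the 'present in this unit' meaning of 1.
    """
    merged = {}
    for token, count in doc.items():
        key = normalize_token(token)
        merged[key] = max(merged.get(key, 0), count)
    return merged

# Hypothetical sample: the two spellings appear in different documents.
docs = [
    {"river": 1, "drainage-basin:": 1},
    {"area": 1, "drainage-basin": 1},
]
cleaned = [normalize_doc(d) for d in docs]

# Document frequency after merging the variants.
freq = Counter(word for doc in cleaned for word in doc)

# Descending frequency, then alphabetical for ties.
ordered = sorted(freq, key=lambda w: (-freq[w], w))
print(ordered)  # ['drainage-basin', 'area', 'river']
```

Running normalize_doc over input_data before the main loop would let the rest of the script stay unchanged.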

The question comes from Stack Exchange; it was asked by Jim Ye.
