Extracting words that recur across text lines under the same ID (ignoring repeats within a line), and how to do it with TF-IDF

Ahua AIGC Lab

2026-5-22

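Hey, let me help you sort this out! You want the words that keep showing up across different text lines under the same ID, with repeats inside a single line not counted. That is a fairly specific requirement, and plain TF-IDF applied directly probably won't give you what you want; the preprocessing has to be adjusted first. Here is a step-by-step breakdown: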

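A way to extract cross-line high-frequency words under the same ID

Core idea

First, pin down the actual requirement: for each ID, count how many different text lines a word appears in, with repeats inside one line counted only once. In standard TF-IDF, the TF part counts how often a word occurs inside a single document. Here we instead treat every text line under an ID as its own "sub-document", count how many of those sub-documents contain the word, and then weight the words with TF-IDF or a custom rule.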
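Written as a formula, the quantity being counted looks roughly like this. The notation is an assumption introduced here for illustration: L_i stands for the set of text lines belonging to ID i, and a line is treated as the set of distinct words it contains.

```latex
\mathrm{ratio}(w, i) \;=\; \frac{\bigl|\{\, \ell \in L_i : w \in \ell \,\}\bigr|}{|L_i|}
```

For instance, a word that appears in 2 of the 4 lines of an ID gets a ratio of 2/4 = 0.5, no matter how many times it is repeated inside those 2 lines.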

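Implementation steps (in Python)

1. Data preprocessing

Load the dataset first. Each text line is tokenized and de-duplicated within the line, so that repeats inside one line do not distort the counts; the grouping by ID is done in the steps that follow: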

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from collections import defaultdict

# Load your sample data
data = pd.DataFrame({
    'id': [1,1,1,1,2,2],
    'text': [
        'Interface Down GigabitEthernet0/1/2 null .',
        'Interface Gi0/1/2 Down on node BMAC69RT01',
        'Interface Down MEth0/0/1 null .',
        'Interface MEth0/0/1 Down on node',
        'Interface Up FastEthernet0/0/0 null',
        'Interface Fa0/0/0...'
    ]
})

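# Preprocessing: tokenize, drop noise tokens, de-duplicate within the line
def process_line(text):
    # Split on whitespace and trim leading/trailing dots ('.' and '...' become '')
    raw_words = [word.strip('.') for word in text.split()]
    # Drop the empty strings left by pure-punctuation tokens and the placeholder 'null'
    filtered_words = [word for word in raw_words if word and word != 'null']
    # De-duplicate within the line while keeping word order
    return list(dict.fromkeys(filtered_words))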
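To see what the line-level preprocessing produces, here is a small self-contained sketch. It restates process_line in a variant that also discards empty tokens (an assumption about the intended behaviour), and the third input line is a made-up example that is not part of the sample data.

```python
def process_line(text):
    """Split on whitespace, trim dots, drop noise tokens, de-duplicate in order."""
    noise = {'null'}
    # strip('.') removes leading/trailing dots, so '.' and '...' become ''
    words = (token.strip('.') for token in text.split())
    kept = [w for w in words if w and w not in noise]
    # dict.fromkeys keeps the first occurrence of each word, in order
    return list(dict.fromkeys(kept))


print(process_line('Interface Down GigabitEthernet0/1/2 null .'))
# ['Interface', 'Down', 'GigabitEthernet0/1/2']
print(process_line('Interface Fa0/0/0...'))
# ['Interface', 'Fa0/0/0']
print(process_line('Down Down Interface Down'))  # made-up line with repeats
# ['Down', 'Interface']
```

The repeated Down in the last line collapses to a single entry, which is exactly the "count once per line" behaviour the later steps rely on.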

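2. Approach 1: rank words by the share of lines they appear in

This is the most direct way: for each word, count how many text lines of the current ID contain it, then divide by the total number of lines for that ID. The higher the share, the more consistently the word recurs across lines: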

# For each ID, count the number of lines each word appears in
id_word_rank = {}

for target_id in data['id'].unique():
    # Collect all text lines for the current ID
    id_lines = data[data['id'] == target_id]['text'].tolist()
    total_lines = len(id_lines)
    word_line_count = defaultdict(int)
    
    # Walk through the lines; process_line already de-duplicates within a line,
    # so each word adds at most 1 per line (and tie order stays deterministic)
    for line in id_lines:
        for word in process_line(line):
            word_line_count[word] += 1
    
    # Turn counts into shares and sort from highest to lowest
    id_word_rank[target_id] = sorted(
        [(word, count/total_lines) for word, count in word_line_count.items()],
        key=lambda x: x[1],
        reverse=True
    )

# Print the results
for idx, rank in id_word_rank.items():
    print(f"Cross-line frequent words for ID {idx} (sorted by share of lines):")
    for word, ratio in rank:
        print(f"- {word}: {ratio:.2f}")

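For ID=1, the output includes the following entries (on and node also reach 0.50 unless they are filtered out):

Interface: 1.00 (appears in every line)
Down: 1.00 (appears in every line)
MEth0/0/1: 0.50 (appears in 2 lines)
GigabitEthernet0/1/2: 0.25 (appears in 1 line)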
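Since the goal is to keep only the words that really recur, one option is to apply a cutoff to the share of lines. The sketch below uses only the standard library; the function name frequent_words and the min_ratio parameter are hypothetical names chosen for this illustration, and 0.5 is an arbitrary default rather than a recommended value.

```python
from collections import Counter


def frequent_words(lines, min_ratio=0.5, noise=('null',)):
    """Return (word, ratio) pairs for words found in at least min_ratio of the lines."""
    line_counts = Counter()
    for line in lines:
        # A set per line means a word is counted once per line at most
        words = {token.strip('.') for token in line.split()}
        words.discard('')
        line_counts.update(words - set(noise))
    total = len(lines)
    ranked = [(w, c / total) for w, c in line_counts.items() if c / total >= min_ratio]
    # Highest ratio first; ties are broken alphabetically for stable output
    return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))


id1_lines = [
    'Interface Down GigabitEthernet0/1/2 null .',
    'Interface Gi0/1/2 Down on node BMAC69RT01',
    'Interface Down MEth0/0/1 null .',
    'Interface MEth0/0/1 Down on node',
]
print(frequent_words(id1_lines))
# [('Down', 1.0), ('Interface', 1.0), ('MEth0/0/1', 0.5), ('node', 0.5), ('on', 0.5)]
```

Raising min_ratio to 1.0 keeps only the words present in every line, here Down and Interface.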

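3. Approach 2: adapt TF-IDF to this requirement

If you would rather use TF-IDF, treat each text line under an ID as a separate document and take the mean TF-IDF score of each word over those documents. A word that appears in more lines collects a non-zero score in more documents, so its mean tends to be higher. Keep in mind that the IDF factor down-weights words that occur in every line, so the ranking can differ from the share of lines in approach 1: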

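# Build the set of sub-documents for each ID
id_corpus = {}
for target_id in data['id'].unique():
    id_lines = data[data['id'] == target_id]['text'].tolist()
    # Join each line's preprocessed words back into a string, as TF-IDF expects text input
    id_corpus[target_id] = [' '.join(process_line(line)) for line in id_lines]

# Compute TF-IDF separately for each ID
for idx, docs in id_corpus.items():
    # Split on whitespace and keep case; the default token pattern would lowercase
    # and break interface names such as GigabitEthernet0/1/2 into fragments
    vectorizer = TfidfVectorizer(token_pattern=r'\S+', lowercase=False)
    tfidf_matrix = vectorizer.fit_transform(docs)
    feature_names = vectorizer.get_feature_names_out()
    # Mean TF-IDF score of each word over the lines of this ID
    mean_tfidf = tfidf_matrix.mean(axis=0).tolist()[0]
    word_scores = dict(zip(feature_names, mean_tfidf))
    
    # Sort and print
    sorted_words = sorted(word_scores.items(), key=lambda x: x[1], reverse=True)
    print(f"\nTop words for ID {idx} by mean TF-IDF:")
    for word, score in sorted_words:
        print(f"- {word}: {score:.4f}")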
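To make the effect of the IDF factor visible, here is a minimal standard-library sketch of the same mean TF-IDF calculation. It assumes the smoothed IDF, ln((1 + n) / (1 + df)) + 1, and the per-line L2 normalisation that scikit-learn documents as its defaults; the function name mean_tfidf is hypothetical, and the sketch is meant for understanding the numbers, not as a replacement for the library.

```python
import math
from collections import Counter


def mean_tfidf(token_lists):
    """Mean TF-IDF per word over the lines (smoothed IDF, L2-normalised lines)."""
    n = len(token_lists)
    # Document frequency: in how many lines does each word occur?
    df = Counter(word for tokens in token_lists for word in set(tokens))
    idf = {w: math.log((1 + n) / (1 + d)) + 1 for w, d in df.items()}
    totals = Counter()
    for tokens in token_lists:
        tf = Counter(tokens)
        weights = {w: tf[w] * idf[w] for w in tf}
        norm = math.sqrt(sum(v * v for v in weights.values()))
        for w, v in weights.items():
            totals[w] += v / norm
    return {w: totals[w] / n for w in totals}


lines = [
    ['Interface', 'Down', 'GigabitEthernet0/1/2'],
    ['Interface', 'Down', 'MEth0/0/1'],
]
scores = mean_tfidf(lines)
for word, score in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"- {word}: {score:.4f}")
```

In this two-line example, Interface and Down get the lowest possible IDF (exactly 1) because they occur in every line, yet they still rank above the interface names, since they contribute to both lines while each interface name contributes to only one.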

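Key caveats

Consistent terminology: GigabitEthernet0/1/2 and Gi0/1/2 are two spellings of the same interface. If they should be counted together, map the terms first (for example, expand abbreviations to the full name).
Stop-word tuning: depending on your logs, add more stop words (generic words such as on and node) to filter out more noise.
Tokenization accuracy: if the logs have a more complex format, switch to a more precise tokenizer (for example, custom tokenization rules for network-device logs).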
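The first two caveats can be sketched as a small extension of the preprocessing step. Everything named here is an assumption made for illustration: the abbreviation table INTERFACE_ALIASES only covers the two prefixes that occur in the sample data, the stop-word set is just the examples mentioned above, and process_line_normalized is a hypothetical variant of process_line.

```python
import re

# Hypothetical abbreviation table; extend it to match the devices in your logs
INTERFACE_ALIASES = {'Gi': 'GigabitEthernet', 'Fa': 'FastEthernet'}
# 'null' plus the generic words mentioned in the caveats
STOP_WORDS = {'null', 'on', 'node'}

# Match an abbreviation only at the start of a token and only when a digit follows,
# so 'GigabitEthernet0/1/2' itself is left untouched
_ALIAS_RE = re.compile(r'^(' + '|'.join(INTERFACE_ALIASES) + r')(?=\d)')


def normalize_token(token):
    """Expand a known interface abbreviation to its full name."""
    return _ALIAS_RE.sub(lambda m: INTERFACE_ALIASES[m.group(1)], token)


def process_line_normalized(text):
    """Tokenize, expand abbreviations, drop stop words, de-duplicate in order."""
    words = (normalize_token(token.strip('.')) for token in text.split())
    return list(dict.fromkeys(w for w in words if w and w not in STOP_WORDS))


print(process_line_normalized('Interface Gi0/1/2 Down on node BMAC69RT01'))
# ['Interface', 'GigabitEthernet0/1/2', 'Down', 'BMAC69RT01']
```

With this variant, Gi0/1/2 and GigabitEthernet0/1/2 are counted as the same word, so that interface would appear in 2 of the 4 lines of ID 1 instead of 1 line for each spelling.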

The question comes from Stack Exchange; it was asked by Mohan.