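Pandas conditional word counting: debugging code that counts words from a custom word list
Analysis and fix: counting occurrences of specified words in DataFrame text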
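Let's first reconstruct your scenario. You have a DataFrame like this:

    import pandas as pd

    data = {'speaker': ['Adam', 'Ben', 'Clair'],
            'speech': ['Thank you very much and good afternoon.',
                       'Let me clarify that because I want to make sure we have got everything right',
                       'By now you should have some good rest']}
    df = pd.DataFrame(data)

For each row, you want to count how many times the words in a predefined list, wordlist = ['much', 'good', 'right'], appear in the speech column, and store the result in a new column. After running your code, however, the total column is all zeros. Let's see where it goes wrong: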
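Two core problems in your code

    df['total'] = 0
    for word in df['speech'].str.split():
        if word in wordlist:
            df['total'] += 1

- The loop iterates over the wrong objects: `df['speech'].str.split()` returns a Series in which each element is the list of words from an entire row (for the first row, `['Thank', 'you', 'very', 'much', 'and', 'good', 'afternoon.']`). The loop variable `word` is therefore a whole list, while `wordlist` contains individual strings, so `if word in wordlist` is always False and `df['total'] += 1` never runs.
- Even if a match succeeded, the column update is wrong: even if you did iterate over single words, `df['total'] += 1` adds 1 to every row of the column instead of only the current row's count, which is not what you want either.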
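Both problems can be checked directly. The snippet below is a small sketch that rebuilds the example DataFrame; the names `first_item`, `in_wordlist`, `demo` and `all_rows_bumped` are introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    'speaker': ['Adam', 'Ben', 'Clair'],
    'speech': ['Thank you very much and good afternoon.',
               'Let me clarify that because I want to make sure we have got everything right',
               'By now you should have some good rest'],
})
wordlist = ['much', 'good', 'right']

# Problem 1: iterating the split Series yields one list of tokens per row.
first_item = next(iter(df['speech'].str.split()))
is_list = isinstance(first_item, list)
in_wordlist = first_item in wordlist  # a list is never equal to a single string

# Problem 2: a column-level += changes every row, not just the "current" one.
demo = df.copy()
demo['total'] = 0
demo['total'] += 1
all_rows_bumped = demo['total'].tolist()
```

Here `in_wordlist` is False even though the first row contains both 'much' and 'good', and a single `+= 1` raises the count of all three rows.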
The correct implementation
Two concise and efficient approaches are recommended here:
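Method 1: process row by row with apply
Split each row's speech text and count how many of the resulting words are in the word list:

    wordlist = ['much', 'good', 'right']

    # Count how many words of a single sentence appear in the word list
    def count_target_words(sentence):
        words = sentence.split()
        # Optional: to ignore case or punctuation (such as the period in 'afternoon.'),
        # clean the tokens first:
        # words = [w.strip('.!?').lower() for w in sentence.split()]
        return sum(1 for w in words if w in wordlist)

    # Apply to the speech column to create the new column
    df['words'] = df['speech'].apply(count_target_words)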
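If you do want the optional cleaning step, a variant along the following lines may help. This is only a sketch: the helper name `count_normalized`, the `targets` set and the `sample` sentences are made up for illustration, and stripping `string.punctuation` from both ends of each token is one possible cleaning rule among many.

```python
import string
import pandas as pd

wordlist = ['much', 'good', 'right']
targets = set(wordlist)  # set lookup avoids scanning the list for every token

def count_normalized(sentence):
    """Count target words after lowercasing and stripping surrounding punctuation."""
    tokens = (w.strip(string.punctuation).lower() for w in sentence.split())
    return sum(1 for w in tokens if w in targets)

# Hypothetical sample sentences, not from the original question
sample = pd.Series(['Good, good. That is RIGHT!',
                    'Nothing here',
                    'Goodbye, much obliged'])
counts = sample.apply(count_normalized).tolist()
```

With this rule, 'Good,' and 'RIGHT!' are counted, while 'Goodbye,' is not, because each token is compared as a whole word.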
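Method 2: str.count with a regular expression (usually faster, suited to large datasets)
Join the word list into a single regular expression that matches whole words only (this avoids partial matches, so 'goodbye' is not counted as containing 'good'), then count the matches in each row:

    import re

    wordlist = ['much', 'good', 'right']
    # Build the regex: \b marks a word boundary, re.escape protects any special characters
    pattern = r'\b(?:' + '|'.join(map(re.escape, wordlist)) + r')\b'
    df['words'] = df['speech'].str.count(pattern)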
After running it, you get the result you expected:
| speaker | speech | words |
|---|---|---|
| Adam | Thank you very much and good afternoon. | 2 |
| Ben | Let me clarify that because I want to make sure we have got everything right | 1 |
| Clair | By now you should have some good rest | 1 |
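The two methods agree on this data, but they are not equivalent in general: a plain split keeps punctuation attached to a word, while the `\b` pattern does not care about it, and both are case-sensitive by default. The sketch below illustrates this on two made-up sentences (the Series `s` is hypothetical) and also shows the `flags` parameter of `Series.str.count` for case-insensitive matching.

```python
import re
import pandas as pd

wordlist = ['much', 'good', 'right']
pattern = r'\b(?:' + '|'.join(map(re.escape, wordlist)) + r')\b'

# Hypothetical sentences chosen to expose the differences
s = pd.Series(['That is right.', 'Good morning, goodbye'])

# Method 1 style: 'right.' keeps its period and 'Good' keeps its capital letter
split_counts = s.apply(lambda t: sum(w in wordlist for w in t.split())).tolist()

# Method 2 style: the word boundary ignores the trailing period
regex_counts = s.str.count(pattern).tolist()

# Same pattern, but ignoring case
regex_counts_ci = s.str.count(pattern, flags=re.IGNORECASE).tolist()
```

Which behavior is right depends on your data, so it is worth deciding up front whether punctuation and capitalization should matter.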
The question comes from Stack Exchange and was asked by Tao Han.




