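Conditional word-frequency counting in Pandas: debugging code that counts words from a custom word list

Diagnosis and fix: counting occurrences of specified words in DataFrame text

To restate your scenario, you have a DataFrame like this: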

import pandas as pd

data = {'speaker':['Adam','Ben','Clair'], 
        'speech': ['Thank you very much and good afternoon.', 
                   'Let me clarify that because I want to make sure we have got everything right', 
                   'By now you should have some good rest']}
df = pd.DataFrame(data)

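You want to count, for each row of the speech column, how many times the words in the predefined list wordlist = ['much', 'good','right'] occur, and store the result in a new column. Your code runs, but the total column is all zeros. Here is where it goes wrong:

Two core problems in your code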

df['total'] = 0
for word in df['speech'].str.split():
    if word in wordlist:
        df['total'] += 1
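  1. The loop iterates over the wrong object: df['speech'].str.split() returns a Series in which each element is the list of words from one whole row (for the first row, ['Thank', 'you', 'very', 'much', 'and', 'good', 'afternoon.']). The loop variable word is therefore an entire list, while wordlist contains single strings, so if word in wordlist is always False and df['total'] += 1 never runs.
  2. The column update is wrong even when a match occurs: even if you did iterate over individual words, df['total'] += 1 adds 1 to every row of the column rather than to the current row's count, which is not what you want.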
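
Both points can be checked directly. The snippet below is a minimal sketch that rebuilds the DataFrame from the question; the names first_item, is_list, matched and totals are introduced here purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'speaker': ['Adam', 'Ben', 'Clair'],
    'speech': ['Thank you very much and good afternoon.',
               'Let me clarify that because I want to make sure we have got everything right',
               'By now you should have some good rest'],
})
wordlist = ['much', 'good', 'right']

# Problem 1: each item the loop sees is a whole list of tokens, not one word.
first_item = next(iter(df['speech'].str.split()))
is_list = isinstance(first_item, list)
# A list never equals any of the strings in wordlist, so the test is False.
matched = first_item in wordlist

# Problem 2: an in-place addition on the column is applied to every row at once.
df['total'] = 0
df['total'] += 1
totals = df['total'].tolist()

print(first_item)
print(is_list, matched, totals)
```

Here matched is False and totals is [1, 1, 1], so a single increment touches all three rows.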

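A correct implementation

Two concise approaches are recommended here:

Method 1: process each row with apply

Split each row's speech text, then count how many of the resulting words are in the word list: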

wordlist = ['much', 'good','right']

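# Count how many words of a single sentence appear in the word list
def count_target_words(sentence):
    words = sentence.split()
    # Optional: to ignore case or punctuation (such as the full stop in 'afternoon.'), clean the tokens first:
    # words = [w.strip('.!?').lower() for w in sentence.split()]
    return sum(1 for w in words if w in wordlist)

# Apply the function to the speech column to create the new column
df['words'] = df['speech'].apply(count_target_words)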
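
If case and punctuation should be ignored, the optional cleaning step can be made a little more general. The following is only a sketch under assumptions: the function name count_target_words_normalized, the target_words set and the two sample sentences are hypothetical, and stripping string.punctuation from both ends of each token is one possible cleaning rule among several.

```python
import string
import pandas as pd

wordlist = ['much', 'good', 'right']
# A set gives constant-time membership tests on average, which helps with long word lists.
target_words = set(wordlist)

def count_target_words_normalized(sentence):
    # Lowercase each token and strip leading/trailing punctuation before the lookup.
    tokens = (w.strip(string.punctuation).lower() for w in sentence.split())
    return sum(1 for w in tokens if w in target_words)

# Hypothetical sample rows, not taken from the original question.
sample = pd.Series(['Good afternoon, that is right.', 'Much ado; goodbye!'])
counts = sample.apply(count_target_words_normalized).tolist()
print(counts)
```

With these two sentences counts is [2, 1]: 'Good' and 'right.' match after cleaning, and 'goodbye!' does not count as 'good' because whole tokens are compared.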

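Method 2: str.count with a regular expression (more concise, and usually faster on large datasets)

Join the word list into a single regular expression that matches whole words only (this avoids partial matches, so 'goodbye' is not treated as containing 'good'), then count the matches in each row: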

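import re

wordlist = ['much', 'good','right']
# Join the word list into one regex; \b marks a word boundary.
# re.escape keeps any regex metacharacters in the words literal.
pattern = r'\b(?:' + '|'.join(re.escape(w) for w in wordlist) + r')\b'

df['words'] = df['speech'].str.count(pattern)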
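
The effect of the word boundaries, and of case sensitivity, can be seen on a small example. This is a sketch with hypothetical sample sentences; it relies on the flags parameter of Series.str.count, which accepts re module flags such as re.IGNORECASE.

```python
import re
import pandas as pd

wordlist = ['much', 'good', 'right']
# Escape each word, then wrap the alternation in word boundaries.
pattern = r'\b(?:' + '|'.join(re.escape(w) for w in wordlist) + r')\b'

# Hypothetical sample rows, not taken from the original question.
sample = pd.Series(['goodbye and good luck', 'Good, much better. Right?'])

# Default matching is case-sensitive: 'Good' and 'Right' are not counted.
case_sensitive = sample.str.count(pattern).tolist()
# Passing re.IGNORECASE also counts capitalised forms.
case_insensitive = sample.str.count(pattern, flags=re.IGNORECASE).tolist()

print(case_sensitive, case_insensitive)
```

Here case_sensitive is [1, 1] and case_insensitive is [1, 3]. In both cases 'goodbye' contributes nothing, because no word boundary follows 'good' inside it.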

Running either version gives the result you expect:

speaker  speech                                                                        words
Adam     Thank you very much and good afternoon.                                       2
Ben      Let me clarify that because I want to make sure we have got everything right  1
Clair    By now you should have some good rest                                         1

The question originates from Stack Exchange; it was asked by Tao Han.