Two identically structured datasets, same R group-and-summarize code, but one throws an error: help and explanation wanted
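Background and error message
I recently ran into a strange problem: two R datasets with exactly the same structure are processed with almost identical group-and-summarize code, yet one runs fine and the other throws an error. The error is: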
    Error in `mutate()`:
    ℹ In argument: `rel_hedge_booster_score = paste(breaks[-length(breaks)], breaks[-1], sep = " to ")`.
    Caused by error:
    ! `rel_hedge_booster_score` must be size 11 or 1, not 10.
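The two datasets are aus_pol_data (the code runs fine) and cas_uk_data (triggers the error above). Their structures line up exactly: every column name and variable type matches, and only the values differ. The key column, rel_hedge_booster_score, is numeric with a range of -100 to 100; both datasets contain continuous values within that range as well as the endpoints -100 and 100 themselves.
Code comparison and locating the problem
First, I defined the bin breakpoints globally: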
breaks <<- seq(-100, 100, by = 20)
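As a quick sanity check in base R, the 11 break points define only 10 intervals, so the label vector built with paste() always has exactly 10 elements, no matter what the data look like. The name range_labels is introduced here for illustration only.

```r
# 11 break points from -100 to 100 in steps of 20
breaks <- seq(-100, 100, by = 20)
length(breaks)        # 11

# One label per interval between consecutive break points
range_labels <- paste(breaks[-length(breaks)], breaks[-1], sep = " to ")
length(range_labels)  # 10
range_labels[1]       # "-100 to -80"
range_labels[10]      # "80 to 100"
```

So any summarized table with a row count other than 10 (or 1) cannot accept this vector in mutate().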
Working code (processing aus_pol_data)
    df_aus_pol_karma <<- aus_pol_data %>%
      mutate(range_label = cut(rel_hedge_booster_score, breaks = breaks, labels = FALSE, right = FALSE)) %>%
      group_by(range_label) %>%
      summarize(
        num_data_points = n(),
        median_karma_score = median(karma_score),
        mean_karma_score = mean(karma_score),
        karma_scores = list(karma_score)
      ) %>%
      mutate(rel_hedge_booster_score = paste(breaks[-length(breaks)], breaks[-1], sep = " to ")) %>%
      select(rel_hedge_booster_score, num_data_points, median_karma_score, mean_karma_score)
This code produces the result I need, and the grouping logic works as expected.
Failing code (processing cas_uk_data)
    df_cas_uk_karma <<- cas_uk_data %>%
      mutate(range_label = cut(rel_hedge_booster_score, breaks = breaks, labels = FALSE, right = FALSE)) %>%
      group_by(range_label) %>%
      summarize(
        num_data_points = n(),
        median_karma_score = median(karma_score),
        mean_karma_score = mean(karma_score),
        karma_scores = list(karma_score)
      ) %>%
      mutate(rel_hedge_booster_score = paste(breaks[-length(breaks)], breaks[-1], sep = " to ")) %>%
      select(rel_hedge_booster_score, num_data_points, median_karma_score, mean_karma_score)
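The two snippets differ only in the dataset name, yet this one fails at mutate(rel_hedge_booster_score = ...) with the size-mismatch error.
Stepping through the pipeline, I found two key points:
- After group_by + summarize, the cas_uk_data result has one more row than the aus_pol_data result (11 rows vs 10).
- The group labels also look wrong (for example, a range like "0 to 20" shows up where it is not expected); the total number of data points is correct, only the labels are off.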
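One possible explanation for the extra row, sketched with made-up toy data (not the real datasets): with right = FALSE the bins are left-closed and right-open, so a score of exactly 100 falls outside the last bin [80, 100) and cut() returns NA for it. group_by() keeps NA as its own group, which gives 11 rows against 10 labels. The toy tibble and its values are assumptions for illustration.

```r
library(dplyr)

breaks <- seq(-100, 100, by = 20)

# Toy data: one score at the left edge of each of the 10 bins, plus the endpoint 100
toy <- tibble(
  rel_hedge_booster_score = c(seq(-100, 80, by = 20), 100),
  karma_score = 1:11
)

summary_tbl <- toy %>%
  mutate(range_label = cut(rel_hedge_booster_score, breaks = breaks,
                           labels = FALSE, right = FALSE)) %>%
  group_by(range_label) %>%
  summarize(num_data_points = n())

nrow(summary_tbl)               # 11: bins 1 to 10 plus an NA group for the score 100
tail(summary_tbl$range_label, 1)  # NA
```

The same mechanism could also mislabel rows without any error: if a dataset has an NA group and one empty bin, the summary has 10 rows again and the 10 labels attach to the wrong rows. Whether that applies to aus_pol_data is unverified here.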
Solution that worked
I later restructured the code, which resolved the problem. The modified code:
    df_cas_uk_karma_table_data <<- cas_uk_data %>%
      mutate(rel_hedge_booster_score = cut(rel_hedge_booster_score, breaks = breaks,
                                           labels = paste(breaks[-length(breaks)], breaks[-1], sep = " to "),
                                           right = FALSE)) %>%
      group_by(rel_hedge_booster_score) %>%
      summarize(
        num_data_points = n(),
        median_karma_score = median(karma_score),
        mean_karma_score = mean(karma_score),
        karma_scores = list(karma_score)
      ) %>%
      select(rel_hedge_booster_score, num_data_points, median_karma_score, mean_karma_score)
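The core change: generate the range labels directly inside cut(), instead of first producing a numeric range_label and pasting the labels on by hand afterwards. group_by then operates on the correct range labels directly, which avoids the size mismatch that arises when a hand-built label vector has to match the number of summarized rows.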
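A further hedged sketch, assuming the endpoint 100 is what ends up unbinned: labelling inside cut() removes the size error, but a score of exactly 100 would still be NA with right = FALSE alone. Adding include.lowest = TRUE, which for right = FALSE makes the last bin include the highest break, keeps it in "80 to 100". The scores vector is made-up illustration data.

```r
breaks <- seq(-100, 100, by = 20)
range_labels <- paste(breaks[-length(breaks)], breaks[-1], sep = " to ")

# Made-up scores covering both endpoints and a bin boundary
scores <- c(-100, -0.5, 0, 99.9, 100)

# Without include.lowest, the top endpoint is left unbinned
without_fix <- cut(scores, breaks = breaks, labels = range_labels, right = FALSE)
is.na(without_fix[5])  # TRUE

# With include.lowest = TRUE, 100 lands in the last bin
binned <- cut(scores, breaks = breaks, labels = range_labels,
              right = FALSE, include.lowest = TRUE)
as.character(binned)
# "-100 to -80" "-20 to 0" "0 to 20" "80 to 100" "80 to 100"
```

Whether 100 should be counted in the last bin is a design choice; the alternative is to leave it as NA and filter or report that group explicitly.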
Note: content sourced from Stack Exchange; question asked by Sikamixoticelixer.




