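How to work across data frames in R's dplyr and make df2 follow df1's grouping
Solution: compute df2 according to df1's grouping
To sum df2 within each reporter + year group, work inside dplyr's grouped context: select the rows of df2 that match the current group's reporter and year values, then sum them.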
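Version that modifies the original code directly
Inside summarize() you can filter df2 by the current group's reporter and year, replacing the original global sum. Within summarize(), reporter and year are the group's column vectors rather than single values, so take their first element (every row in a group shares the same key) to compare against a scalar and avoid relying on vector recycling:

    library(dplyr)

    df1 %>%
      group_by(reporter, year) %>%
      summarize(
        TI = sum(trade_value_usd[partner != "World"], na.rm = TRUE) /
          sum(trade_value_usd[partner == "World"], na.rm = TRUE) /
          (sum(df2$trade_value_usd[df2$reporter == reporter[1] & df2$year == year[1]], na.rm = TRUE) /
            sum(df3$trade_value_usd, na.rm = TRUE))
      )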
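A clearer, step-by-step version
If the code above feels too dense, you can split each calculation into its own step, which makes it easier to debug and maintain:

    df1 %>%
      group_by(reporter, year) %>%
      summarize(
        # Sum of trade values for non-World partners within the df1 group
        non_world_total = sum(trade_value_usd[partner != "World"], na.rm = TRUE),
        # Sum of trade values for the World partner within the df1 group
        world_total_df1 = sum(trade_value_usd[partner == "World"], na.rm = TRUE),
        # Sum of df2 trade values whose reporter + year match the current group;
        # reporter[1] and year[1] give the group's scalar key values
        total_df2 = sum(df2$trade_value_usd[df2$reporter == reporter[1] & df2$year == year[1]], na.rm = TRUE),
        # Global sum of df3 trade values (if df3 also needs grouping, use the same logic as for df2)
        total_df3 = sum(df3$trade_value_usd, na.rm = TRUE),
        # Final TI
        TI = non_world_total / world_total_df1 / (total_df2 / total_df3)
      )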
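Inside summarize(), a bare column name such as reporter refers to all of the current group's values, not to a single key. One way to get the key explicitly is dplyr's cur_group(), which returns a one-row tibble of the current grouping values. The following is only a minimal sketch on made-up toy data (the values are hypothetical and not taken from the question):

```r
library(dplyr)

# Hypothetical toy data, for illustration only
df1 <- tibble(
  reporter = c("A", "A", "A", "B", "B"),
  year = c(2020, 2020, 2020, 2020, 2020),
  partner = c("World", "X", "Y", "World", "X"),
  trade_value_usd = c(100, 30, 20, 200, 50)
)
df2 <- tibble(
  reporter = c("A", "A", "B"),
  year = c(2020, 2020, 2020),
  trade_value_usd = c(10, 5, 40)
)

res <- df1 %>%
  group_by(reporter, year) %>%
  summarize(
    # cur_group() gives the scalar key values of the group being summarized
    total_df2 = sum(
      df2$trade_value_usd[
        df2$reporter == cur_group()$reporter & df2$year == cur_group()$year
      ],
      na.rm = TRUE
    ),
    .groups = "drop"
  )

print(res)  # total_df2 is 15 for reporter A and 40 for reporter B
```

Using reporter[1] or cur_group()$reporter should give the same result here; cur_group() simply makes the intent more explicit.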
Optimization for large data
If your data is large, filtering df2 once per group as above can be inefficient. Instead, aggregate df2 by reporter and year up front, then join it to the grouped result of df1:

    library(dplyr)

    # Pre-aggregate df2 to get the trade total for each reporter + year
    df2_grouped <- df2 %>%
      group_by(reporter, year) %>%
      summarize(total_df2 = sum(trade_value_usd, na.rm = TRUE), .groups = "drop")

    # Summarize df1 and join the pre-aggregated df2 totals
    df1 %>%
      group_by(reporter, year) %>%
      summarize(
        non_world_total = sum(trade_value_usd[partner != "World"], na.rm = TRUE),
        world_total_df1 = sum(trade_value_usd[partner == "World"], na.rm = TRUE),
        .groups = "drop"
      ) %>%
      left_join(df2_grouped, by = c("reporter", "year")) %>%
      mutate(
        total_df3 = sum(df3$trade_value_usd, na.rm = TRUE),
        TI = non_world_total / world_total_df1 / (total_df2 / total_df3)
      )
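One behavioral difference between the join and the per-group filter is worth checking: if a reporter + year group exists in df1 but not in df2, left_join() leaves total_df2 as NA, whereas summing an empty filtered vector yields 0. The sketch below uses hypothetical toy tables (the names df1_summary and total_df2_filled are made up for illustration) and shows one possible way to align the two behaviors with coalesce(), assuming a 0 is what you want:

```r
library(dplyr)

# Hypothetical toy data: reporter "C" appears in df1's summary but not in df2
df1_summary <- tibble(
  reporter = c("A", "B", "C"),
  year = c(2020, 2020, 2020),
  non_world_total = c(50, 50, 10)
)
df2_grouped <- tibble(
  reporter = c("A", "B"),
  year = c(2020, 2020),
  total_df2 = c(15, 40)
)

joined <- df1_summary %>%
  left_join(df2_grouped, by = c("reporter", "year")) %>%
  # Unmatched groups get NA from the join; replace with 0 to mirror sum(numeric(0))
  mutate(total_df2_filled = coalesce(total_df2, 0))

print(joined)
```

Whether NA or 0 is the better choice depends on the analysis: a 0 in the denominator of TI produces Inf, so keeping the NA may be the safer default.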
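Key notes
- The code above assumes that df2 also has reporter and year columns to match against df1's grouping keys. If df2 uses different column names (for example reporting_country instead of reporter), adjust the column names in the filter condition accordingly.
- If df3 should also follow the same grouping, replace sum(df3$trade_value_usd) with the same filter-and-sum logic used for df2.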
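If df3 should indeed be grouped the same way, the pre-aggregate-and-join pattern extends naturally to it. The following is a rough sketch under the assumption that df3 also carries reporter and year columns; all data values and the df1_summary table are hypothetical:

```r
library(dplyr)

# Hypothetical toy data, for illustration only
df1_summary <- tibble(
  reporter = c("A", "B"),
  year = c(2020, 2020),
  non_world_total = c(50, 50),
  world_total_df1 = c(100, 200)
)
df2 <- tibble(
  reporter = c("A", "A", "B"),
  year = 2020,
  trade_value_usd = c(10, 5, 40)
)
df3 <- tibble(
  reporter = c("A", "B", "B"),
  year = 2020,
  trade_value_usd = c(60, 100, 60)
)

# Aggregate each lookup table once per reporter + year
df2_grouped <- df2 %>%
  group_by(reporter, year) %>%
  summarize(total_df2 = sum(trade_value_usd, na.rm = TRUE), .groups = "drop")
df3_grouped <- df3 %>%
  group_by(reporter, year) %>%
  summarize(total_df3 = sum(trade_value_usd, na.rm = TRUE), .groups = "drop")

# Join both totals onto the df1 summary and compute TI per group
result <- df1_summary %>%
  left_join(df2_grouped, by = c("reporter", "year")) %>%
  left_join(df3_grouped, by = c("reporter", "year")) %>%
  mutate(TI = non_world_total / world_total_df1 / (total_df2 / total_df3))

print(result)  # TI is 2 for reporter A and 1 for reporter B
```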
The question comes from Stack Exchange; it was asked by Andrea Dalla Rosa.




