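How can I split date rows of hospital data by year more elegantly with dplyr?
Question: splitting hospital records that span two calendar years
I have hospital data with several variables plus a validity start date (from_date) and end date (to_date) for each row. A row is valid for at most one year. Sample data: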
    test = data.frame(
      ID = c(10, 10, 10, 12, 12),
      Disease = c("P", "P", "P", "D", "P"),
      Pass = c("US", "US", "US", "EN", "EN"),
      Payment = c(110, 110, 115, 240, 255),
      from_date = as.POSIXct(c("2008-01-09", "2009-01-09", "2010-01-09", "2008-01-01", "2013-12-31")),
      to_date = as.POSIXct(c("2009-01-08", "2010-01-08", "2011-01-08", "2008-12-31", "2014-12-30"))
    )
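Any row that crosses a year boundary needs to be split into two rows, with from_date and to_date adjusted so that each row stays within a single calendar year. The target dataset looks like this: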
    test_desired = data.frame(
      ID = c(10, 10, 10, 10, 10, 10, 12, 12, 12),
      Disease = c("P", "P", "P", "P", "P", "P", "D", "P", "P"),
      Pass = c("US", "US", "US", "US", "US", "US", "EN", "EN", "EN"),
      Payment = c(110, 110, 110, 110, 115, 115, 240, 255, 255),
      from_date = as.POSIXct(c("2008-01-09", "2009-01-01", "2009-01-09", "2010-01-01", "2010-01-09",
                               "2011-01-01", "2008-01-01", "2013-12-31", "2014-01-01")),
      to_date = as.POSIXct(c("2008-12-31", "2009-01-08", "2009-12-31", "2010-01-08", "2010-12-31",
                             "2011-01-08", "2008-12-31", "2013-12-31", "2014-12-30"))
    )
I already got this working with a loop and the lubridate package:
    library(lubridate) # for function "year" below

    test_desired = test
    row = c()
    tmp = c()
    for (i in 1:nrow(test_desired)) {
      if (year(test_desired$from_date)[i] < year(test_desired$to_date)[i]) {
        test_desired$to_date[i] = as.POSIXct(paste0(year(test_desired$from_date[i]), "-12-31"))
        row = test_desired[i, ]
        row$from_date = as.POSIXct(paste0(year(test$to_date[i]), "-01-01"))
        row$to_date = test$to_date[i]
        tmp = rbind(tmp, row)
      } else next
    }
    test_desired = rbind(test_desired, tmp)

    library(dplyr)
    test_desired = arrange(test_desired, ID, from_date)
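Is there a more elegant way to do this, for example with dplyr?
Solution: dplyr + lubridate without an explicit loop
Yes. dplyr's row-wise tools, together with tidyr's unnest() and lubridate's date functions, can replace the explicit for loop. Note that rowwise() still handles one row at a time, so this is not truly vectorized, but the code is shorter and easier to read: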
    library(dplyr)
    library(tidyr)     # for unnest()
    library(lubridate) # for year()

    test_processed <- test %>%
      rowwise() %>%
      # Build the date ranges that each row should be split into
      mutate(
        date_ranges = list(
          if (year(from_date) != year(to_date)) {
            bind_rows(
              tibble(from_date = from_date,
                     to_date = as.POSIXct(paste0(year(from_date), "-12-31"))),
              tibble(from_date = as.POSIXct(paste0(year(to_date), "-01-01")),
                     to_date = to_date)
            )
          } else {
            tibble(from_date = from_date, to_date = to_date)
          }
        )
      ) %>%
      ungroup() %>%
      # Drop the original date columns so the unnested ones do not clash with them
      select(-from_date, -to_date) %>%
      # Expand the list column: one output row per date range
      unnest(date_ranges) %>%
      arrange(ID, from_date)
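Notes on the code:
- rowwise(): makes the following mutate() handle one record at a time, so each row can be checked individually for crossing a year boundary.
- The date_ranges column: for a record that crosses a year boundary it holds a small table with two date ranges; for any other record it keeps the original range.
- unnest(date_ranges): expands each small table into separate rows, which is what performs the split.
- Final sort: same as your original logic, ordered by ID and from_date.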
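If the row-wise step turns out to be slow on a large table, the same split can probably be done column-wise. The following is only a sketch, not part of the original answer: it assumes, as the question states, that no row covers more than one year, so a row touches at most two calendar years. The names first_part, second_part and test_split are made up for illustration.

```r
library(dplyr)
library(lubridate)

# Sample data copied from the question
test <- data.frame(
  ID = c(10, 10, 10, 12, 12),
  Disease = c("P", "P", "P", "D", "P"),
  Pass = c("US", "US", "US", "EN", "EN"),
  Payment = c(110, 110, 115, 240, 255),
  from_date = as.POSIXct(c("2008-01-09", "2009-01-09", "2010-01-09", "2008-01-01", "2013-12-31")),
  to_date = as.POSIXct(c("2009-01-08", "2010-01-08", "2011-01-08", "2008-12-31", "2014-12-30"))
)

# First part: every row, with to_date capped at 31 December of the start year
# when the row crosses a year boundary
first_part <- test %>%
  mutate(to_date = if_else(year(from_date) < year(to_date),
                           as.POSIXct(paste0(year(from_date), "-12-31")),
                           to_date))

# Second part: only the rows that cross a year boundary,
# restarted on 1 January of the end year
second_part <- test %>%
  filter(year(from_date) < year(to_date)) %>%
  mutate(from_date = as.POSIXct(paste0(year(to_date), "-01-01")))

# Stack both parts and restore the ordering used in the question
test_split <- bind_rows(first_part, second_part) %>%
  arrange(ID, from_date)
```

Here mutate() and filter() work on whole columns at once, so there is no per-row iteration. The trade-off is that the two-part approach would need extending if a row could ever span three or more calendar years.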
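You can compare test_processed with the target dataset test_desired: the rows should match (test_processed is a tibble, so convert it with as.data.frame() before a strict comparison). The pipeline also follows tidyverse style, which makes it easier to maintain.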
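Besides comparing against a hand-written target, a split can be sanity-checked through two invariants: no output row crosses a year boundary, and the total number of days covered is unchanged. The helper below is a hypothetical sketch (check_split is not from the original post or from any package), and it assumes from_date and to_date are inclusive bounds, as in the question.

```r
library(lubridate)

# Hypothetical helper: returns TRUE when `split` is a plausible
# year-wise split of `original`
check_split <- function(original, split) {
  # Inclusive number of days covered by all intervals in a data frame
  days_covered <- function(df) {
    sum(as.numeric(difftime(df$to_date, df$from_date, units = "days")) + 1)
  }
  no_row_crosses_year <- all(year(split$from_date) == year(split$to_date))
  same_coverage <- isTRUE(all.equal(days_covered(original), days_covered(split)))
  no_row_crosses_year && same_coverage
}

# One record from the question and its expected two-row split
original <- data.frame(
  from_date = as.POSIXct("2008-01-09"),
  to_date = as.POSIXct("2009-01-08")
)
split_ok <- data.frame(
  from_date = as.POSIXct(c("2008-01-09", "2009-01-01")),
  to_date = as.POSIXct(c("2008-12-31", "2009-01-08"))
)

result <- check_split(original, split_ok)
```

Such a check does not prove the split is right (it ignores the other columns, for instance), but it catches off-by-one-day errors at the year boundary.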
The question comes from Stack Exchange; it was asked by Helen.




