How can I split date rows by year in hospital data more elegantly with dplyr?

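Question: splitting hospital records that span two calendar years

I have a hospital dataset with several variables, plus a validity start date (from_date) and end date (to_date) for each row. Each row is valid for at most one year. Sample data: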

test = data.frame(
  ID=c(10,10,10,12,12), 
  Disease=c("P","P","P","D","P"), 
  Pass=c("US","US","US","EN","EN"), 
  Payment=c(110,110,115,240,255), 
  from_date=as.POSIXct(c("2008-01-09","2009-01-09","2010-01-09","2008-01-01","2013-12-31")), 
  to_date=as.POSIXct(c("2009-01-08","2010-01-08","2011-01-08","2008-12-31","2014-12-30")) 
)

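Any row that spans two calendar years needs to be split into two rows, with from_date and to_date adjusted accordingly, to give the following target dataset: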

test_desired = data.frame(
  ID=c(10,10,10,10,10,10,12,12,12), 
  Disease=c("P","P","P","P","P","P","D","P","P"), 
  Pass=c("US","US","US","US","US","US","EN","EN","EN"), 
  Payment=c(110,110,110,110,115,115,240,255,255), 
  from_date=as.POSIXct(c("2008-01-09","2009-01-01","2009-01-09","2010-01-01","2010-01-09","2011-01-01","2008-01-01","2013-12-31","2014-01-01")), 
  to_date=as.POSIXct(c("2008-12-31","2009-01-08","2009-12-31","2010-01-08","2010-12-31","2011-01-08","2008-12-31","2013-12-31","2014-12-30")) 
)

I already got this working with a loop and the lubridate package:

library(lubridate) #for function "year" below
test_desired=test
row=c()
tmp=c()
for(i in 1:nrow(test_desired)){
 if(year(test_desired$from_date)[i]<year(test_desired$to_date)[i]){
 test_desired$to_date[i] = as.POSIXct(paste0(year(test_desired$from_date[i]),"-12-31"))
 row = test_desired[i,]
 row$from_date = as.POSIXct(paste0(year(test$to_date[i]),"-01-01"))
 row$to_date = test$to_date[i]
 tmp=rbind(tmp,row)
 } else next
}
test_desired=rbind(test_desired,tmp)
library(dplyr)
test_desired=arrange(test_desired,ID,from_date)

Is there a more elegant way to do this, for example with dplyr?


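Solution: split the rows with dplyr, tidyr and lubridate

Yes. dplyr's row-wise tools, together with tidyr's unnest() and lubridate's date helpers, let you do this without writing an explicit loop. Note that rowwise() still evaluates one row at a time, so the result is tidier to read rather than truly vectorized: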

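library(dplyr)
library(tidyr)     # unnest() lives in tidyr, not dplyr
library(lubridate)

test_processed <- test %>%
  rowwise() %>%
  # Build a small table of date intervals for each row
  mutate(
    date_ranges = list(
      if (year(from_date) != year(to_date)) {
        # The row crosses a year boundary: cut it at 31 Dec / 1 Jan
        tibble(
          start = c(from_date, as.POSIXct(paste0(year(to_date), "-01-01"))),
          end   = c(as.POSIXct(paste0(year(from_date), "-12-31")), to_date)
        )
      } else {
        tibble(start = from_date, end = to_date)
      }
    )
  ) %>%
  ungroup() %>%
  # Drop the original date columns; the nested intervals replace them
  select(-from_date, -to_date) %>%
  # Expand the nested intervals into one row per interval
  unnest(date_ranges) %>%
  # Restore the original column names and sort
  rename(from_date = start, to_date = end) %>%
  arrange(ID, from_date)

How the code works:

  1. rowwise(): makes the following mutate() evaluate one row at a time, so each row can be checked on its own for a year crossing.
  2. date_ranges: a list column holding one small table per row, with two intervals (start/end) for rows that cross a year boundary and the original interval otherwise. The new dates are built with as.POSIXct(), as in your loop, so they keep the same type and time zone as the original columns.
  3. ungroup() and select(-from_date, -to_date): leave row-wise mode and drop the original date columns, which would otherwise clash with the nested ones.
  4. unnest(date_ranges): expands each nested table into separate rows, which performs the split. It also consumes the date_ranges column, so no further cleanup is needed.
  5. rename() and arrange(): restore the original column names and sort by ID and from_date, matching your original logic.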
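
If the data is large, a variant that avoids rowwise() altogether may be worth trying. The following is only a sketch, assuming (as stated in the question) that no row is valid for more than one year, so a row can cross at most one year boundary. The names crosses, first_part, second_part and test_vectorized are made up for this illustration.

```r
library(dplyr)
library(lubridate)

# Same sample data as in the question
test <- data.frame(
  ID = c(10, 10, 10, 12, 12),
  Disease = c("P", "P", "P", "D", "P"),
  Pass = c("US", "US", "US", "EN", "EN"),
  Payment = c(110, 110, 115, 240, 255),
  from_date = as.POSIXct(c("2008-01-09", "2009-01-09", "2010-01-09", "2008-01-01", "2013-12-31")),
  to_date = as.POSIXct(c("2009-01-08", "2010-01-08", "2011-01-08", "2008-12-31", "2014-12-30"))
)

# TRUE for rows whose interval crosses a year boundary
crosses <- year(test$from_date) < year(test$to_date)

# First part: every row, with to_date capped at 31 December of the starting year
first_part <- test %>%
  mutate(to_date = if_else(crosses,
                           as.POSIXct(paste0(year(from_date), "-12-31")),
                           to_date))

# Second part: only the crossing rows, restarted on 1 January of the ending year
second_part <- test %>%
  filter(crosses) %>%
  mutate(from_date = as.POSIXct(paste0(year(to_date), "-01-01")))

# Stack both parts and sort as in the question
test_vectorized <- bind_rows(first_part, second_part) %>%
  arrange(ID, from_date)
```

Every step here works on whole columns at once, which is why it tends to scale better than evaluating one row at a time. The trade-off is that the split logic is spread over two intermediate tables instead of sitting in a single pipeline.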

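You can compare test_processed with the target dataset test_desired: the rows should match (test_processed is a tibble rather than a plain data.frame), and the code follows tidyverse conventions, which makes it easier to maintain.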
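
Besides comparing against the target by eye, a quick sanity check can confirm that a split is plausible. The helper below, check_split(), is hypothetical and not part of any package. It is a rough sketch that tests two properties: no resulting row crosses a calendar year, and the total number of covered days (counting both endpoints) is unchanged.

```r
library(lubridate)

# Hypothetical helper: TRUE when `result` looks like a valid per-year split of `original`
check_split <- function(original, result) {
  # No row in the result may cross a calendar-year boundary
  same_year <- all(year(result$from_date) == year(result$to_date))
  # Days covered, counting both endpoints; round() absorbs daylight-saving offsets
  covered_days <- function(d) {
    sum(round(as.numeric(difftime(d$to_date, d$from_date, units = "days"))) + 1)
  }
  same_year && covered_days(original) == covered_days(result)
}

# One record from the question and its expected split
original <- data.frame(
  from_date = as.POSIXct("2008-01-09"),
  to_date = as.POSIXct("2009-01-08")
)
split_rows <- data.frame(
  from_date = as.POSIXct(c("2008-01-09", "2009-01-01")),
  to_date = as.POSIXct(c("2008-12-31", "2009-01-08"))
)

split_ok <- check_split(original, split_rows)   # valid split
unsplit_ok <- check_split(original, original)   # unsplit row still crosses a year
```

Passing this check does not prove the result is identical to the target dataset. It only catches the most likely mistakes, such as a row left unsplit or a boundary date that is off by a day.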


The question comes from Stack Exchange and was asked by Helen.