基于make和model唯一值为R数据框添加ID列并规范化
问题描述
你有一份包含汽车品牌、型号、版本等信息的数据框,需要基于make(品牌)和make+model(品牌+型号)的唯一值生成对应的ID列,最终拆分成三个规范化的表:manufacturers(品牌表)、models(型号表)和variants(版本表)。
原始数据结构如下:
df <- structure(list(make = c("audi", "audi", "audi", "audi", "bmw", "bmw", "toyota", "toyota", "toyota", "honda", "honda", "honda", "honda"), model = c("A3", "A3", "A4", "A4", "3 Series", "3 Series", "Land Cruiser", "Camry", "Camry", "Accord", "Accord", "civic", "civic"), variant = c("1.4L TFSI", "1.6L TFSI", "1.6L", "1.8L TFSI Quattro", "320d", "320d", "4.2L VX AT", "2.4L LE MT", "2.4L LE MT", "2.3L VTI AT", "2.3L VTI S", "1.8L SPORT", "1.8L V"), from_year = c(2014, 2008, 2004, 2011, 2012, 2015, 1998, 2001, 2006, 2001, 2001, 2009, 2006 ), to_year = c(2020, 2012, 2008, 2016, 2015, 2020, 2003, 2006, 2011, 2003, 2003, 2012, 2009), id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13)), class = "data.frame", row.names = c(NA, -13L))
查看原始数据:
df
make model variant from_year to_year id 1 audi A3 1.4L TFSI 2014 2020 1 2 audi A3 1.6L TFSI 2008 2012 2 3 audi A4 1.6L 2004 2008 3 4 audi A4 1.8L TFSI Quattro 2011 2016 4 5 bmw 3 Series 320d 2012 2015 5 6 bmw 3 Series 320d 2015 2020 6 7 toyota Land Cruiser 4.2L VX AT 1998 2003 7 8 toyota Camry 2.4L LE MT 2001 2006 8 9 toyota Camry 2.4L LE MT 2006 2011 9 10 honda Accord 2.3L VTI AT 2001 2003 10 11 honda Accord 2.3L VTI S 2001 2003 11 12 honda civic 1.8L SPORT 2009 2012 12 13 honda civic 1.8L V 2006 2009 13
解决方案
我们可以用dplyr包高效实现ID生成和表拆分,步骤如下:
第一步:生成中间表df1(添加make_id和vehicle_model_id)
通过分组并使用cur_group_id()函数,能轻松为每个唯一的品牌、品牌+型号组合生成连续ID:
library(dplyr) # 生成df1 df1 <- df %>% # 为每个唯一品牌生成make_id group_by(make) %>% mutate(make_id = cur_group_id()) %>% ungroup() %>% # 为每个唯一的品牌+型号组合生成vehicle_model_id group_by(make, model) %>% mutate(vehicle_model_id = cur_group_id()) %>% ungroup() %>% # 调整列顺序与期望输出一致 select(make, make_id, model, vehicle_model_id, variant, from_year, to_year, id) # 查看df1 df1
运行后得到的df1与你给出的示例完全一致:
df1
# A tibble: 13 x 8 make make_id model vehicle_model_id variant from_year to_year id <chr> <int> <chr> <int> <chr> <dbl> <dbl> <dbl> 1 audi 1 A3 1 1.4L TFSI 2014 2020 1 2 audi 1 A3 1 1.6L TFSI 2008 2012 2 3 audi 1 A4 2 1.6L 2004 2008 3 4 audi 1 A4 2 1.8L TFSI Quattro 2011 2016 4 5 bmw 2 3 Series 3 320d 2012 2015 5 6 bmw 2 3 Series 3 320d 2015 2020 6 7 toyota 3 Land Cruiser 4 4.2L VX AT 1998 2003 7 8 toyota 3 Camry 5 2.4L LE MT 2001 2006 8 9 toyota 3 Camry 5 2.4L LE MT 2006 2011 9 10 honda 4 Accord 6 2.3L VTI AT 2001 2003 10 11 honda 4 Accord 6 2.3L VTI S 2001 2003 11 12 honda 4 civic 7 1.8L SPORT 2009 2012 12 13 honda 4 civic 7 1.8L V 2006 2009 13
第二步:生成三个规范化表
基于df1,我们可以拆分出三个符合范式的表:
1. manufacturers(品牌表)
manufacturers <- df1 %>% distinct(make_id, make) %>% rename(id = make_id) %>% arrange(id) manufacturers
输出:
id make 1 1 audi 2 2 bmw 3 3 toyota 4 4 honda
2. models(型号表)
models <- df1 %>% distinct(vehicle_model_id, make_id, model) %>% rename(id = vehicle_model_id) %>% arrange(id) models
输出:
id make_id model 1 1 1 A3 2 2 1 A4 3 3 2 3 Series 4 4 3 Land Cruiser 5 5 3 Camry 6 6 4 Accord 7 7 4 civic
3. variants(版本表)
variants <- df1 %>% mutate(id = row_number()) %>% # 生成连续的版本ID select(id, name = variant, vehicle_model_id, from_year, to_year) # 查看前4行示例 head(variants, 4)
输出:
id name vehicle_model_id from_year to_year 1 1 1.4L TFSI 1 2014 2020 2 2 1.6L TFSI 1 2008 2012 3 3 1.6L 2 2004 2008 4 4 1.8L TFSI Quattro 2 2011 2016
补充说明
cur_group_id()函数会按分组出现的顺序生成ID,和示例中的ID顺序完全匹配;- 如果需要保持ID稳定性(比如后续数据更新时ID不变),可以用
factor结合unique值生成ID,示例代码如下:# 另一种生成make_id的方式 df1 <- df %>% mutate(make_id = as.integer(factor(make, levels = unique(make))))
这种方式的好处是即使数据行顺序变化,ID也不会改变。
内容的提问来源于stack exchange,提问作者Karthik S




