You need to enable JavaScript to run this app.
最新活动
大模型
产品
解决方案
定价
生态与合作
支持与服务
开发者
了解我们

基于make和model唯一值为R数据框添加ID列并规范化

问题描述

你有一份包含汽车品牌、型号、版本等信息的数据框,需要基于make(品牌)和make+model(品牌+型号)的唯一值生成对应的ID列,最终拆分成三个规范化的表:manufacturers(品牌表)、models(型号表)和variants(版本表)。

原始数据结构如下:

df <- structure(list(make = c("audi", "audi", "audi", "audi", "bmw", "bmw", "toyota", "toyota", "toyota", "honda", "honda", "honda", "honda"), model = c("A3", "A3", "A4", "A4", "3 Series", "3 Series", "Land Cruiser", "Camry", "Camry", "Accord", "Accord", "civic", "civic"), variant = c("1.4L TFSI", "1.6L TFSI", "1.6L", "1.8L TFSI Quattro", "320d", "320d", "4.2L VX AT", "2.4L LE MT", "2.4L LE MT", "2.3L VTI AT", "2.3L VTI S", "1.8L SPORT", "1.8L V"), from_year = c(2014, 2008, 2004, 2011, 2012, 2015, 1998, 2001, 2006, 2001, 2001, 2009, 2006 ), to_year = c(2020, 2012, 2008, 2016, 2015, 2020, 2003, 2006, 2011, 2003, 2003, 2012, 2009), id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13)), class = "data.frame", row.names = c(NA, -13L))

查看原始数据:

df

make       model           variant from_year to_year id
1  audi         A3        1.4L TFSI      2014    2020  1
2  audi         A3        1.6L TFSI      2008    2012  2
3  audi         A4             1.6L      2004    2008  3
4  audi         A4 1.8L TFSI Quattro      2011    2016  4
5   bmw    3 Series             320d      2012    2015  5
6   bmw    3 Series             320d      2015    2020  6
7 toyota Land Cruiser      4.2L VX AT      1998    2003  7
8 toyota       Camry      2.4L LE MT      2001    2006  8
9 toyota       Camry      2.4L LE MT      2006    2011  9
10 honda      Accord     2.3L VTI AT      2001    2003 10
11 honda      Accord      2.3L VTI S      2001    2003 11
12 honda      civic       1.8L SPORT      2009    2012 12
13 honda      civic         1.8L V        2006    2009 13
解决方案

我们可以用dplyr包高效实现ID生成和表拆分,步骤如下:

第一步:生成中间表df1(添加make_id和vehicle_model_id)

通过分组并使用cur_group_id()函数,能轻松为每个唯一的品牌、品牌+型号组合生成连续ID:

library(dplyr)

# 生成df1
df1 <- df %>%
  # 为每个唯一品牌生成make_id
  group_by(make) %>%
  mutate(make_id = cur_group_id()) %>%
  ungroup() %>%
  # 为每个唯一的品牌+型号组合生成vehicle_model_id
  group_by(make, model) %>%
  mutate(vehicle_model_id = cur_group_id()) %>%
  ungroup() %>%
  # 调整列顺序与期望输出一致
  select(make, make_id, model, vehicle_model_id, variant, from_year, to_year, id)

# 查看df1
df1

运行后得到的df1与你给出的示例完全一致:

df1

# A tibble: 13 x 8
   make  make_id model       vehicle_model_id variant           from_year to_year    id
   <chr>   <int> <chr>                  <int> <chr>                <dbl>   <dbl> <dbl>
 1 audi        1 A3                        1 1.4L TFSI             2014    2020     1
 2 audi        1 A3                        1 1.6L TFSI             2008    2012     2
 3 audi        1 A4                        2 1.6L                  2004    2008     3
 4 audi        1 A4                        2 1.8L TFSI Quattro     2011    2016     4
 5 bmw         2 3 Series                  3 320d                  2012    2015     5
 6 bmw         2 3 Series                  3 320d                  2015    2020     6
 7 toyota      3 Land Cruiser              4 4.2L VX AT            1998    2003     7
 8 toyota      3 Camry                     5 2.4L LE MT            2001    2006     8
 9 toyota      3 Camry                     5 2.4L LE MT            2006    2011     9
10 honda       4 Accord                    6 2.3L VTI AT           2001    2003    10
11 honda       4 Accord                    6 2.3L VTI S            2001    2003    11
12 honda       4 civic                     7 1.8L SPORT            2009    2012    12
13 honda       4 civic                     7 1.8L V                2006    2009    13

第二步:生成三个规范化表

基于df1,我们可以拆分出三个符合范式的表:

1. manufacturers(品牌表)

manufacturers <- df1 %>%
  distinct(make_id, make) %>%
  rename(id = make_id) %>%
  arrange(id)

manufacturers

输出:

id    make
1  1    audi
2  2     bmw
3  3  toyota
4  4   honda

2. models(型号表)

models <- df1 %>%
  distinct(vehicle_model_id, make_id, model) %>%
  rename(id = vehicle_model_id) %>%
  arrange(id)

models

输出:

id make_id       model
1  1       1         A3
2  2       1         A4
3  3       2    3 Series
4  4       3 Land Cruiser
5  5       3       Camry
6  6       4      Accord
7  7       4      civic

3. variants(版本表)

variants <- df1 %>%
  mutate(id = row_number()) %>% # 生成连续的版本ID
  select(id, name = variant, vehicle_model_id, from_year, to_year)

# 查看前4行示例
head(variants, 4)

输出:

id           name vehicle_model_id from_year to_year
1  1        1.4L TFSI                1      2014    2020
2  2        1.6L TFSI                1      2008    2012
3  3             1.6L                2      2004    2008
4  4 1.8L TFSI Quattro                2      2011    2016
补充说明
  • cur_group_id()函数会按分组出现的顺序生成ID,和示例中的ID顺序完全匹配;
  • 如果需要保持ID稳定性(比如后续数据更新时ID不变),可以用factor结合unique值生成ID,示例代码如下:
    # 另一种生成make_id的方式
    df1 <- df %>%
      mutate(make_id = as.integer(factor(make, levels = unique(make))))
    

这种方式的好处是即使数据行顺序变化,ID也不会改变。

内容的提问来源于stack exchange,提问作者Karthik S

火山引擎 最新活动