Data Cleaning in R: Removing Extra Periods from Data

阿华AIGC实验室

2026-5-27

Fixing Extra Periods When Cleaning Data in R

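Hey, I've run into this data-cleaning scenario before! Simply replacing every period won't work, because it also deletes legitimate decimal points. We need to tell two kinds of extra periods apart: a redundant period at the end of a number (for example 102.25.), and placeholders for missing values such as ., .., and ..., which should become NA. The steps below handle both cases.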
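To make the problem with blanket replacement concrete, here is a minimal base-R sketch. The vector name `sample_values` and its contents are made up for illustration; `gsub()` with `fixed = TRUE` treats the period as a literal character rather than a regex wildcard.

```r
# Hypothetical sample values, not taken from the real data set
sample_values <- c("101.25", "102.25.", "..")

# Naive approach: delete every period, wherever it appears
naive <- gsub(".", "", sample_values, fixed = TRUE)
print(naive)
# [1] "10125"  "10225"  ""
```

The decimal points are gone and the placeholder has turned into an empty string instead of NA, which is why the two cases need separate patterns.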

Step 1: Simulate the input data

First, put your input into a form R can work with; here I assume a character vector:

# Simulate your raw input data
raw_data <- c(
  "100 | 101.25 | 102.25. | . | .. | 201.5.",
  "200.05. | 200.56. | 205 | .. | . | 3000",
  "300.98 | 300.26. | 2001.56.| ... | 0.2| 5.65."
)

Step 2: Write the cleaning function (with the stringr package)

Use stringr's regex replacement functions to handle the two cases precisely:

library(stringr)

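# Define the cleaning function
clean_data <- function(input_str) {
  # 1. Replace standalone . / .. / ... with NA. The lookarounds skip any
  #    period that touches a digit or another period, so decimal points and
  #    trailing periods on numbers are left for step 2.
  step1 <- str_replace_all(input_str, "(?<![\\d.])\\.{1,3}(?![\\d.])", "NA")

  # 2. Drop extra periods at the end of a number (decimal points are kept).
  #    Regex logic: periods that follow a digit and are not followed by
  #    another digit or period.
  step2 <- str_replace_all(step1, "(?<=\\d)\\.+(?![\\d.])", "")

  # 3. Normalize the spacing around the pipes for tidier output
  final <- str_squish(str_replace_all(step2, "\\s*\\|\\s*", " | "))

  return(final)
}

# Apply the function to the data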
cleaned_result <- clean_data(raw_data)
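As a quick way to see the two cases the function has to distinguish, this base-R sketch classifies individual fields. The `tokens` vector and the column names are illustrative assumptions, and the patterns here are anchored to a single field rather than a whole pipe-delimited line.

```r
# Hypothetical fields, one per element (already split on the pipe)
tokens <- c("100", "101.25", "102.25.", ".", "..", "...", "0.2")

# Case 1: the field consists of nothing but one to three periods
is_placeholder <- grepl("^\\.{1,3}$", tokens)

# Case 2: a digit followed by one or more periods at the very end of the field
has_trailing_period <- grepl("[0-9]\\.+$", tokens)

classification <- data.frame(
  token = tokens,
  placeholder = is_placeholder,
  trailing_period = has_trailing_period
)
print(classification)
```

Only "102.25." falls into the trailing-period case, and only the three period-only fields count as placeholders; ordinary decimals such as "101.25" and "0.2" match neither pattern.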

Step 3: Check the result

Running it gives the expected output:

print(cleaned_result)
# Output:
# [1] "100 | 101.25 | 102.25 | NA | NA | 201.5"
# [2] "200.05 | 200.56 | 205 | NA | NA | 3000"
# [3] "300.98 | 300.26 | 2001.56 | NA | 0.2 | 5.65"

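If your data is a data frame (batch processing)

If the data sits in data frame columns, dplyr can clean every column at once:

library(dplyr)
library(stringr)

# Simulate the raw data as a data frame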
raw_df <- tibble(
  col1 = c("100", "200.05.", "300.98"),
  col2 = c("101.25", "200.56.", "300.26."),
  col3 = c("102.25.", "205", "2001.56."),
  col4 = c(".", "..", "..."),
  col5 = c("..", ".", "0.2"),
  col6 = c("201.5.", "3000", "5.65.")
)

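# Clean all columns at once
cleaned_df <- raw_df %>%
  mutate(across(everything(), ~ {
    # Turn cells that contain only 1-3 periods into a real missing value
    cleaned <- replace(.x, str_detect(.x, "^\\.{1,3}$"), NA_character_)
    # Drop extra periods at the end of the cell
    str_replace_all(cleaned, "(\\d+\\.?\\d*)\\.+$", "\\1")
  }))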

print(cleaned_df)
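After cleaning, the columns are still character vectors. If numeric columns are needed, one possible follow-up is sketched below in base R. `cleaned_chr` is a hypothetical stand-in for an already-cleaned data frame (only three columns, with missing values stored as real NA), not the original data.

```r
# Hypothetical already-cleaned data; missing cells are real NA values
cleaned_chr <- data.frame(
  col1 = c("100", "200.05", "300.98"),
  col4 = c(NA_character_, NA_character_, NA_character_),
  col5 = c(NA, NA, "0.2"),
  stringsAsFactors = FALSE
)

# Convert every column to numeric; missing values stay NA
cleaned_num <- as.data.frame(lapply(cleaned_chr, as.numeric))
str(cleaned_num)
```

Doing the conversion only after the stray periods are removed matters: a value like "200.05." cannot be parsed by `as.numeric()` and would become NA with a coercion warning.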