
How to distinguish column-separator commas from decimal-separator commas when reading a CSV in R


I recently ran into a frustrating problem while reading a CSV file and would appreciate help solving it.

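The file uses commas as the column separator, but some values (percentages, for example) also use a comma as the decimal separator, as in "5,78 %". Values that contain such a comma are wrapped in double quotes. I expected read.csv() or read_csv() from the tidyverse to parse this correctly, since both use the double quote as their default quote character, but neither works: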

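  • read.csv() fails with an error saying the number of columns does not match
  • read_csv() does read the file, but treats the whole file as a single column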

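I also tried splitting the columns with tidyr::separate_wider_delim(), but the core problem remains: I cannot tell a comma that separates columns apart from a comma that serves as the decimal separator inside a value.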

Below is my test data. Run this code to build the data frame, or save its contents as a CSV file, to reproduce the problem:

structure(list(browser = c("label,nb_v,nb_a,cr,nb_ac,avg,br", 
"DE,25,127,\"5,78 %\",9,00:16:59,14 %", "DK,1,9,0 %,9,00:02:57,0 %", 
"EN,1,18,100 %,18,00:28:15,0 %")), row.names = c(NA, -4L), spec = structure(list(
    cols = list(browser = structure(list(), class = c("collector_character", 
    "collector"))), default = structure(list(), class = c("collector_guess", 
    "collector")), delim = ","), class = "col_spec"), class = c("spec_tbl_df", 
"tbl_df", "tbl", "data.frame"))

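As a sanity check, the four strings stored in the `browser` column above can be passed to base R's read.csv() through its `text` argument. This is a minimal sketch that assumes those strings are exactly the lines of a file. Under that assumption the quoted "5,78 %" stays in one field, which suggests that this level of quoting is not the obstacle by itself. The names `rows`, `parsed` and `cr_num` are illustrative only.

```r
# The four lines from the test data above, as plain CSV text.
rows <- c(
  "label,nb_v,nb_a,cr,nb_ac,avg,br",
  "DE,25,127,\"5,78 %\",9,00:16:59,14 %",
  "DK,1,9,0 %,9,00:02:57,0 %",
  "EN,1,18,100 %,18,00:28:15,0 %"
)

# read.csv() treats the double quote as the quote character by default,
# so the comma inside "5,78 %" is not taken as a column separator.
parsed <- read.csv(text = rows, stringsAsFactors = FALSE)

# Turn the percentage strings into numbers: keep only digits and the
# decimal comma, then swap the comma for a dot before converting.
cr_num <- as.numeric(sub(",", ".", gsub("[^0-9,]", "", parsed$cr), fixed = TRUE))
```

Here `parsed` has seven columns, and `cr_num` holds the percentages as plain numbers.
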
I later extracted the plain-text content of the CSV file (anonymized), which shows the file format more directly:

xxxxxxxxüxxxxxxxk
xx_xxxq_vxxxtxxx,xx_vxxxtx,xx_xxtxxxx,xxx_xxtxxxx,xx_xxtxxxx_xxx_vxxxt,xvx_txxx_xx_xxtx,xxxxxx_xxtx
165,117,1555,56,5,11:16:55,15 %
xxxx xxxxx,xx_vxxxtx,xx_xxtxxxx,xxvxxxx,xx_xxtxxxx_xxx_vxxxt,xvx_txxx_xx_xxtx,xxxxxx_xxtx
xxxtxxxxxxx,115,1516,1 €,5,11:17:15,15 %
xäxxxxxk,1,5,1 €,5,11:11:57,1 %
Jxxxx,1,1,1 €,1,11:11:11,111 %
xxxxxxx,1,15,1 €,15,11:15:15,1 %
xxxwxxxxxxxxxx xxxxx,xx_vxxxtx,xx_xxtxxxx,xxxvxxxxxx_xxtx,xx_xxtxxxx_xxx_vxxxt,xvx_txxx_xx_xxtx,xxxxxx_xxtx
"xxxtxxx,115,1517,""5,55 %"",5,11:16:55,15 %"
xäxxxxx,1,5,1 %,5,11:11:57,1 %
xxxxxxxx,1,15,111 %,15,15:15:15,1 %
xxxätxtyx xxxxx,xx_vxxxtx,xx_xxtxxxx,xxvxxxx,xx_xxtxxxx_xxx_vxxxt,xvx_txxx_xx_xxtx,xxxxxx_xxtx
xxxktxx,115,1551,1 €,5.5,11:17:15,15 %
xxxxtxxxxx,5,15,1 €,5,11:15:11,55 %
Txxxxt,1,55,1 €,55,11:16:56,1 %
xxxätxxxxxxx xxxxx,xx_vxxxtx,xx_xxtxxxx,xxvxxxx,xx_xxtxxxx_xxx_vxxxt,xvx_txxx_xx_xxtx,xxxxxx_xxtx
xxxxxxxxxxx xxxktxx,115,1516,1 €,5,11:17:16,15 %
xxxxx - xxxxxxxxxxx xxxktxx,5,55,1 €,6.1,11:11:55,11 %
xxxxxxx - xxxxxy x55 5x,1,11,1 €,6,11:15:55,51 %
"xxxxxxx - xxxxxy Txx x5 Fx 11.5"" 5x,1,55,1 €,55,11:16:56,1 %"
xxxxxxx - xxxxxy Xxxvxx 6 xxx,1,5,1 €,5,11:11:51,1 %
xxxwxxx xxxxx,xx_vxxxtx,xx_xxtxxxx,xxvxxxx,xx_xxtxxxx_xxx_vxxxt,xvx_txxx_xx_xxtx,xxxxxx_xxtx
xxxxxx,151,1555,1 €,5.5,11:16:11,15 %
xxxxxxxft xxxx,55,555,1 €,11.1,11:11:15,5 %
xxxxxx xxxxxx,5,15,1 €,5,11:15:11,55 %
xxxxxxx xxxx Wxxxxxtxxxx xxxxx,xx_vxxxtx,xx_xxtxxxx,xxxvxxxxxx_xxtx,xx_xxtxxxx_xxx_vxxxt,xvx_txxx_xx_xxtx,xxxxxx_xxtx
"xxxtxx,55,551,""11,55 %"",11.5,11:17:56,11 %"
"xxxxxtxx,55,517,""5,17 %"",5.5,11:17:15,15 %"
"xxttwxxx,61,567,""5,55 %"",7.5,11:15:11,11 %"
"xxxxxxxtxx,55,151,""5,55 %"",7.5,11:17:15,11 %"
"Fxxxtxx,55,555,""5,51 %"",11.6,11:15:55,6 %"
xxxxtxx,1,5,1 %,5,11:11:57,1 %
xxxxtxx,1,15,1 %,7.5,11:16:11,1 %
xktxxxxx - Kxxxxxtxxkxx xx_xxxxvxxwx,xx_xxxq_xxxxvxxwx,xx_xxwxxxxxx,xx_xxxq_xxwxxxxxx,xx_xxtxxxkx,xx_xxxq_xxtxxxkx,xx_xxxxxxxx,xx_kxywxxxx,xxtx
565,551,1,1,1,1,1,1,565
xxxtxx xxx xxxxx,xx_vxxxtx,xx_xxtx,xxxxxx_xxtx,xvx_txxx_xx_xxxx,xxxt_xxtx
/xxxxx/xxtxxxxxtxxxxfxxxxxx/xxxx/xxxx/xxx-1115,55,115,16 %,11:11:16,16 %
/xxxxx/xxtxxxxxtxxxxfxxxxxx/xxxx/xxxx/kxxtxxxxxxxxtxx-xxwx-1115,55,55,56 %,11:11:55,51 %
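
In the dump above, some rows are wrapped as a whole in double quotes, with the inner quotes doubled. One way to spot such rows might be to count the fields on each line with utils::count.fields(). The sketch below uses made-up lines in the same shape (`sample_lines` is hypothetical, not the real file). A wrapped row is counted as a single field, while an ordinary row yields one count per column.

```r
# Hypothetical lines in the same shape as the dump: the second row is
# wrapped as a whole in double quotes, with its inner quotes doubled.
sample_lines <- c(
  "label,nb_v,nb_a,cr,nb_ac,avg,br",
  "\"DE,25,127,\"\"5,78 %\"\",9,00:16:59,14 %\"",
  "DK,1,9,0 %,9,00:02:57,0 %"
)

# Count comma-separated fields per line, honouring double quotes.
con <- textConnection(sample_lines)
n_fields <- count.fields(con, sep = ",", quote = "\"")
close(con)

# Lines that collapse to a single field are candidates for wrapped rows.
wrapped_rows <- which(n_fields == 1)
```
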

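After reading the file with read_csv(), I found that the core problem is that the quoting is parsed incorrectly. What I want most is to split the file into the intended number of columns, keeping values with an internal comma, such as "5,78 %", intact as a single column value.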
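
One possible approach is to undo the wrapping line by line and only then parse the text as CSV. It assumes the fully quoted rows in the dump went through a second round of CSV escaping, with outer quotes added and inner quotes doubled. The helper `unwrap_line()` and the sample lines below are hypothetical illustrations, not part of any package.

```r
# Hypothetical sample in the same shape as the file: row 2 is wrapped
# as a whole in double quotes, with its inner quotes doubled.
raw_lines <- c(
  "label,nb_v,nb_a,cr,nb_ac,avg,br",
  "\"DE,25,127,\"\"5,78 %\"\",9,00:16:59,14 %\"",
  "DK,1,9,0 %,9,00:02:57,0 %"
)

# Undo one level of CSV quoting for a line that is a single quoted field.
unwrap_line <- function(line) {
  # Leave lines alone unless they start and end with a double quote.
  if (!grepl("^\".*\"$", line)) return(line)
  inner <- substr(line, 2, nchar(line) - 1)
  # Only treat the line as wrapped if every inner quote is doubled.
  if (grepl("\"", gsub("\"\"", "", inner, fixed = TRUE), fixed = TRUE)) {
    return(line)
  }
  gsub("\"\"", "\"", inner, fixed = TRUE)
}

unwrapped <- vapply(raw_lines, unwrap_line, character(1), USE.NAMES = FALSE)

# Second pass: ordinary CSV parsing now sees "5,78 %" as one quoted field.
parsed <- read.csv(text = unwrapped, stringsAsFactors = FALSE)
```

This is only a sketch. A row like the one containing `11.5""` in the dump would still carry a lone quote after unwrapping and would need separate handling. The real file also appears to stack several tables with different headers, so each block would probably have to be read separately.
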

Has anyone run into a similar problem? Is there a way to parse a CSV file in this format correctly?

Source: Stack Exchange
