Drawing a violin plot from summary data: format conversion and fixing the zero-value error

阿华AIGC实验室

2026-5-15

Violin plots for archaeological artifact data: expanding summary data to observation level, and plotting directly from the summary

Hey there, let's break down your two questions and fix that annoying zero-value error you ran into with the eipi10-style approach.

1. How to convert the summary DataFrame into an observation-level DataFrame (and fix the zero-value error)

First, let's recap your data structure: you've got a wide-format summary table with a Year column, plus SiteA and SiteB columns holding the number of artifacts per year. The error pops up when a site has 0 artifacts in a year: repeating a value 0 times gives a zero-length vector, and combining that with a length-1 column (such as the year) in data.frame() fails because the column lengths differ.
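To make the failure concrete, here is a minimal base R sketch. The values in summary_df are made up, and expand_naive is a hypothetical helper that only mimics the failing pattern; it is not the original eipi10 code.

```r
# Hypothetical summary table in the shape described above (values are made up)
summary_df <- data.frame(
  Year  = c(1000, 2000, 3000),
  SiteA = c(2, 0, 1),
  SiteB = c(0, 3, 1)
)

# Hypothetical helper: expand one Year/count pair into one row per artifact.
# Year is deliberately left unrepeated to mimic the failing pattern.
expand_naive <- function(year, n) {
  data.frame(Year = year, Site = rep("SiteA", n))
}

# Count of 2: the length-1 Year is recycled to match the 2 Site values
ok <- expand_naive(summary_df$Year[1], summary_df$SiteA[1])

# Count of 0: a length-1 Year cannot be matched with a length-0 Site,
# so data.frame() stops with a "differing number of rows" error
err <- tryCatch(
  expand_naive(summary_df$Year[2], summary_df$SiteA[2]),
  error = function(e) conditionMessage(e)
)

print(nrow(ok))
print(err)
```

Repeating every column by the same count, or dropping the zero-count rows first, avoids the mismatch.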

Simplest solution (tidyverse `uncount`)

This is the cleanest way. We reshape to long format, drop the zero-count rows, then expand to observation level, all in a single pipeline:

library(tidyverse)

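# Assumes your summary data frame is named summary_df
observational_df <- summary_df %>%
  # Wide to long: Site holds the site name, Count the number of artifacts for that year
  pivot_longer(cols = c(SiteA, SiteB), names_to = "Site", values_to = "Count") %>%
  # Drop rows with no artifacts. uncount() drops zero-weight rows on its own,
  # so this mainly makes the intent explicit; it also removes any NA counts
  filter(Count > 0) %>%
  # Repeat each row Count times, giving one row per artifact
  uncount(Count)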

Manual rep approach (handles zeros, good for understanding the mechanics)

If you want to stick with a loop/rep approach (like the original eipi10 method), add a check to skip zero-count rows entirely:

library(tidyverse)

# Reshape to long format first
long_summary <- summary_df %>%
  pivot_longer(cols = c(SiteA, SiteB), names_to = "Site", values_to = "Count")

# Process row by row, skipping rows where Count is 0
# (seq_len() is safe even if long_summary has no rows, unlike 1:nrow())
observational_df <- map_dfr(seq_len(nrow(long_summary)), function(i) {
  current_count <- long_summary$Count[i]
  if (current_count == 0) {
    return(NULL)  # no artifacts, so no rows are generated
  } else {
    data.frame(
      Year = rep(long_summary$Year[i], current_count),
      Site = rep(long_summary$Site[i], current_count)
    )
  }
})

Both methods will give you a data frame where every row represents a single artifact, ready for geom_violin.
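A quick way to convince yourself the expansion worked is to check that the row count equals the total number of artifacts, then plot. The sketch below uses a made-up summary_df (your real table will differ) and the uncount pipeline from above.

```r
library(tidyverse)

# Hypothetical summary table (made-up values), including zero counts
summary_df <- tibble(
  Year  = c(1000, 2000, 3000, 4000),
  SiteA = c(2, 0, 1, 3),
  SiteB = c(0, 3, 1, 2)
)

observational_df <- summary_df %>%
  pivot_longer(cols = c(SiteA, SiteB), names_to = "Site", values_to = "Count") %>%
  filter(Count > 0) %>%
  uncount(Count)

# One row per artifact: the row count should equal the total number of artifacts
total_artifacts <- sum(summary_df$SiteA) + sum(summary_df$SiteB)
stopifnot(nrow(observational_df) == total_artifacts)

# Per-site row counts should match the column totals of the summary table
print(count(observational_df, Site))

# Plain violin plot from the observation-level data
p <- ggplot(observational_df, aes(x = Site, y = Year)) +
  geom_violin()
```

Printing p draws the plot. With only a handful of rows per site the violins will look crude, so treat this as a structural check rather than a meaningful figure.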

2. Can a violin plot be drawn directly from the summary data?

Absolutely! You don't need to expand to observation-level data at all, which is also more efficient, especially with 6000 artifacts. Violin plots are built from kernel density estimates, and geom_violin() understands a weight aesthetic that tells ggplot2 how many observations each row represents.
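Here is a small base R sketch of why weighting can stand in for expansion. It uses stats::density() directly rather than ggplot2, with made-up years and counts, and an arbitrary fixed bandwidth so that both estimates use the same smoothing.

```r
# Hypothetical summary for one site (made-up values): years and artifact counts
years  <- c(1000, 2000, 3000, 4000)
counts <- c(2, 5, 1, 3)

bw_fixed <- 400  # same bandwidth for both estimates, in years (arbitrary choice)

# Weighted estimate straight from the summary; the weights must sum to 1
d_weighted <- density(years, weights = counts / sum(counts), bw = bw_fixed,
                      from = 0, to = 5000, n = 256)

# Estimate from the expanded vector with one value per artifact
d_expanded <- density(rep(years, counts), bw = bw_fixed,
                      from = 0, to = 5000, n = 256)

# With the bandwidth fixed, the two curves agree up to floating-point error
same_curve <- isTRUE(all.equal(d_weighted$y, d_expanded$y))
print(same_curve)

# The default bandwidth rule ignores weights: it sees 4 rows in one case
# and 11 values in the other, so the automatic choice differs
print(c(summary = bw.nrd0(years), expanded = bw.nrd0(rep(years, counts))))
```

The second print is the practical caveat: as far as I can tell, ggplot2 also picks its default bandwidth from the row values without the weights, so a weighted violin and an expanded-data violin only match exactly when you set bw yourself.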

Direct plotting code

First reshape to long format (keep the Count column as weights), then plot:

library(tidyverse)

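# Reshape to long format (keep the Count column to use as weights)
long_summary <- summary_df %>%
  pivot_longer(cols = c(SiteA, SiteB), names_to = "Site", values_to = "Count") %>%
  # Drop zero-count rows: they add nothing to the weighted density,
  # but they would still stretch the violin's range and the default bandwidth
  filter(Count > 0)

# Draw the violin plot; the weight aesthetic gives the number of observations per row
ggplot(long_summary, aes(x = Site, y = Year, weight = Count)) +
  geom_violin(scale = "area") +  # scale = "area" (the default) gives every site's violin the same area
  labs(x = "Site", y = "Years before present (BP)") +
  theme_bw()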

Fine-tuning

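You can tune the violin's smoothness with the bw (bandwidth) parameter, which is the standard deviation of the smoothing kernel in the units of the y axis (years here). The value 500 below is only a starting point to adapt to your year range: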

ggplot(long_summary, aes(x = Site, y = Year, weight = Count)) +
  geom_violin(scale = "area", bw = 500) +  # adjust the bandwidth to suit the span of your years
  labs(x = "Site", y = "Years before present (BP)") +
  theme_bw()

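This approach avoids creating a 6000-row data frame. With an explicit bw it should give the same shape as the expanded data; with the default bandwidth the two can differ somewhat, because the default is chosen from the rows of the summary table rather than from the individual artifacts.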

This content is based on a question from Stack Exchange, asked by Pertinax.