Following Yarin Gal's suggestion: implementing Monte Carlo Dropout for a Keras convolutional neural network in R and estimating predictive uncertainty

Ahua AIGC Lab (阿华AIGC实验室)

2026-5-26

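Implementing Monte Carlo Dropout in R Keras (with mini-batch training and evaluation)

I have implemented Monte Carlo Dropout (MCDO) in R Keras before, following Yarin Gal's core idea: keep Dropout active at inference time and estimate predictive uncertainty from repeated stochastic forward passes. The steps below walk through it and cover the mini-batch training and evaluation you asked about.

Core idea recap

Yarin Gal's central point is that training with Dropout can be read as an approximation to the posterior distribution over the model's weights. During training Dropout acts as a regularizer; keeping it on at inference time amounts to sampling from that approximate posterior. If the same input is passed through the network N times, the mean of the predictions is the final prediction and their variance is a measure of uncertainty.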
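To make the mean-and-variance idea concrete, here is a minimal base-R sketch that does not use Keras at all. It assumes a made-up linear layer followed by softmax, with dropout applied to the inputs; the weights `W`, the input `x`, and the helper `stochastic_pass()` are all hypothetical and exist only for illustration.

```r
# Toy illustration in base R (no Keras): one linear layer + softmax with
# inverted dropout applied to the inputs. All numbers are made up.
set.seed(42)

softmax <- function(z) {
  e <- exp(z - max(z))  # subtract the max for numerical stability
  e / sum(e)
}

# Hypothetical "trained" weights: 4 input features, 3 classes
W <- matrix(c( 2.0, -1.0,  0.5,
              -0.5,  1.5,  0.0,
               1.0,  0.0, -1.0,
               0.0,  0.5,  1.0), nrow = 4, byrow = TRUE)
x <- c(1.0, 0.5, -0.3, 0.8)

# One stochastic forward pass: each input is dropped with probability `rate`
# and the survivors are rescaled by 1 / (1 - rate) (inverted dropout).
stochastic_pass <- function(x, W, rate = 0.5) {
  keep <- rbinom(length(x), size = 1, prob = 1 - rate)
  x_dropped <- x * keep / (1 - rate)
  softmax(as.vector(x_dropped %*% W))
}

n_samples <- 200
samples <- t(replicate(n_samples, stochastic_pass(x, W)))  # n_samples x 3

mc_mean <- colMeans(samples)       # final prediction (class probabilities)
mc_var  <- apply(samples, 2, var)  # per-class uncertainty

print(round(mc_mean, 3))
print(round(mc_var, 4))
```

Each row of `samples` is one draw from the approximate predictive distribution; the real model below does the same thing, only with Dropout inside a CNN.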

Step 1: Build a CNN with Dropout layers

layer_dropout() does not take a training argument. Whether Dropout is active is decided when the model is called, not when the layer is defined, so an ordinary CNN with plain Dropout layers is all that is needed here:

library(keras)

# CNN with ordinary Dropout layers; the training/inference mode is chosen at call time
model <- keras_model_sequential() %>%
  # Input + first convolution
  layer_conv_2d(filters = 32, kernel_size = c(3,3), activation = "relu", input_shape = c(28,28,1)) %>%
  # Dropout after the first convolution
  layer_dropout(rate = 0.25) %>%
  layer_max_pooling_2d(pool_size = c(2,2)) %>%
  # Second convolution block
  layer_conv_2d(filters = 64, kernel_size = c(3,3), activation = "relu") %>%
  layer_dropout(rate = 0.25) %>%
  layer_max_pooling_2d(pool_size = c(2,2)) %>%
  # Fully connected layers
  layer_flatten() %>%
  layer_dense(units = 128, activation = "relu") %>%
  layer_dropout(rate = 0.5) %>%
  layer_dense(units = 10, activation = "softmax")

# Compile the model
model %>% compile(
  optimizer = optimizer_adam(learning_rate = 1e-3),
  loss = "sparse_categorical_crossentropy",
  metrics = "accuracy"
)

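Step 2: Train the model in mini-batches

You mentioned that you already set training=TRUE during training. There are two common cases:

Case 1: the default `fit()` function

Keras's fit() runs the model in training mode, so the Dropout layers are active automatically. Just train with mini-batches:

# MNIST as example data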
mnist <- dataset_mnist()
x_train <- array_reshape(mnist$train$x, c(60000, 28, 28, 1)) / 255
y_train <- mnist$train$y

# Mini-batch training
model %>% fit(
  x = x_train, y = y_train,
  batch_size = 128,  # batch size of your choice
  epochs = 10,
  validation_split = 0.1
)

Case 2: a custom training loop (for example with `train_on_batch`)

train_on_batch() also runs the model in training mode, so Dropout is already active and nothing has to be switched on per batch. The only thing to watch is that slicing a batch keeps the channel dimension:

epochs <- 10
batch_size <- 128
num_batches <- ceiling(nrow(x_train) / batch_size)

for (epoch in 1:epochs) {
  cat("Epoch", epoch, "\n")
  epoch_loss <- 0
  
  for (batch in 1:num_batches) {
    # Slice the current batch; drop = FALSE keeps the trailing channel dimension
    start_idx <- (batch - 1) * batch_size + 1
    end_idx <- min(batch * batch_size, nrow(x_train))
    x_batch <- x_train[start_idx:end_idx, , , , drop = FALSE]
    y_batch <- y_train[start_idx:end_idx]
    
    # Train on the current batch (Dropout is active in training mode)
    batch_loss <- model %>% train_on_batch(x_batch, y_batch)
    epoch_loss <- epoch_loss + batch_loss[[1]]
    
    cat("Batch", batch, "Loss:", round(batch_loss[[1]], 4), "\n")
  }
  
  cat("Epoch", epoch, "Avg Loss:", round(epoch_loss/num_batches, 4), "\n\n")
}
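The batch slicing is easy to get subtly wrong in R, because indexing a 4-D array drops any dimension of length 1 unless told otherwise. The sketch below uses a small made-up array in place of x_train, and a hypothetical helper `batch_bounds()` (not part of Keras), to show the index arithmetic and the effect of `drop = FALSE`.

```r
# Made-up data with the same layout as x_train: (samples, height, width, channels)
x <- array(runif(10 * 4 * 4 * 1), dim = c(10, 4, 4, 1))

# Hypothetical helper: first and last row index of every mini-batch
batch_bounds <- function(n, batch_size) {
  starts <- seq(1, n, by = batch_size)
  ends <- pmin(starts + batch_size - 1, n)
  data.frame(start = starts, end = ends)
}

bounds <- batch_bounds(nrow(x), batch_size = 4)
print(bounds)  # three batches; the last one is shorter

# Without drop = FALSE the channel axis (length 1) silently disappears
dropped <- x[1:4, , , ]
kept <- x[1:4, , , , drop = FALSE]
last <- x[bounds$start[3]:bounds$end[3], , , , drop = FALSE]

print(dim(dropped))
print(dim(kept))
print(dim(last))
```

A convolutional model built with input_shape = c(28,28,1) expects the 4-D form, which is why the loop above slices with `drop = FALSE`.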

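Step 3: Mini-batch Monte Carlo prediction and uncertainty estimation

This is the key step: go through the test data batch by batch, run N stochastic forward passes per batch with Dropout switched on, then compute the mean and the variance.

# Monte Carlo prediction function with mini-batch support
mc_predict <- function(model, x, n_samples = 50, batch_size = 128) {
  # Array holding every stochastic prediction: (samples, classes, draws)
  n_classes <- model$output_shape[[2]]
  predictions <- array(0, dim = c(nrow(x), n_classes, n_samples))
  
  # Total number of batches
  num_batches <- ceiling(nrow(x) / batch_size)
  
  for (batch in 1:num_batches) {
    # Slice the current batch; drop = FALSE keeps the channel dimension
    start_idx <- (batch - 1) * batch_size + 1
    end_idx <- min(batch * batch_size, nrow(x))
    x_batch <- x[start_idx:end_idx, , , , drop = FALSE]
    
    # Draw n_samples stochastic forward passes for this batch
    for (sample in 1:n_samples) {
      # Key step: predict() always runs in inference mode (Dropout off), so call
      # the model directly with training = TRUE (assumes TensorFlow 2.x eager execution)
      pred_batch <- as.array(model(x_batch, training = TRUE))
      predictions[start_idx:end_idx, , sample] <- pred_batch
    }
  }
  
  # Per-sample predictive mean (final prediction) and variance (uncertainty)
  pred_mean <- apply(predictions, c(1,2), mean)
  pred_var <- apply(predictions, c(1,2), var)
  
  # Return the mean, the variance, and all draws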
  list(
    mean = pred_mean,
    variance = pred_var,
    all_samples = predictions
  )
}

Usage example

# Prepare the test data
x_test <- array_reshape(mnist$test$x, c(10000, 28, 28, 1)) / 255

# Run Monte Carlo prediction (30 draws, batch size 128)
mc_results <- mc_predict(model, x_test, n_samples = 30, batch_size = 128)

# Inspect the first test sample
cat("Predicted class of the first sample:", which.max(mc_results$mean[1,]) - 1, "\n")
cat("Largest per-class uncertainty (variance) of the first sample:", round(max(mc_results$variance[1,]), 4), "\n")
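Once the draws are available, a natural next step is to rank samples by how uncertain the model is about them. The sketch below is one possible way to do that, not the only one: it uses simulated draws as a stand-in for mc_results$all_samples (same layout: samples, classes, draws) and takes the variance of the predicted class as the score, which is an assumption of this example.

```r
set.seed(1)
n <- 5; n_classes <- 3; n_samples <- 30

# Simulated stand-in for mc_results$all_samples. Row i gets logit noise with
# standard deviation i, so later rows are noisier by construction.
base_logits <- matrix(c(2, 0, 0), nrow = n, ncol = n_classes, byrow = TRUE)
all_samples <- array(0, dim = c(n, n_classes, n_samples))
for (s in seq_len(n_samples)) {
  noise <- matrix(rnorm(n * n_classes, sd = seq_len(n)), nrow = n)
  e <- exp(base_logits + noise)
  all_samples[, , s] <- e / rowSums(e)  # row-wise softmax
}

pred_mean <- apply(all_samples, c(1, 2), mean)
pred_var <- apply(all_samples, c(1, 2), var)

# Predicted class per sample (0-based, matching the MNIST labels above)
pred_class <- max.col(pred_mean, ties.method = "first") - 1

# Uncertainty score: variance of the predicted class across the draws
class_var <- pred_var[cbind(seq_len(n), pred_class + 1)]

# Most uncertain samples first
ranking <- order(class_var, decreasing = TRUE)
print(data.frame(sample = ranking,
                 class = pred_class[ranking],
                 variance = round(class_var[ranking], 4)))
```

With real results, replacing `all_samples` by mc_results$all_samples gives a list of test images worth inspecting by hand.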

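Key details

Why not set a training flag on the Dropout layers?
The training flag belongs to the forward pass, not to the layer definition. fit() and train_on_batch() run in training mode, while predict() runs in inference mode with Dropout switched off. MCDO needs Dropout on at inference time, so the forward pass has to be made explicitly with model(x, training = TRUE). That flag also puts other mode-dependent layers, such as batch normalization, into training behaviour; the model here has none.
How many samples?
30 to 50 draws are a common starting point for a stable mean and variance. Going up to 100 reduces the Monte Carlo noise further, but the compute time grows linearly with the number of draws.
Why mini-batches?
With a large test set, pushing all the data through the network at once for every one of the N draws takes a lot of memory. Processing in batches bounds the memory of each forward pass and avoids out-of-memory errors while keeping the computation efficient.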
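For the question of how many draws to use, it can help to look at how the estimate settles as draws accumulate. The sketch below uses simulated values as a stand-in for one class probability of one sample across 200 stochastic passes (the distribution is an assumption made purely for illustration) and reports the running mean together with its Monte Carlo standard error, which shrinks roughly in proportion to 1/sqrt(n).

```r
set.seed(123)

# Simulated stand-in for one class probability over 200 stochastic passes
draws <- plogis(rnorm(200, mean = 1, sd = 0.8))

ns <- c(10, 30, 50, 100, 200)
summary_tbl <- data.frame(
  n_samples = ns,
  mc_mean = sapply(ns, function(n) mean(draws[1:n])),
  # Standard error of the mean estimated from the first n draws
  mc_se = sapply(ns, function(n) sd(draws[1:n]) / sqrt(n))
)
print(summary_tbl)
```

On a real model, the same table can be built from one row of mc_results$all_samples; once the standard error is small relative to the differences between classes that matter to you, more draws add little.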

The question comes from Stack Exchange and was asked by Ehtasham Billah Mymun.