How to compute prediction intervals for machine learning regression tasks in Python

阿华AIGC实验室

2026-5-11

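Hey, I dealt with a very similar situation a while ago: building prediction intervals for an ensemble model whose output is not normally distributed. There is no one-line helper like the ones ARIMA or Prophet give you, but there are several practical options. Let's go through them one by one: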

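Method 1: Residual-based bootstrap prediction intervals (the most general)

This method makes no assumption about the shape of the output distribution, which suits your case where the days-to-payment target is not normal. The core idea is to use the model's prediction residuals (actual value minus predicted value) to simulate uncertainty:

Step 1: Use the trained ensemble model to predict on the training set and compute the residual for every sample
Step 2: Repeat bootstrap sampling a few hundred to a few thousand times:
- Draw residuals at random with replacement and add them to the predictions for the test set or new data, which gives one set of simulated predictions
Step 3: For each sample, take quantiles over all of its simulated predictions (for example 2.5% and 97.5%); these are the lower and upper bounds of the 95% prediction interval

If your ensemble is expensive to train (for example, it has many weak learners), the residual bootstrap is a good fit because it never retrains the model. Example code: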

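import numpy as np
# Placeholders: replace with your own ensemble model and data
from your_ensemble import trained_ensemble_model
X_train, y_train = your_training_data
X_test = your_test_data

# Residuals on the training set. Out-of-fold or held-out residuals are
# safer, because in-sample residuals tend to understate the real error.
y_pred_train = trained_ensemble_model.predict(X_train)
residuals = np.asarray(y_train - y_pred_train).ravel()

# The point predictions do not change between draws, so compute them once
y_pred_test = trained_ensemble_model.predict(X_test)

# More bootstrap draws give more stable intervals but take longer
n_bootstrap = 1000
rng = np.random.default_rng(42)  # fixed seed for reproducibility
bootstrap_predictions = []

for _ in range(n_bootstrap):
    # Draw one residual per test sample, with replacement
    sampled_residuals = rng.choice(residuals, size=len(X_test), replace=True)
    # Simulated prediction = point prediction + sampled residual
    bootstrap_predictions.append(y_pred_test + sampled_residuals)

# Array of shape (n_bootstrap, n_test), convenient for quantiles
bootstrap_predictions = np.array(bootstrap_predictions)

# 95% prediction interval for each test sample
lower_pi = np.percentile(bootstrap_predictions, 2.5, axis=0)
upper_pi = np.percentile(bootstrap_predictions, 97.5, axis=0)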

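Note: this method assumes the residuals are independent and identically distributed. If your data has temporal dependence (for example, the invoices arrive as a time series), check the residuals for autocorrelation first and adjust the sampling scheme if needed.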
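
To make the note above concrete, here is a minimal sketch of a lag-1 autocorrelation check plus a moving-block resampler. The helper names `lag1_autocorr` and `block_resample`, the block length of 10, and the synthetic AR(1) residuals are all my own illustrative assumptions, not part of the original setup. Also keep in mind that block resampling mainly matters when you simulate whole sequences of consecutive rows (say, the total over a month); the per-row marginal interval changes little.

```python
import numpy as np

def lag1_autocorr(residuals):
    """Lag-1 sample autocorrelation of residuals ordered in time."""
    r = np.asarray(residuals, dtype=float)
    r = r - r.mean()
    denom = np.dot(r, r)
    if denom == 0.0:
        return 0.0
    return float(np.dot(r[:-1], r[1:]) / denom)

def block_resample(residuals, size, block_len, rng):
    """Draw `size` residuals by concatenating randomly chosen contiguous blocks."""
    r = np.asarray(residuals, dtype=float)
    if not 1 <= block_len <= len(r):
        raise ValueError("block_len must be between 1 and len(residuals)")
    n_blocks = -(-size // block_len)  # ceiling division
    starts = rng.integers(0, len(r) - block_len + 1, size=n_blocks)
    blocks = [r[s:s + block_len] for s in starts]
    return np.concatenate(blocks)[:size]

# Synthetic AR(1) residuals stand in for real, time-ordered model residuals.
rng = np.random.default_rng(0)
noise = rng.normal(size=500)
ar_resid = np.empty(500)
ar_resid[0] = noise[0]
for t in range(1, 500):
    ar_resid[t] = 0.7 * ar_resid[t - 1] + noise[t]

rho = lag1_autocorr(ar_resid)  # clearly positive for this series
sample = block_resample(ar_resid, size=120, block_len=10, rng=rng)
```

If `rho` is close to zero on your own residuals, the plain i.i.d. resampling in Method 1 is probably fine; if it is clearly non-zero, swapping the `rng.choice` call for something like `block_resample` is one option to try.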

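Method 2: Uncertainty estimation with multiple XGBoost models

Since your meta-learner is XGBoost, we can use randomness to simulate uncertainty: train several XGBoost models with slightly different parameters and build the interval from the distribution of their predictions:

Give each XGBoost model a different subsample (row sampling rate) and colsample_bytree (feature sampling rate), mimicking the randomness of a random forest
Collect every model's prediction for the same sample and take quantiles to get the interval

Example code: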

import xgboost as xgb
import numpy as np

# X_train, y_train and X_test are assumed to be defined as in Method 1

# A set of XGBoost parameters that introduce randomness
model_params = [
    {"subsample": 0.8, "colsample_bytree": 0.8, "objective": "reg:squarederror"},
    {"subsample": 0.7, "colsample_bytree": 0.9, "objective": "reg:squarederror"},
    {"subsample": 0.9, "colsample_bytree": 0.7, "objective": "reg:squarederror"},
    # Add many more combinations: with only three models the 2.5% and
    # 97.5% percentiles are almost just the minimum and the maximum
]

# Train one XGBoost model per parameter set
ensemble_models = []
for params in model_params:
    model = xgb.XGBRegressor(**params)
    model.fit(X_train, y_train)
    ensemble_models.append(model)

# Stack the predictions into an array of shape (n_models, n_test)
all_predictions = np.array([model.predict(X_test) for model in ensemble_models])

# 95% interval across models for each test sample
lower_pi = np.percentile(all_predictions, 2.5, axis=0)
upper_pi = np.percentile(all_predictions, 97.5, axis=0)

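The advantage of this method is that you do not need to retrain the whole weak-learner ensemble; you only change how the meta-learner is trained, so it is faster to compute. One caveat: the spread between models reflects how much the fitted model varies, not the noise in the target itself, so these intervals are usually narrower than true prediction intervals.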
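
Writing dozens of parameter dictionaries by hand gets tedious, so here is a rough sketch of generating them. The helper name `make_param_sets`, the 0.6 to 0.9 sampling range, and the choice of 50 models are my own assumptions for illustration; each dictionary also gets a distinct `random_state` so the row and column sampling actually differs between models.

```python
import numpy as np

def make_param_sets(n_models, seed=0):
    """Build `n_models` XGBoost-style parameter dicts with varied subsampling and seeds."""
    rng = np.random.default_rng(seed)
    param_sets = []
    for i in range(n_models):
        param_sets.append({
            "subsample": round(float(rng.uniform(0.6, 0.9)), 2),
            "colsample_bytree": round(float(rng.uniform(0.6, 0.9)), 2),
            "random_state": i,  # distinct seed per model
            "objective": "reg:squarederror",
        })
    return param_sets

model_params = make_param_sets(50)
```

The resulting `model_params` list can be dropped into the training loop above in place of the hand-written one.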

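Method 3: Quantile regression (model the quantiles directly)

Quantile regression lets the model predict a chosen quantile directly, which fits non-normal outputs well. XGBoost supports a quantile regression objective natively (reg:quantileerror, available from XGBoost 2.0, with the quantile level set through quantile_alpha). You can train separate models for the 2.5%, 50% (median), and 97.5% quantiles and read the prediction interval straight off them: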

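import xgboost as xgb

# Requires XGBoost >= 2.0. The quantile level is set with quantile_alpha;
# plain alpha is the L1 regularization term and would not select a quantile.
# X_train, y_train and X_test are assumed to be defined as in Method 1.

# 2.5% quantile model (lower bound of the interval)
model_lower = xgb.XGBRegressor(objective="reg:quantileerror", quantile_alpha=0.025)
model_lower.fit(X_train, y_train)

# Median model (optional, serves as the point prediction);
# reg:squarederror would estimate the mean instead of the median
model_median = xgb.XGBRegressor(objective="reg:quantileerror", quantile_alpha=0.5)
model_median.fit(X_train, y_train)

# 97.5% quantile model (upper bound of the interval)
model_upper = xgb.XGBRegressor(objective="reg:quantileerror", quantile_alpha=0.975)
model_upper.fit(X_train, y_train)

# Build the prediction interval
lower_pi = model_lower.predict(X_test)
upper_pi = model_upper.predict(X_test)
y_pred_median = model_median.predict(X_test)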

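Adapting this to your ensemble: feed the outputs of the weak learners into the XGBoost meta-learner as its inputs, then train the meta-learner with quantile regression. What you get is the prediction interval of the final ensemble, which matches your setup closely.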
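
For some intuition on why a quantile objective ends up predicting a quantile, here is a small numeric sketch of the pinball (quantile) loss. The `pinball_loss` helper and the gamma-distributed toy target are my own assumptions; this illustrates the loss itself, not XGBoost's internal implementation. Among all constant predictions, the one with the smallest pinball loss at level q lands on the empirical q-quantile.

```python
import numpy as np

def pinball_loss(y_true, y_pred, q):
    """Mean pinball (quantile) loss at level q in (0, 1)."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.maximum(q * diff, (q - 1.0) * diff)))

# Skewed synthetic target standing in for payment days (illustrative only).
rng = np.random.default_rng(0)
y = rng.gamma(shape=2.0, scale=10.0, size=5000)

# Try many constant predictions and keep the one with the smallest loss.
q = 0.975
candidates = np.linspace(y.min(), y.max(), 2001)
losses = [pinball_loss(y, c, q) for c in candidates]
best_constant = candidates[int(np.argmin(losses))]

# Should be close to best_constant, up to the grid resolution.
empirical_quantile = np.quantile(y, q)
```

The loss penalizes under-prediction with weight q and over-prediction with weight 1 - q, which is what pushes the fitted value toward the chosen quantile on skewed data.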

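Summary

If the dataset is not large and training speed is acceptable, the bootstrap is the most general option;
If you care about efficiency, quantile regression models the quantiles directly and needs no repeated sampling;
The multi-model XGBoost approach fits your meta-learner architecture most closely and is cheap to implement.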
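
Whichever method you pick, it is worth checking on held-out data that a nominal 95% interval really covers about 95% of the actual values. A minimal sketch, where the helper names and the toy numbers are assumptions of mine:

```python
import numpy as np

def interval_coverage(y_true, lower, upper):
    """Fraction of observations that fall inside [lower, upper]."""
    y_true = np.asarray(y_true, dtype=float)
    inside = (y_true >= np.asarray(lower)) & (y_true <= np.asarray(upper))
    return float(inside.mean())

def mean_interval_width(lower, upper):
    """Average width of the intervals; narrower is better at equal coverage."""
    return float(np.mean(np.asarray(upper, dtype=float) - np.asarray(lower, dtype=float)))

# Toy held-out values and intervals, for illustration only.
y_val = np.array([10.0, 20.0, 30.0, 40.0])
lower = np.array([5.0, 22.0, 25.0, 35.0])
upper = np.array([15.0, 28.0, 35.0, 45.0])

coverage = interval_coverage(y_val, lower, upper)  # 3 of 4 values are inside
width = mean_interval_width(lower, upper)
```

Coverage well below the nominal level suggests the intervals are too narrow, which is the typical failure mode when in-sample residuals or only a handful of models are used.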

The question comes from Stack Exchange; it was asked by azo91.
