Question: how to subsample a training set when the training and test distributions differ

阿华AIGC实验室

2026-5-19

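Hey there! This problem comes up all the time in data science competitions: when the training and test sets follow different distributions (the so-called distribution shift), a model you worked hard on can fall apart on the test set. No need to panic, though. There is a well-established workflow for this, so let's go through it step by step: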

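1. Quantify the distribution difference first and find the core of the problem

Before sampling, figure out which features actually differ from the test set; don't start blindly. The shift may be confined to a handful of key features, and targeting those gets you more for less effort. A few practical methods:

Numerical features: use the KS test (Kolmogorov-Smirnov test) to measure the difference between the two distributions. A smaller p-value (conventionally < 0.05) means stronger evidence that they differ. On large datasets even trivial differences produce tiny p-values, so also read the KS statistic (between 0 and 1) as the size of the gap.
Categorical features: use the chi-square test, again judging by the p-value whether the distributions differ significantly.
Visual checks: plot histograms or box plots (numerical features) or bar charts (categorical features) to compare the shape of the training and test distributions directly.

A Python example for a quick check: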

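import pandas as pd
from scipy.stats import ks_2samp, chi2_contingency

# Assumes train and test are already loaded as DataFrames; split the columns by type first.
# Only columns present in both sets are kept (the target column is usually missing from test).
numerical_cols = [col for col in train.select_dtypes(include='number').columns if col in test.columns]
categorical_cols = [col for col in train.select_dtypes(include=['object', 'category']).columns if col in test.columns]

# Check the numerical features
print("=== Numerical feature distribution differences ===")
for col in numerical_cols:
    # Drop missing values before running the two-sample KS test
    stat, p_val = ks_2samp(train[col].dropna(), test[col].dropna())
    print(f"Feature {col}: KS statistic={stat:.3f}, p-value={p_val:.3f}")

# Check the categorical features
print("\n=== Categorical feature distribution differences ===")
for col in categorical_cols:
    # Contingency table of counts: one row per category, one column per dataset.
    # pd.crosstab(train[col], test[col]) would pair rows by index instead of comparing distributions.
    contingency_table = pd.concat(
        [train[col].value_counts(), test[col].value_counts()],
        axis=1, keys=['train', 'test']
    ).fillna(0)
    stat, p_val, _, _ = chi2_contingency(contingency_table)
    print(f"Feature {col}: chi-square statistic={stat:.3f}, p-value={p_val:.3f}")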

Focus on the features with p-values below 0.05; those are the ones to match first.
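When there are many features, it can help to sort them by the size of the shift rather than read p-values one by one. The sketch below is one way to do that with the KS statistic; the function name `rank_numeric_shift` and the synthetic data are made up for illustration and are not part of the original answer.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp


def rank_numeric_shift(train_df, test_df, cols):
    """Return one row per column with its KS statistic and p-value, largest shift first."""
    rows = []
    for col in cols:
        result = ks_2samp(train_df[col].dropna(), test_df[col].dropna())
        rows.append({"feature": col, "ks_stat": result.statistic, "p_value": result.pvalue})
    return pd.DataFrame(rows).sort_values("ks_stat", ascending=False).reset_index(drop=True)


# Synthetic demo data (illustrative only): "shifted" moves by one standard
# deviation between train and test, "stable" is drawn from the same distribution.
rng = np.random.default_rng(0)
demo_train = pd.DataFrame({"stable": rng.normal(0, 1, 2000), "shifted": rng.normal(0, 1, 2000)})
demo_test = pd.DataFrame({"stable": rng.normal(0, 1, 2000), "shifted": rng.normal(1, 1, 2000)})

shift_report = rank_numeric_shift(demo_train, demo_test, ["stable", "shifted"])
print(shift_report)
```

The features at the top of `shift_report` are the natural candidates for the matching step that follows.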

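2. Sample selectively so the training subset lines up with the test set

Choose a sampling method based on what the shift looks like:

2.1 Stratified sampling (the most common basic method)

If the difference is concentrated in a few key features (for example the target variable or a core categorical feature), sample the training set directly according to the test set's proportions for those features.

For example, if category A makes up 30% of the test set and category B 70%, draw A and B from the training set in the same proportions. When the difference involves a combination of features, you can concatenate them into a "combined key" and sample by the distribution of that key: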

# Example: match the joint distribution of several categorical features
# First build a combined key for the training and test sets
train['combined_key'] = train[['cat_feature1', 'cat_feature2']].apply(
    lambda x: '_'.join(x.astype(str)), axis=1
)
test['combined_key'] = test[['cat_feature1', 'cat_feature2']].apply(
    lambda x: '_'.join(x.astype(str)), axis=1
)

# Share of each combination in the test set
test_proportions = test['combined_key'].value_counts(normalize=True)

# Draw from the training set in those proportions (total size assumed to be 50% of train; adjust as needed)
parts = []
for key, prop in test_proportions.items():
    subset = train[train['combined_key'] == key]
    # Cap at the rows available: a key that is rare or missing in train cannot reach its target share
    sample_size = min(int(len(train) * prop * 0.5), len(subset))
    if sample_size > 0:
        parts.append(subset.sample(n=sample_size, random_state=42))
matched_train = pd.concat(parts)
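A fixed sampling fraction can ask for more rows of a stratum than the training set contains. One way around that, sketched here under toy assumptions, is to compute the largest subset size that every stratum can still support, which is the minimum over strata of (training count / test share). The helper names `max_matched_size` and `stratified_match` and the toy data are hypothetical.

```python
import pandas as pd


def max_matched_size(train_keys, test_keys):
    """Largest subset size that can follow the test proportions without resampling."""
    test_props = test_keys.value_counts(normalize=True)
    train_counts = train_keys.value_counts().reindex(test_props.index, fill_value=0)
    return int((train_counts / test_props).min())


def stratified_match(train_df, test_df, key, random_state=42):
    """Sample train_df so that the shares of `key` follow those in test_df."""
    test_props = test_df[key].value_counts(normalize=True)
    total = max_matched_size(train_df[key], test_df[key])
    parts = []
    for value, prop in test_props.items():
        n = int(total * prop)  # rows to draw from this stratum
        parts.append(train_df[train_df[key] == value].sample(n=n, random_state=random_state))
    return pd.concat(parts)


# Toy data: train is 70% A / 30% B, test is 30% A / 70% B
toy_train = pd.DataFrame({"cat": ["A"] * 700 + ["B"] * 300})
toy_test = pd.DataFrame({"cat": ["A"] * 30 + ["B"] * 70})

toy_matched = stratified_match(toy_train, toy_test, "cat")
print(len(toy_matched))
print(toy_matched["cat"].value_counts(normalize=True))
```

Here category B is the bottleneck: with only 300 B rows available and a 70% target share, the subset cannot grow much beyond 428 rows. If a test stratum never appears in the training set, this sketch returns an empty subset, which is a signal that stratified sampling alone cannot close the gap.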

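2.2 Propensity score matching (for complex, multi-feature differences)

If many features are shifted, stratified sampling gets unwieldy. In that case use a propensity score: treat "does this row belong to the test set" as the target, train a classifier (logistic regression, for instance) to predict how much each training sample "looks like a test sample", then keep the high-probability samples or sample with the probabilities as weights.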

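from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder
import numpy as np

# Merge the training and test sets and label where each row came from
train['is_test'] = 0
test['is_test'] = 1
combined = pd.concat([train, test], axis=0).reset_index(drop=True)

# Preprocess the features: one-hot encode the categorical ones, keep the numerical ones.
# numerical_cols and categorical_cols come from step 1; they must exist in both sets
# and contain no missing values (so no target column and no NaN).
X = combined.drop('is_test', axis=1)
y = combined['is_test']

encoder = OneHotEncoder(sparse_output=False, drop='first')  # sparse_output needs scikit-learn >= 1.2
cat_encoded = encoder.fit_transform(X[categorical_cols])
cat_encoded_df = pd.DataFrame(cat_encoded, columns=encoder.get_feature_names_out(categorical_cols))

X_processed = pd.concat([X[numerical_cols].reset_index(drop=True), cat_encoded_df], axis=1)

# Train a logistic regression model to predict the propensity score
model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_processed, y)

# Propensity score of each training row: the predicted probability of being a test row.
# The training rows come first in `combined`, so they are the first len(train) rows.
train['propensity_score'] = model.predict_proba(X_processed.iloc[:len(train)])[:, 1]

# Use ONE of the two options below; both write to matched_train.

# Option 1: keep the top 50% by score (the rows most similar to the test set)
top_threshold = train['propensity_score'].quantile(0.5)
matched_train = train[train['propensity_score'] >= top_threshold].drop(['is_test', 'propensity_score'], axis=1)

# Option 2: random sampling weighted by score (more flexible, keeps more diversity)
weights = train['propensity_score'] / train['propensity_score'].sum()
matched_train = train.sample(n=int(len(train)*0.5), weights=weights, random_state=42)
matched_train = matched_train.drop(['is_test', 'propensity_score'], axis=1)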
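A common variant of the weighted option, offered here as an alternative rather than as the method above, uses the odds p / (1 - p) instead of the raw probability p. For a reasonably calibrated classifier the odds are proportional to the density ratio p_test(x) / p_train(x), which is the quantity importance weighting is meant to approximate. The sketch also scores rows out of fold with `cross_val_predict`, so no row is scored by a model that saw it during fitting. All data and variable names below are synthetic and illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Synthetic one-feature example: the test set is shifted by one standard deviation
rng = np.random.default_rng(0)
toy_train = pd.DataFrame({"x": rng.normal(0.0, 1.0, 4000)})
toy_test = pd.DataFrame({"x": rng.normal(1.0, 1.0, 2000)})

X_all = pd.concat([toy_train, toy_test], ignore_index=True)
y_all = np.r_[np.zeros(len(toy_train)), np.ones(len(toy_test))]

# Out-of-fold probability of "is a test row" for every row
proba = cross_val_predict(
    LogisticRegression(max_iter=1000), X_all, y_all, cv=5, method="predict_proba"
)[:, 1]

# Training rows come first; clip to avoid dividing by zero
p_train = np.clip(proba[: len(toy_train)], 1e-6, 1 - 1e-6)

# Odds as sampling weights (proportional to the estimated density ratio)
odds = pd.Series(p_train / (1.0 - p_train), index=toy_train.index)
resampled = toy_train.sample(n=1000, weights=odds, random_state=42)

print("train mean:    ", toy_train["x"].mean())
print("resampled mean:", resampled["x"].mean())
print("test mean:     ", toy_test["x"].mean())
```

Because the rows are drawn without replacement, the heaviest rows run out and the subset only moves part of the way toward the test distribution. Passing `replace=True` removes that cap at the cost of duplicated rows.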

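2.3 Optimal transport (advanced, for precise matching)

For a more precise match you can use optimal transport, which computes the "distance" between training and test samples and finds the best-matching pairs. It is computationally heavy, though (the cost matrix alone holds one entry per train/test pair), so it suits medium-sized datasets: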

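import numpy as np
import ot  # the POT package (Python Optimal Transport)
from sklearn.preprocessing import StandardScaler

# Use the numerical features as a matrix (encode categorical features first if they are needed)
train_matrix = train[numerical_cols].values
test_matrix = test[numerical_cols].values

# Standardize the features
scaler = StandardScaler()
train_matrix = scaler.fit_transform(train_matrix)
test_matrix = scaler.transform(test_matrix)

# Cost matrix (Euclidean distance), shape (n_train, n_test)
cost_matrix = ot.dist(train_matrix, test_matrix, metric='euclidean')

# Optimal transport plan; empty lists tell POT to use uniform weights on both sides
transport_plan = ot.emd([], [], cost_matrix)

# For each test sample, pick the training sample that sends it the most mass,
# then deduplicate by row position (not by row values)
matched_indices = np.unique(np.argmax(transport_plan, axis=0))
matched_train = train.iloc[matched_indices]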

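3. Validate the sampling so the effort isn't wasted

After sampling, rerun the checks from step 1 to see whether the gap between the new training subset and the test set has narrowed: run the KS and chi-square tests again, or redraw the comparison plots. If most features now have p-values above 0.05 and their test statistics have dropped, the match is reasonable and you can train on the subset. Keep in mind that a smaller subset raises p-values by itself, so a p-value above 0.05 alone does not prove the distributions match.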
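One way to run that check, sketched with synthetic data, is to put the KS statistic before and after sampling side by side, since the statistic is less sensitive to the subset size than the p-value. The function name `ks_before_after` is hypothetical, and the subset below is only a stand-in for whichever method from step 2 was used: it is drawn with the known density ratio of this toy pair of distributions.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp


def ks_before_after(train_df, matched_df, test_df, cols):
    """KS statistic against the test set for the full training set and for the subset."""
    rows = []
    for col in cols:
        before = ks_2samp(train_df[col].dropna(), test_df[col].dropna()).statistic
        after = ks_2samp(matched_df[col].dropna(), test_df[col].dropna()).statistic
        rows.append({"feature": col, "ks_before": before, "ks_after": after})
    return pd.DataFrame(rows)


# Synthetic data: train ~ N(0, 1), test ~ N(1, 1)
rng = np.random.default_rng(1)
full_train = pd.DataFrame({"x": rng.normal(0.0, 1.0, 5000)})
target_test = pd.DataFrame({"x": rng.normal(1.0, 1.0, 2000)})

# Stand-in for a step 2 method: exp(x - 0.5) is the exact density ratio N(1,1) / N(0,1)
toy_weights = np.exp(full_train["x"] - 0.5)
subset = full_train.sample(n=500, weights=toy_weights, random_state=0)

check = ks_before_after(full_train, subset, target_test, ["x"])
print(check)
```

A feature whose `ks_after` is not clearly below its `ks_before` has not been matched, whatever its p-value says.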

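A few reminders

Don't push the sampling ratio to extremes. Keep enough samples (say, at least 2-5 times the size of the test set); otherwise the model has too little data to learn from and is likely to generalize poorly.
With time-series data the shift may be tied to time. Don't sample at random in that case; pick training samples from time windows close to the test period.
In competitions you can sometimes use a domain adaptation model directly (such as Domain-Adversarial Neural Networks) to handle the shift, but sampling is more intuitive, easier to explain, and beginner-friendly.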
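For the time-series reminder, a minimal sketch of window-based selection could look like the following. It assumes a timestamp column (called `ts` here) and a 90-day window, both chosen purely for illustration; the function name `recent_window` is hypothetical.

```python
import pandas as pd


def recent_window(train_df, test_df, time_col, window):
    """Keep training rows whose timestamp falls within `window` before the earliest test row."""
    test_start = test_df[time_col].min()
    mask = (train_df[time_col] >= test_start - window) & (train_df[time_col] < test_start)
    return train_df.loc[mask]


# Toy data: one training row per day in 2023, test rows starting 2024-01-01
ts_train = pd.DataFrame({"ts": pd.date_range("2023-01-01", periods=365, freq="D"), "y": range(365)})
ts_test = pd.DataFrame({"ts": pd.date_range("2024-01-01", periods=30, freq="D")})

recent = recent_window(ts_train, ts_test, "ts", pd.Timedelta(days=90))
print(len(recent), recent["ts"].min(), recent["ts"].max())
```

The window length is a tuning choice: a shorter window tracks the test period more closely but leaves fewer rows to train on.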

The question comes from Stack Exchange; it was asked by Pooja.