Technical consultation: agricultural yield prediction code optimization, missing-value imputation, and model parameters
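I. Summary of your core questions
Your current code tackles a regression task on agricultural yield (Yield), and you have run into a few key issues:
- Using stratified K-fold triggers the error: Stratified splitting requires at least 2 samples per class
- You are unsure whether the current feature engineering and model choice are optimal
- You want to understand what each type of XGBoost parameter (General/Booster/Learning Task) does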
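II. Section-by-section code optimization suggestions
1. Data loading and feature engineering optimization
Your feature engineering already covers dates, densities, and fertilizer totals, but a few details can be tightened to avoid redundancy and latent problems: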
    import numpy as np
    import pandas as pd
    import catboost
    import lightgbm
    import xgboost
    from datetime import datetime
    from sklearn.compose import TransformedTargetRegressor
    from sklearn.model_selection import KFold, StratifiedKFold, cross_validate
    from sklearn.pipeline import Pipeline, make_pipeline
    from tensorflow.keras.layers import Dense, Input
    from tensorflow.keras.models import Model

    # Load the data and drop the identifier column
    train = pd.read_csv('Train.csv')
    test = pd.read_csv('Test.csv')
    train = train.drop('ID', axis='columns')
    test = test.drop('ID', axis='columns')

    # The target is the one column present in train but absent from test
    # (.tolist()[0] yields a plain column name instead of an Index)
    TARGET_FEATURE = train.columns.difference(test.columns).tolist()[0]

    def clean(df, training=True, train_template=None):
        # Date-difference feature: crop growing period in days (-1 if a date is unparseable)
        harv = pd.to_datetime(df['Harv_date'], errors='coerce')
        seed = pd.to_datetime(df['SeedingSowingTransplanting'], errors='coerce')
        df['TotalCropDuration'] = (harv - seed).dt.days.fillna(-1).astype(float)

        # Density/efficiency features (the duplicated computation is removed)
        df['Irrigation_Density'] = df['TransplantingIrrigationHours'] / (df['Acre'] + 1e-5)
        df['Cultivation_Intensity'] = df['CropCultLand'] / (df['CultLand'] + 1e-5)
        df['Cost_Per_Acre'] = df['TransIrriCost'] / (df['Acre'] + 1e-5)

        # Fertilizer totals
        df['Total_Urea'] = df[['BasalUrea', '1tdUrea', '2tdUrea']].sum(axis=1)
        df['Total_Basal'] = df[['BasalDAP', 'BasalUrea']].sum(axis=1)
        df['Total_Fertilizer'] = df['Total_Urea'] + df['BasalDAP'] + df['Ganaura'] + df['CropOrgFYM']
        df['Nutrient_Density'] = df['Total_Fertilizer'] / (df['Acre'] + 1e-5)

        # Categorical features, handled uniformly (key point: train and test share categories)
        for col in df.select_dtypes('object').columns:
            if training:
                # Training set: fill missing values and convert to category dtype
                df[col] = df[col].fillna('Missing').astype('category')
            else:
                # Test set: reuse the training categories; unseen values become NaN
                df[col] = pd.Categorical(
                    df[col].fillna('Missing'),
                    categories=train_template[col].cat.categories
                )
        return df

    # Note: the cleaned training set is passed in as the category template for the test set
    train = clean(train, training=True)
    test = clean(test, training=False, train_template=train)

    X_train = train.drop(TARGET_FEATURE, axis='columns')
    y_train = train[TARGET_FEATURE]
    X_test = test

    # Detect categorical features automatically
    cat_features = [col for col in X_train.columns if X_train[col].dtype.name == 'category']
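To make the date and ratio features concrete, here is a minimal sketch on toy rows. The column names follow the snippet above, but every value is invented for illustration. It shows what happens to an unparseable date and to a plot with zero acreage.

```python
import pandas as pd

# Toy rows; the column names follow the snippet above, the values are invented.
toy = pd.DataFrame({
    "SeedingSowingTransplanting": ["2022-06-01", "2022-06-15", "not a date"],
    "Harv_date": ["2022-10-01", "2022-11-02", "2022-11-20"],
    "Acre": [0.5, 0.0, 1.25],
    "TransIrriCost": [400.0, 150.0, None],
})

seed = pd.to_datetime(toy["SeedingSowingTransplanting"], errors="coerce")
harv = pd.to_datetime(toy["Harv_date"], errors="coerce")

# Unparseable dates become NaT, so the difference is NaN and falls back to -1.
toy["TotalCropDuration"] = (harv - seed).dt.days.fillna(-1).astype(float)

# The 1e-5 guard avoids division by zero, but a zero Acre yields a huge ratio.
toy["Cost_Per_Acre"] = toy["TransIrriCost"] / (toy["Acre"] + 1e-5)

print(toy[["TotalCropDuration", "Cost_Per_Acre"]])
```

The second row is worth a look: the guard prevents an error, yet the ratio lands in the millions, so it may be worth checking how many rows have zero or near-zero Acre before trusting these features.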
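Notes on the changes:
- Removed the redundant second computation of Irrigation_Density
- Added train/test category alignment: tree models with native categorical support (LightGBM/CatBoost) need the same category definitions in both sets, so the test set reuses the training categories. Any test value not seen in training becomes NaN rather than a new category
- Made the TARGET_FEATURE lookup explicit, so it yields a column name rather than an Index object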
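The category alignment can be checked on a tiny example. This is only a sketch with made-up soil-type values (they are not from your data); it shows how reusing the training categories turns an unseen test value into NaN with code -1.

```python
import pandas as pd

# Invented values for illustration only.
train_col = pd.Series(["loam", "clay", None, "loam"])
test_col = pd.Series(["clay", "sandy", None])  # "sandy" never appears in training

# Training side: fill missing values, then let pandas infer the categories.
train_cat = train_col.fillna("Missing").astype("category")

# Test side: reuse the training categories verbatim.
test_cat = pd.Series(
    pd.Categorical(test_col.fillna("Missing"), categories=train_cat.cat.categories)
)

print(list(train_cat.cat.categories))  # ['Missing', 'clay', 'loam']
print(test_cat.isna().tolist())        # [False, True, False]
print(test_cat.cat.codes.tolist())     # [1, -1, 0]
```

Because the integer codes are shared, "clay" maps to the same code in both sets, which is the property the tree models rely on.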
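2. Model and cross-validation optimization
Fixing the stratified K-fold error
You hit this error because StratifiedKFold is designed for classification: it stratifies on discrete class labels, whereas your task is regression (predicting the continuous value Yield), so it cannot be applied to the target directly. If you want stratified-style cross-validation for regression, bin the target into "pseudo-classes" and use those bins only to build the folds, while the model still trains on the continuous target: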
    # Stratified cross-validation for regression via pseudo-class binning:
    # cut the continuous target into (up to) 10 quantile bins
    y_binned = pd.qcut(y_train, q=10, labels=False, duplicates='drop')

    # Keep only bins with at least 2 samples, as stratification requires
    bin_counts = y_binned.value_counts()
    keep = y_binned.isin(bin_counts[bin_counts >= 2].index)
    X_train_filtered = X_train[keep]
    y_train_filtered = y_train[keep]
    y_binned_filtered = y_binned[keep]

    # Build the folds from the bins. Passing kf itself as cv would make
    # cross_validate stratify on the continuous target and fail again.
    kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=88301)
    cv_splits = list(kf.split(X_train_filtered, y_binned_filtered))

    # LightGBM example; columns with pandas 'category' dtype are treated
    # as categorical by default, so no explicit list is passed here
    model = lightgbm.LGBMRegressor(verbose=-1)
    results = cross_validate(
        model,
        X_train_filtered,
        y_train_filtered,
        scoring='neg_root_mean_squared_error',
        cv=cv_splits
    )
    print(f"Mean CV RMSE: {-results['test_score'].mean():.4f}")
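If you want to see the behavior in isolation, the sketch below uses synthetic data (a skewed random target standing in for Yield, not your dataset). It shows that StratifiedKFold rejects a continuous target, and that folds built from quantile bins end up with similar target means.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = pd.Series(rng.gamma(shape=2.0, scale=1500.0, size=200))  # skewed, yield-like target

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# 1) A continuous target is rejected outright.
try:
    next(skf.split(X, y))
    continuous_ok = True
except ValueError as err:
    continuous_ok = False
    print("StratifiedKFold on raw y:", err)

# 2) Quantile bins are accepted and serve only to build the folds.
y_binned = pd.qcut(y, q=10, labels=False, duplicates="drop")
fold_means = [y.iloc[val_idx].mean() for _, val_idx in skf.split(X, y_binned)]
print("validation-fold target means:", np.round(fold_means, 1))
```

The index pairs produced by `skf.split(X, y_binned)` can be handed to `cross_validate` through its `cv` argument, which accepts an iterable of (train, test) index arrays.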
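Model selection suggestions
XGBoost, LightGBM, and CatBoost, which you tried, are all first-choice models for tabular data. For agricultural data with many categorical features, the native categorical support in CatBoost and LightGBM is more convenient. Here are example parameter settings to start from: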
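    # CatBoost parameters (suited to data with many categorical features)
    cat_model = catboost.CatBoostRegressor(
        loss_function='RMSE',
        cat_features=cat_features,
        iterations=3000,
        verbose=100,
        l2_leaf_reg=24,
        learning_rate=0.05,     # smaller step size, paired with more iterations, for stability
        subsample=0.8,          # row sampling adds randomness and reduces overfitting
        colsample_bylevel=0.8   # feature sampling, reduces reliance on redundant features
    )

    # LightGBM parameters (faster to train)
    lgb_model = lightgbm.LGBMRegressor(
        objective='regression',
        metric='rmse',
        num_leaves=31,          # caps tree complexity to limit overfitting
        learning_rate=0.05,
        n_estimators=3000,
        subsample=0.8,
        subsample_freq=1,       # LightGBM applies subsample only when subsample_freq > 0
        colsample_bytree=0.8,
        verbose=-1              # 'category' dtype columns are detected as categorical automatically
    )

    # Choose one model to carry into the following steps
    model = lgb_model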
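3. Autoencoder feature fusion optimization
You tried extracting latent features with an autoencoder, but the original code lacked a validation set and the feature fusion logic. Here is a completed version: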
    # The autoencoder only sees numeric features: impute, then standardize
    from sklearn.impute import SimpleImputer
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    num_features = X_train.select_dtypes(include='number').columns.tolist()
    imputer = SimpleImputer(strategy='median')  # NaN inputs would turn the reconstruction loss into NaN
    scaler = StandardScaler()
    X_train_num_scaled = scaler.fit_transform(imputer.fit_transform(X_train[num_features]))
    X_test_num_scaled = scaler.transform(imputer.transform(X_test[num_features]))

    # Build the autoencoder
    input_dim = X_train_num_scaled.shape[1]
    encoding_dim = 8
    input_layer = Input(shape=(input_dim,))
    encoded = Dense(32, activation='relu')(input_layer)
    encoded = Dense(16, activation='relu')(encoded)
    bottleneck = Dense(encoding_dim, activation='relu', name='bottleneck')(encoded)
    decoded = Dense(16, activation='relu')(bottleneck)
    decoded = Dense(32, activation='relu')(decoded)
    output_layer = Dense(input_dim, activation='linear')(decoded)
    autoencoder = Model(inputs=input_layer, outputs=output_layer)
    autoencoder.compile(optimizer='adam', loss='mse')

    # Hold out part of the training set for validation
    X_num_train, X_num_val = train_test_split(
        X_train_num_scaled, test_size=0.2, random_state=88301
    )
    autoencoder.fit(
        X_num_train, X_num_train,
        epochs=50,
        batch_size=256,
        validation_data=(X_num_val, X_num_val)
    )

    # Extract the latent features from the bottleneck layer
    encoder = Model(inputs=autoencoder.input,
                    outputs=autoencoder.get_layer('bottleneck').output)
    X_train_latent = encoder.predict(X_train_num_scaled)
    X_test_latent = encoder.predict(X_test_num_scaled)

    # Fuse latent and original features (reset the index so rows pair up by position)
    latent_cols = [f'latent_{i}' for i in range(encoding_dim)]
    X_train_enhanced = pd.concat([
        X_train.reset_index(drop=True),
        pd.DataFrame(X_train_latent, columns=latent_cols)
    ], axis=1)
    X_test_enhanced = pd.concat([
        X_test.reset_index(drop=True),
        pd.DataFrame(X_test_latent, columns=latent_cols)
    ], axis=1)

    # Train the model on the enhanced features
    model.fit(X_train_enhanced, y_train)
    predictions = model.predict(X_test_enhanced)
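One detail in the fusion step is easy to overlook: `pd.concat(..., axis=1)` aligns rows by index label, not by position. The sketch below uses a small invented frame and a stand-in array for the encoder output to show what goes wrong when the feature frame's index has gaps (for example after filtering rows) and the index is not reset.

```python
import numpy as np
import pandas as pd

# Feature frame whose index has gaps, as left behind by row filtering.
features = pd.DataFrame({"Acre": [0.5, 1.0, 0.8]}, index=[0, 2, 5])
latent = np.array([[0.1, 0.9], [0.4, 0.2], [0.7, 0.3]])  # stand-in for encoder output
latent_df = pd.DataFrame(latent, columns=["latent_0", "latent_1"])

# Without a reset, concat aligns on index labels and pads with NaN.
misaligned = pd.concat([features, latent_df], axis=1)

# Resetting the index pairs rows by position, which is what we want here.
aligned = pd.concat([features.reset_index(drop=True), latent_df], axis=1)

print(misaligned.shape, int(misaligned.isna().sum().sum()))
print(aligned.shape, int(aligned.isna().sum().sum()))
```

The misaligned result has an extra row and several NaN cells, and a model would train on it without raising an error, which is why the reset matters.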
4. Final submission code
    final_sub = pd.read_csv("SampleSubmission.csv")
    final_sub['Yield_RMSE'] = predictions
    final_sub['Yield_MAE'] = predictions  # to differentiate the metrics, train a separate model for each
    final_sub.to_csv('Optimized_Submission.csv', index=False)  # index=False avoids an extra index column
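Before writing the file, it can help to validate the predictions. The helper below is hypothetical (`build_submission` is not part of any library, and the IDs and numbers are invented); it is a sketch that refuses to fill the submission when the prediction count is wrong or when values are NaN or infinite.

```python
import numpy as np
import pandas as pd


def build_submission(sample: pd.DataFrame, preds, columns) -> pd.DataFrame:
    """Hypothetical helper: copy predictions into the given columns after basic checks."""
    preds = np.asarray(preds, dtype=float).ravel()
    if len(preds) != len(sample):
        raise ValueError(f"expected {len(sample)} predictions, got {len(preds)}")
    if not np.isfinite(preds).all():
        raise ValueError("predictions contain NaN or infinite values")
    out = sample.copy()
    for col in columns:
        out[col] = preds
    return out


# Invented stand-in for SampleSubmission.csv
sample = pd.DataFrame({"ID": ["ID_1", "ID_2"], "Yield_RMSE": [0, 0], "Yield_MAE": [0, 0]})
sub = build_submission(sample, [512.3, 640.0], ["Yield_RMSE", "Yield_MAE"])
print(sub)
```

In your script, `sample` would be the frame read from SampleSubmission.csv and `preds` the model's `predictions`.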
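III. The three types of XGBoost parameters in detail
The three parameter types you mentioned have clearly distinct roles:
1. General Parameters
These control the overall framework and how the model runs: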
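- booster: selects the base learner. gbtree (tree model, the default, suited to nonlinear data), gblinear (linear model, suited to data with strong linear relationships), or dart (tree model with dropout, a further guard against overfitting)
- verbosity: logging level. 0 (silent), 1 (warnings only, the default), 2 (info), 3 (debug)
- gpu_id: GPU device ID for GPU-accelerated training (requires hardware support). Since XGBoost 2.0 this is deprecated in favor of the device parameter, for example device='cuda:0'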
2. Booster Parameters
These control tree structure and regularization; the core aim is to balance model complexity against overfitting:
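- max_depth: maximum tree depth, default 6. Larger values make the model more complex and more prone to overfitting
- min_child_weight: minimum sum of instance weights (Hessians) required in a child node, default 1. Larger values make the model more conservative
- subsample: fraction of rows sampled for each tree, default 1. Setting 0.8-0.9 adds randomness
- colsample_bytree: fraction of features sampled for each tree, default 1. Lower values reduce the effect of redundant features
- gamma: minimum loss reduction required to make a split, default 0. Larger values make the model more conservative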
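To see how gamma and min_child_weight interact, it helps to write out the regularized split gain described in the XGBoost documentation: Gain = 1/2 * [G_L^2/(H_L+λ) + G_R^2/(H_R+λ) - (G_L+G_R)^2/(H_L+H_R+λ)] - γ, where G and H are the sums of gradients and Hessians in each child. The functions below are a simplified sketch for illustration, not XGBoost's actual implementation; the function names and sample numbers are made up.

```python
def leaf_score(grad_sum, hess_sum, reg_lambda):
    """Structure score G^2 / (H + lambda) of a single node."""
    return grad_sum ** 2 / (hess_sum + reg_lambda)


def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Regularized loss reduction of a candidate split, minus the gamma penalty."""
    children = leaf_score(g_left, h_left, reg_lambda) + leaf_score(g_right, h_right, reg_lambda)
    parent = leaf_score(g_left + g_right, h_left + h_right, reg_lambda)
    return 0.5 * (children - parent) - gamma


def split_allowed(g_left, h_left, g_right, h_right,
                  reg_lambda=1.0, gamma=0.0, min_child_weight=1.0):
    """Simplified rule: both children heavy enough and the penalized gain positive."""
    if min(h_left, h_right) < min_child_weight:
        return False
    return split_gain(g_left, h_left, g_right, h_right, reg_lambda, gamma) > 0.0


# Squared error: each sample has gradient (prediction - y) and Hessian 1,
# so a child's Hessian sum is just its sample count.
left = (-6.0, 3.0)   # 3 samples, gradients sum to -6
right = (4.0, 2.0)   # 2 samples, gradients sum to 4

print(round(split_gain(*left, *right), 4))                 # 6.8333
print(split_allowed(*left, *right))                        # True
print(split_allowed(*left, *right, gamma=10.0))            # False
print(split_allowed(*left, *right, min_child_weight=3.0))  # False
```

For a squared-error regression such as yield prediction, this means min_child_weight behaves like a minimum number of samples per leaf.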
3. Learning Task Parameters
These define the task type and the loss function:
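- objective: the objective function. For regression use reg:squarederror (squared-error loss, the usual choice when scoring by RMSE) or reg:logistic (logistic regression for targets in [0, 1]); for classification use binary:logistic (binary) or multi:softmax (multiclass)
- eval_metric: the evaluation metric. rmse/mae for regression; auc/logloss for classification
- learning_rate (alias eta): step-size shrinkage, default 0.3. Smaller values need more n_estimators but give a more stable model. Strictly speaking, the XGBoost documentation lists this one under Booster Parameters; it is mentioned here because it is tuned together with the objective and metric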
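As a compact recap, the three groups can be written as separate dictionaries and merged into the single params mapping that XGBoost expects. The values are illustrative starting points for a regression task, not tuned settings, and the grouping follows the XGBoost documentation, which places learning_rate (eta) with the booster parameters.

```python
# General parameters: which booster to use and how it runs
general_params = {
    "booster": "gbtree",
    "verbosity": 1,
}

# Booster parameters: tree structure, sampling, regularization, step size
booster_params = {
    "max_depth": 6,
    "min_child_weight": 1,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "gamma": 0,
    "learning_rate": 0.05,
}

# Learning task parameters: what is optimized and how it is measured
task_params = {
    "objective": "reg:squarederror",
    "eval_metric": "rmse",
}

# Each name belongs to exactly one group, so merging loses nothing.
params = {**general_params, **booster_params, **task_params}
print(params)

# The merged mapping is what would be handed to the library, e.g.
#   xgboost.train(params, dtrain, num_boost_round=...)
# where dtrain is an xgboost.DMatrix built from your features and target.
```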




