Technical consultation: optimizing agricultural yield prediction code, missing-value imputation, and model parameters

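I. Summary of your core questions

Your code tackles a regression task, predicting agricultural yield (Yield), and you have run into three main issues:

  1. Trying stratified K-fold raises the error: Stratified splitting requires at least 2 samples per class
  2. You are unsure whether the current feature engineering and model choice are optimal
  3. You want to understand what each group of XGBoost parameters (General/Booster/Learning Task) does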

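II. Section-by-section optimization suggestions

1. Data loading and feature engineering

Your feature engineering already covers dates, densities, and fertilizer totals, but a few details can be tightened to remove redundancy and avoid latent problems: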

import pandas as pd
import xgboost
import lightgbm
import numpy as np
import catboost
# Explicit imports instead of wildcards, so names such as `datetime` are not silently shadowed
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import TransformedTargetRegressor
from sklearn.model_selection import KFold, cross_validate, StratifiedKFold
from datetime import datetime, timedelta
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

# Load the data and do basic preprocessing
train = pd.read_csv('Train.csv')
test = pd.read_csv('Test.csv')
train = train.drop('ID', axis='columns')
test = test.drop('ID', axis='columns')
# The target is the one column present in train but absent from test
# (.tolist()[0] yields the column name as a plain string rather than an Index)
TARGET_FEATURE = train.columns.difference(test.columns).tolist()[0]

def clean(df, training=True, train_template=None):
    df = df.copy()  # work on a copy so the caller's frame is not mutated

    # Date-difference feature: crop growth duration in days
    # (-1 marks rows where either date is missing or cannot be parsed)
    harv = pd.to_datetime(df['Harv_date'], errors='coerce')
    seed = pd.to_datetime(df['SeedingSowingTransplanting'], errors='coerce')
    df['TotalCropDuration'] = (harv - seed).dt.days.fillna(-1).astype(float)
    
    # Density/efficiency features (the small epsilon avoids division by zero;
    # the duplicated Irrigation_Density computation has been removed)
    df['Irrigation_Density'] = df['TransplantingIrrigationHours'] / (df['Acre'] + 1e-5)
    df['Cultivation_Intensity'] = df['CropCultLand'] / (df['CultLand'] + 1e-5)
    df['Cost_Per_Acre'] = df['TransIrriCost'] / (df['Acre'] + 1e-5)
    
    # Fertilizer totals
    df['Total_Urea'] = df[['BasalUrea', '1tdUrea', '2tdUrea']].sum(axis=1)
    df['Total_Basal'] = df[['BasalDAP', 'BasalUrea']].sum(axis=1)
    df['Total_Fertilizer'] = df['Total_Urea'] + df['BasalDAP'] + df['Ganaura'] + df['CropOrgFYM']
    df['Nutrient_Density'] = df['Total_Fertilizer'] / (df['Acre'] + 1e-5)
    
    # Categorical features (key point: train and test must share the same categories)
    for col in df.select_dtypes('object').columns:
        if training:
            # Training set: fill missing values, then convert to category dtype
            df[col] = df[col].fillna('Missing').astype('category')
        else:
            # Test set: reuse the training categories exactly.
            # Values never seen in training become NaN under this scheme.
            df[col] = pd.Categorical(
                df[col].fillna('Missing'),
                categories=train_template[col].cat.categories
            )
    return df

# Note: the cleaned training set is passed in as the category template for the test set
train = clean(train, training=True)
test = clean(test, training=False, train_template=train)

X_train = train.drop(TARGET_FEATURE, axis='columns')
y_train = train[TARGET_FEATURE]
X_test = test

# Detect categorical features automatically
cat_features = [col for col in X_train.columns if X_train[col].dtype.name == 'category']

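What changed

  • Removed the redundant second computation of Irrigation_Density
  • Added train/test category alignment: the test set reuses the category definitions learned from the training set, so LightGBM and CatBoost see the same encoding in both. Test values that never appeared in training become NaN under this scheme and need separate handling
  • Made the extraction of TARGET_FEATURE explicit, so it is a plain column name rather than an Index object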
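
To make the alignment behaviour concrete, here is a minimal sketch on made-up values (the column contents are purely illustrative, not taken from your dataset). It shows that a test value unseen in training turns into NaN, and one possible way to fold such values into the 'Missing' bucket instead. That fallback assumes 'Missing' is itself one of the training categories, which holds here only because the toy training column contains a missing value.

```python
import pandas as pd

# Toy columns; the values are invented for illustration.
train_col = pd.Series(["clay", "loam", None, "clay"])
test_col = pd.Series(["loam", "sand", None])  # "sand" never appears in training

# Training side: fill missing values, then convert to category dtype.
train_cat = train_col.fillna("Missing").astype("category")
known = train_cat.cat.categories

# Test side, as in clean(): reuse the training categories verbatim.
test_cat = pd.Series(pd.Categorical(test_col.fillna("Missing"), categories=known))
print(test_cat.isna().tolist())  # the unseen "sand" row is flagged as missing

# Optional fallback (an assumption, not part of the original code):
# map anything outside the known categories to "Missing" before converting.
test_safe = test_col.where(test_col.isin(known), "Missing")
test_safe_cat = pd.Series(pd.Categorical(test_safe, categories=known))
print(test_safe_cat.tolist())
```

Whether unseen categories should be treated as missing or handled differently depends on the model; it is worth checking how many test rows are affected before deciding.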

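2. Model and cross-validation

Fixing the stratified K-fold error

The error occurs because StratifiedKFold is designed for classification: it stratifies on discrete classes and needs several samples in each class. Your task is regression (predicting the continuous value Yield), where almost every target value is unique, so it cannot be applied to the raw target. If you want stratification-like cross-validation for regression, you can bin the target into "pseudo-classes":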

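# Stratified cross-validation for regression (pseudo-class binning)
# Split the continuous target into up to 10 quantile bins that act as pseudo-classes
y_binned = pd.qcut(y_train, q=10, labels=False, duplicates='drop')
# Drop pseudo-classes with fewer than 2 samples
# (rare with quantile bins; matters mainly for tiny or heavily tied targets)
bin_counts = y_binned.value_counts()
valid_bins = bin_counts[bin_counts >= 2].index
keep = y_binned.isin(valid_bins)
train_filtered = train[keep]
X_train_filtered = train_filtered.drop(TARGET_FEATURE, axis=1)
y_train_filtered = train_filtered[TARGET_FEATURE]
y_binned_filtered = y_binned[keep]

# Stratify on the bins, not on the continuous target: passing cv=kf directly would
# make cross_validate stratify on y_train_filtered and fail on a continuous target
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=88301)
splits = list(kf.split(X_train_filtered, y_binned_filtered))
# LightGBM as the example; pandas 'category' columns are detected automatically by default
model = lightgbm.LGBMRegressor(verbose=-1)
results = cross_validate(
    model, X_train_filtered, y_train_filtered,
    scoring='neg_root_mean_squared_error', cv=splits
)
print(f"CV mean RMSE: {-results['test_score'].mean():.4f}")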
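
The following self-contained sketch illustrates the idea on synthetic data (the log-normal target and the placeholder feature matrix are assumptions made only for the demo). It first shows StratifiedKFold rejecting a continuous target, then shows that stratifying on quantile bins gives every validation fold a near-identical share of each bin.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
y = pd.Series(rng.lognormal(mean=6.0, sigma=0.8, size=500))  # synthetic skewed target
X = np.zeros((len(y), 1))  # placeholder features; only y matters for the split

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# 1) A continuous target is rejected.
try:
    next(skf.split(X, y))
    raised = False
except ValueError as err:
    raised = True
    print("StratifiedKFold refused the continuous target:", err)

# 2) Quantile bins are accepted and spread evenly over the folds.
y_binned = pd.qcut(y, q=10, labels=False, duplicates="drop")
folds = list(skf.split(X, y_binned))
fold_bin_counts = [
    np.bincount(y_binned.iloc[val_idx], minlength=10) for _, val_idx in folds
]
for i, counts in enumerate(fold_bin_counts):
    print(f"fold {i}: samples per bin in validation = {counts.tolist()}")
```

The list of (train, validation) index pairs in `folds` can be passed as the `cv` argument of `cross_validate`, so the model is still scored on the original continuous target.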

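Model choice

XGBoost, LightGBM, and CatBoost, which you have tried, are all strong default choices for tabular data. For agricultural data with many categorical features, the native categorical support in CatBoost and LightGBM is more convenient. Example parameter settings as a starting point: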

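# CatBoost parameters (suited to data with many categorical features)
model = catboost.CatBoostRegressor(
    loss_function='RMSE',
    cat_features=cat_features,
    iterations=3000,
    verbose=100,
    l2_leaf_reg=24,
    learning_rate=0.05,  # smaller learning rate, paired with more iterations for stability
    subsample=0.8,  # row subsampling: adds randomness and reduces overfitting
    colsample_bylevel=0.8  # feature subsampling: reduces feature redundancy
)

# LightGBM parameters (faster to train)
# Note: this assignment replaces the CatBoost model above; keep whichever one you want to use
model = lightgbm.LGBMRegressor(
    objective='regression',
    metric='rmse',
    num_leaves=31,  # controls tree complexity to limit overfitting
    learning_rate=0.05,
    n_estimators=3000,
    subsample=0.8,
    subsample_freq=1,  # row subsampling only takes effect when this is > 0
    colsample_bytree=0.8,
    # pandas 'category' columns are detected automatically, so no categorical_feature argument is needed
    verbose=-1
)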

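3. Autoencoder feature fusion

You tried using an autoencoder to extract latent features, but the original code lacked a validation set and the feature-fusion step. A completed version: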

# The autoencoder handles numeric features only: impute, then standardize
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
num_features = X_train.select_dtypes(include='number').columns.tolist()
# Assumption: numeric columns may contain NaN, which would turn the autoencoder
# loss into NaN, so fill them with the training median first
imputer = SimpleImputer(strategy='median')
scaler = StandardScaler()
X_train_num_scaled = scaler.fit_transform(imputer.fit_transform(X_train[num_features]))
X_test_num_scaled = scaler.transform(imputer.transform(X_test[num_features]))

# Build the autoencoder
input_dim = X_train_num_scaled.shape[1]
encoding_dim = 8 

input_layer = Input(shape=(input_dim,))
encoded = Dense(32, activation='relu')(input_layer)
encoded = Dense(16, activation='relu')(encoded)
bottleneck = Dense(encoding_dim, activation='relu', name='bottleneck')(encoded)
decoded = Dense(16, activation='relu')(bottleneck)
decoded = Dense(32, activation='relu')(decoded)
output_layer = Dense(input_dim, activation='linear')(decoded)

autoencoder = Model(inputs=input_layer, outputs=output_layer)
autoencoder.compile(optimizer='adam', loss='mse')

# Hold out part of the training set for validation
from sklearn.model_selection import train_test_split
X_num_train, X_num_val = train_test_split(
    X_train_num_scaled, test_size=0.2, random_state=88301
)
autoencoder.fit(
    X_num_train, X_num_train,
    epochs=50, batch_size=256,
    validation_data=(X_num_val, X_num_val)
)

# Extract the latent features from the bottleneck layer
encoder = Model(inputs=autoencoder.input, outputs=autoencoder.get_layer('bottleneck').output)
X_train_latent = encoder.predict(X_train_num_scaled)
X_test_latent = encoder.predict(X_test_num_scaled)

# Concatenate the latent features with the original features
X_train_enhanced = pd.concat([
    X_train.reset_index(drop=True),
    pd.DataFrame(X_train_latent, columns=[f'latent_{i}' for i in range(encoding_dim)])
], axis=1)
X_test_enhanced = pd.concat([
    X_test.reset_index(drop=True),
    pd.DataFrame(X_test_latent, columns=[f'latent_{i}' for i in range(encoding_dim)])
], axis=1)

# Train the model on the enhanced features
model.fit(X_train_enhanced, y_train)
predictions = model.predict(X_test_enhanced)

4. Final submission code

final_sub = pd.read_csv("SampleSubmission.csv")
final_sub['Yield_RMSE'] = predictions 
final_sub['Yield_MAE'] = predictions  # if the two metrics are scored separately, consider training one model per metric
final_sub.to_csv('Optimized_Submission.csv', index=False)  # avoid writing an extra index column
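
Why a separate model per metric can help: the prediction that minimizes squared error is the (conditional) mean, while the one that minimizes absolute error is the (conditional) median, and the two differ when the target is skewed. A small numeric sketch with invented yield values, using a constant prediction as the simplest possible "model":

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # made-up yields with one outlier

def rmse(y_true, pred):
    """Root mean squared error of a constant prediction."""
    return float(np.sqrt(np.mean((y_true - pred) ** 2)))

def mae(y_true, pred):
    """Mean absolute error of a constant prediction."""
    return float(np.mean(np.abs(y_true - pred)))

mean_pred = float(y.mean())        # best constant under squared error
median_pred = float(np.median(y))  # best constant under absolute error

print(f"mean={mean_pred:.1f}:   RMSE={rmse(y, mean_pred):.2f}, MAE={mae(y, mean_pred):.2f}")
print(f"median={median_pred:.1f}: RMSE={rmse(y, median_pred):.2f}, MAE={mae(y, median_pred):.2f}")
```

In practice this would mean fitting one model with a squared-error objective for the Yield_RMSE column and one with an absolute-error objective for Yield_MAE, for example loss_function='MAE' in CatBoost or objective='regression_l1' in LightGBM.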

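III. The three groups of XGBoost parameters

The three groups you mentioned have clearly separate roles:

1. General Parameters

These control the overall framework and how the model runs:

  • booster: the base learner. gbtree (tree model, the default, suited to non-linear data) / gblinear (linear model, suited to data with strong linear relationships) / dart (tree model with dropout, which can further reduce overfitting)
  • verbosity: log level. 0 (silent) / 1 (warnings, the default) / 2 (info) / 3 (debug)
  • gpu_id: selects the GPU device (requires a GPU-enabled build and hardware). It is deprecated since XGBoost 2.0 in favor of the device parameter, e.g. device='cuda:0'; in older versions it also had to be combined with tree_method='gpu_hist' to actually train on the GPU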

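2. Booster Parameters

These control tree structure and regularization; the core trade-off is between model complexity and overfitting:

  • max_depth: maximum tree depth, default 6. Larger values make the model more complex and more prone to overfitting
  • min_child_weight: minimum sum of instance weights (Hessians) required in a child node, default 1. Larger values make the model more conservative
  • subsample: fraction of training rows sampled for each tree, default 1. Values of 0.8 to 0.9 add randomness
  • colsample_bytree: fraction of features sampled for each tree, default 1. Lower values reduce feature redundancy
  • gamma: minimum loss reduction required to make a split, default 0. Larger values make the model more conservative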
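
To see how gamma and min_child_weight act on a single split, here is a sketch of the textbook split-gain formula from the XGBoost paper and tutorial, Gain = 1/2 [G_L^2/(H_L+lambda) + G_R^2/(H_R+lambda) - (G_L+G_R)^2/(H_L+H_R+lambda)] - gamma, where G and H are the sums of gradients and Hessians in each child. The helper names and the numbers are hypothetical, and the library's internal scaling may differ from this formula, so treat it as an illustration of the mechanism rather than a reproduction of XGBoost's code.

```python
def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Textbook XGBoost gain of splitting a node into a left and right child."""
    def score(g, h):
        return g * g / (h + reg_lambda)
    parent = score(g_left + g_right, h_left + h_right)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right) - parent) - gamma

def split_allowed(g_left, h_left, g_right, h_right,
                  reg_lambda=1.0, gamma=0.0, min_child_weight=1.0):
    """A split is kept only if both children are heavy enough and the gain is positive."""
    if min(h_left, h_right) < min_child_weight:
        return False
    return split_gain(g_left, h_left, g_right, h_right, reg_lambda, gamma) > 0

# With squared-error loss each sample has Hessian 1, so h equals the sample count:
# a left child with 3 samples (gradient sum -6) and a right child with 5 (gradient sum +5).
stats = dict(g_left=-6.0, h_left=3.0, g_right=5.0, h_right=5.0)

print(split_gain(**stats))                          # gain with gamma = 0
print(split_allowed(**stats))                       # accepted with the defaults
print(split_allowed(**stats, gamma=10.0))           # rejected: gain falls below zero
print(split_allowed(**stats, min_child_weight=4.0)) # rejected: left child too light
```

Raising gamma removes splits whose loss reduction is small, while raising min_child_weight removes splits that would isolate only a few samples; both make the trees more conservative, as described in the list above.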

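3. Learning Task Parameters

These define the task type and the loss function:

  • objective: the objective function. For regression use reg:squarederror (squared-error loss, usually evaluated with RMSE) or reg:logistic (logistic regression, for targets in the range [0, 1]); for classification use binary:logistic (binary, outputs probabilities) or multi:softmax (multi-class, requires num_class)
  • eval_metric: evaluation metric. rmse/mae for regression; auc/logloss for classification
  • learning_rate (alias eta): step-size shrinkage, default 0.3. Strictly speaking, the XGBoost documentation lists it under Booster Parameters rather than Learning Task Parameters. Smaller values need more boosting rounds (n_estimators in the scikit-learn wrapper) but usually give a more stable model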
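
One way to keep the three groups visible in your own code is to define them as separate dictionaries and merge them before building the model. The sketch below is only an illustration: the dictionary names are hypothetical, the values are starting points rather than tuned settings, and learning_rate is placed with the Booster group, where the XGBoost documentation lists it. The final step is guarded so the snippet still runs where xgboost is not installed.

```python
# Group 1: General Parameters (overall framework)
general_params = {"booster": "gbtree", "verbosity": 1}

# Group 2: Booster Parameters (tree structure and regularization)
booster_params = {
    "max_depth": 6,
    "min_child_weight": 1,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "gamma": 0,
    "learning_rate": 0.05,
}

# Group 3: Learning Task Parameters (objective and evaluation)
task_params = {"objective": "reg:squarederror", "eval_metric": "rmse"}

# The groups do not share keys, so a plain merge is safe.
params = {**general_params, **booster_params, **task_params}
print(params)

try:
    import xgboost
    # n_estimators (number of boosting rounds) is set separately here.
    model = xgboost.XGBRegressor(n_estimators=500, **params)
except ImportError:
    model = None  # xgboost not available; the params dict is still usable elsewhere
```

Keeping the groups apart makes it easier to tune one aspect at a time, for example searching only over booster_params while the task definition stays fixed.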
