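Scikit-learn: unexpected confusion matrix results and adding class labels
Solution: unexpected sklearn classifier results + a confusion matrix with labels
For someone new to machine learning and Python, both of the problems you ran into are very common. Let's work through them step by step: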
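I. First, show class labels in the confusion matrix
To see the true/predicted class for every cell, as Weka does, wrap the confusion matrix in a Pandas DataFrame whose index and column names come from your class labels: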
    import pandas as pd
    from sklearn.metrics import confusion_matrix

    # Collect all unique class labels and sort them so the order is fixed
    class_labels = sorted(df['task'].unique())

    # Pass labels= so the rows/columns follow the same order as class_labels
    conf_mat = confusion_matrix(class_label, class_label_predicted, labels=class_labels)

    # Wrap the matrix in a labeled table
    conf_mat_df = pd.DataFrame(
        conf_mat,
        index=[f"True: {label}" for label in class_labels],         # rows: true class
        columns=[f"Predicted: {label}" for label in class_labels],  # columns: predicted class
    )

    # Easy to print or to write to a file
    print(conf_mat_df)
    file.write(conf_mat_df.to_string() + "\n")
The output is laid out much like Weka's, so you no longer have to guess which class each row and column stands for.
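To see the layout on something small, here is a minimal self-contained sketch. The labels `read`, `type`, and `walk` are made up for illustration and are not taken from your data.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical toy labels, only to illustrate the table layout
y_true = ["read", "read", "type", "type", "walk", "walk"]
y_pred = ["read", "type", "type", "type", "walk", "read"]

labels = sorted(set(y_true))  # fixed, alphabetical label order
cm = confusion_matrix(y_true, y_pred, labels=labels)

cm_df = pd.DataFrame(
    cm,
    index=[f"True: {label}" for label in labels],
    columns=[f"Predicted: {label}" for label in labels],
)
print(cm_df)
```

Each row sums to the number of samples that truly belong to that class, and the diagonal holds the correctly classified counts.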
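II. Why do sklearn and Weka differ so much, and why do the two classifiers perform almost identically?
Several details in your code differ from Weka's default behavior, which explains the odd results: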
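1. No feature preprocessing
Weka's MLP normalizes the attributes by default, but your code does not scale the features at all. MLPs are very sensitive to feature scale: when features differ by orders of magnitude, the large-valued ones dominate training, and it is easy to end up with most samples assigned to a single class.
Add feature standardization: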
    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler()
    # Standardize each feature to zero mean and unit variance
    attributes_scaled = scaler.fit_transform(attributes)
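Then pass attributes_scaled to cross-validation in place of the original attributes. Strictly speaking, fitting the scaler on all the data before cross-validation lets the test folds influence the scaling; wrapping the scaler and the classifier in a Pipeline avoids that.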
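A minimal sketch of the Pipeline variant follows. It uses synthetic data from `make_classification` as a stand-in for your `attributes` and `class_label`, and the layer size is an arbitrary choice for the demo. The scaler is refit on the training part of every fold, so the held-out fold never influences the scaling.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for attributes / class_label (assumption, not your data)
X, y = make_classification(
    n_samples=200, n_features=6, n_informative=4, n_classes=3, random_state=0
)
X[:, 0] *= 1000.0  # simulate one feature on a much larger scale

# Scaling happens inside each fold because it is part of the estimator
model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=1),
)
y_pred = cross_val_predict(model, X, y, cv=5)
print(y_pred.shape)
```

With this setup you pass the raw features to `cross_val_predict` and let the pipeline handle the scaling.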
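2. Model parameters are too conservative
- Your MLP uses hidden_layer_sizes=(5,2). Hidden layers that small limit what the network can fit. Also, solver='lbfgs' is mainly suited to small datasets; switching to solver='adam' and raising the iteration limit to max_iter=500 is usually more reliable.
- The RandomForest uses max_depth=2, which is too shallow. The trees can barely learn any structure in the features, so it is no surprise that it behaves like the MLP. Set max_depth to None (let the trees grow fully) and use more trees, e.g. n_estimators=100.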
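One way to check how much the depth limit matters is to cross-validate both settings side by side. The sketch below does this on synthetic five-class data, which is an assumption standing in for your dataset, so the numbers it prints say nothing about your results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic five-class data (assumption, not your data)
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=6,
    n_classes=5, n_clusters_per_class=1, random_state=0,
)

results = {}
for depth in (2, None):  # shallow trees vs. fully grown trees
    clf = RandomForestClassifier(max_depth=depth, n_estimators=100, random_state=0)
    scores = cross_val_score(clf, X, y, cv=5)  # accuracy per fold
    results[depth] = scores.mean()
    print(f"max_depth={depth}: mean accuracy {scores.mean():.3f}")
```

Running the same loop on your own features shows directly whether the depth limit is what holds the forest back.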
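3. Class imbalance is not handled
In the confusion matrix you posted, most samples are assigned to the fifth class, so your dataset is very likely heavily imbalanced, with the fifth class making up most of the samples. I am not aware of Weka rebalancing classes by default, so this alone probably does not explain the gap with Weka, but your sklearn code does nothing about the imbalance and it is still worth addressing.
For classifiers that support it, set class_weight='balanced' so the model reweights the classes automatically. RandomForestClassifier accepts this parameter; MLPClassifier does not, so for the MLP the imbalance has to be handled in the data instead, for example by oversampling the minority classes: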
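    from sklearn.ensemble import RandomForestClassifier
    from sklearn.neural_network import MLPClassifier

    # MLP example (MLPClassifier has no class_weight parameter)
    mlp = MLPClassifier(solver='adam', alpha=1e-5, hidden_layer_sizes=(64, 32),
                        random_state=1, max_iter=500)

    # RandomForest example
    clf = RandomForestClassifier(max_depth=None, random_state=0, n_estimators=100,
                                 class_weight='balanced')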
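To make the effect of `class_weight='balanced'` concrete: scikit-learn computes each class weight as `n_samples / (n_classes * count_of_that_class)`. The sketch below reproduces that with `compute_class_weight` on a made-up 8-to-2 label vector, which is an illustrative assumption.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: 8 samples of class 0, 2 of class 1
y = np.array([0] * 8 + [1] * 2)
classes = np.unique(y)

# 'balanced' weight = n_samples / (n_classes * count of that class)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
print(dict(zip(classes.tolist(), weights.tolist())))
# class 0: 10 / (2 * 8) = 0.625, class 1: 10 / (2 * 2) = 2.5
```

The rare class gets the larger weight, so errors on it cost more during training.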
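4. Align the cross-validation strategy
Weka uses stratified cross-validation by default (each fold keeps the overall class distribution), with 10 folds. sklearn's cross_val_predict also defaults to a stratified KFold for classifiers, but it is better to specify it explicitly to avoid surprises. The example keeps 5 folds; use n_splits=10 if you want to match Weka's default: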
    from sklearn.model_selection import StratifiedKFold, cross_val_predict

    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
    class_label_predicted = cross_val_predict(mlp, attributes_scaled, class_label, cv=cv)
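As a quick check of what "stratified" means in practice, the sketch below splits a made-up 80/20 label vector (an illustrative assumption) and counts the classes in each test fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical labels: 80 samples of class 0, 20 of class 1
y = np.array([0] * 80 + [1] * 20)
X = np.zeros((len(y), 1))  # feature values do not affect the split itself

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Count how many samples of each class land in every test fold
fold_counts = [
    np.bincount(y[test_idx], minlength=2).tolist()
    for _, test_idx in cv.split(X, y)
]
print(fold_counts)  # every fold holds 16 samples of class 0 and 4 of class 1
```

Each test fold keeps the 80/20 ratio of the full label vector, which a plain unstratified split does not guarantee.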
III. Complete code with all the improvements
Putting the changes above into your original code, the final version looks roughly like this:
    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import StratifiedKFold, cross_val_predict
    from sklearn.neural_network import MLPClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import confusion_matrix, accuracy_score


    def classify_data(df, feature_cols, file):
        nbr_folds = 5
        attributes = df.loc[:, feature_cols]
        class_label = df['task']

        # Collect and sort the class labels so the confusion matrix order is fixed
        class_labels = sorted(class_label.unique())

        # Standardize the features
        scaler = StandardScaler()
        attributes_scaled = scaler.fit_transform(attributes)

        # Use stratified KFold cross-validation explicitly
        cv = StratifiedKFold(n_splits=nbr_folds, shuffle=True, random_state=1)

        # Record which features were used
        file.write("Features used: " + ", ".join(feature_cols) + "\n\n")
        print("Features used", feature_cols)

        # MLP classifier (MLPClassifier has no class_weight parameter)
        print("=== MLP ===")
        file.write("=== MLP Classifier Results ===\n")
        mlp = MLPClassifier(
            solver='adam',
            alpha=1e-5,
            hidden_layer_sizes=(64, 32),
            random_state=1,
            max_iter=500,
        )
        class_label_predicted = cross_val_predict(mlp, attributes_scaled, class_label, cv=cv)
        conf_mat = confusion_matrix(class_label, class_label_predicted, labels=class_labels)

        # Build the labeled confusion matrix
        conf_mat_df = pd.DataFrame(
            conf_mat,
            index=[f"True: {label}" for label in class_labels],
            columns=[f"Predicted: {label}" for label in class_labels],
        )

        # Print the results
        print(conf_mat_df)
        accuracy = accuracy_score(class_label, class_label_predicted)
        print(f"\nRows classified: {len(class_label_predicted)}")
        print(f"Accuracy: {accuracy * 100:.3f}%\n")

        # Write the results to the file
        file.write(f"Classifier Settings: {mlp}\n\n")
        file.write(f"Rows classified: {len(class_label_predicted)}\n")
        file.write(f"Accuracy: {accuracy * 100:.3f}%\n\n")
        file.write("Confusion Matrix:\n")
        file.write(conf_mat_df.to_string() + "\n\n")

        # RandomForest classifier
        print("=== RandomForest ===")
        file.write("=== RandomForest Classifier Results ===\n")
        clf = RandomForestClassifier(
            max_depth=None,
            random_state=0,
            n_estimators=100,
            class_weight='balanced',
        )
        class_label_predicted = cross_val_predict(clf, attributes_scaled, class_label, cv=cv)
        conf_mat = confusion_matrix(class_label, class_label_predicted, labels=class_labels)
        conf_mat_df = pd.DataFrame(
            conf_mat,
            index=[f"True: {label}" for label in class_labels],
            columns=[f"Predicted: {label}" for label in class_labels],
        )

        print(conf_mat_df)
        accuracy = accuracy_score(class_label, class_label_predicted)
        print(f"Rows classified: {len(class_label_predicted)}")
        print(f"Accuracy: {accuracy * 100:.3f}%\n")

        file.write(f"Classifier Settings: {clf}\n\n")
        file.write(f"Rows classified: {len(class_label_predicted)}\n")
        file.write(f"Accuracy: {accuracy * 100:.3f}%\n\n")
        file.write("Confusion Matrix:\n")
        file.write(conf_mat_df.to_string() + "\n\n")
The question comes from Stack Exchange; it was asked by knalle2.




