RandomForestClassifier is listed in the scikit-learn docs as natively supporting NaN values, but it raises an error when I run it
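I have a dataset with a large number of missing values. In the scikit-learn documentation I came across a list of estimators that handle NaN values natively, and RandomForestClassifier is on that list.
However, when I fit a random forest model on this dataset, I get the following error: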
ValueError: Input X contains NaN.
RandomForestClassifier does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values
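For reference, the preprocessing route that the error message suggests (an imputer transformer in a pipeline) could look roughly like the sketch below. The synthetic data, the 20% missing rate, and the median strategy are all assumptions made for illustration and are not taken from my actual dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data (assumption): 200 samples, 4 numeric features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Blank out roughly 20% of the entries to mimic missing values.
X[rng.random(X.shape) < 0.2] = np.nan

# Impute each feature with its median before the forest sees the data.
model = make_pipeline(
    SimpleImputer(strategy="median"),
    RandomForestClassifier(n_estimators=50, random_state=0),
)
model.fit(X, y)
preds = model.predict(X)
```

Because the imputer runs inside the pipeline, the same fill values learned during fit are reused at predict time.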
I am a bit confused. Could it be that the scikit-learn documentation is out of date and lists RandomForestClassifier as supporting NaN natively by mistake?
For now, I plan to try the other algorithms on that list, starting with HistGradientBoostingClassifier.
Thanks, everyone!
Note: this content comes from Stack Exchange; the question was asked by Aaron Weidman.




