Why is the chi2 (chi-squared) test in Python sklearn feature selection not symmetric?
Why Do chi2_1 and chi2_2 Give Different Results?
Great question. The confusion comes from mixing up what scikit-learn's chi2 function is designed to do with the symmetric chi-squared independence test you're expecting. Let's break this down step by step:
1. Scikit-learn's chi2 Isn't a Symmetric Independence Test
First off: sklearn's chi2 function is built for feature selection, not for testing the statistical independence of two discrete variables. Its core purpose is to measure how well non-negative features (like word counts, boolean flags, or frequency data) can distinguish between different target classes.
The calculation logic is not symmetric:
- It first one-hot encodes the target variable y (for multi-class problems).
- It computes the total sum of each feature's values within each class (observed = Y.T @ X).
- It calculates expected values assuming the feature's distribution is the same across all classes.
- Finally, it computes a goodness-of-fit chi-squared statistic for each feature.
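The four steps above can be sketched in plain NumPy. This is a hand-rolled illustration of the described computation, not scikit-learn's own source; the function name chi2_by_hand and the toy arrays are made up for this example.

```python
import numpy as np

def chi2_by_hand(X, y):
    """Per-feature chi-squared statistic following the four steps above.

    Illustrative sketch only; assumes dense, non-negative X.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    # Step 1: one-hot encode y, one indicator column per class.
    Y = (y[:, None] == classes[None, :]).astype(float)
    # Step 2: sum of each feature's values within each class.
    observed = Y.T @ X                    # shape (n_classes, n_features)
    # Step 3: expected sums if each feature's total were spread over
    # the classes in proportion to class frequency.
    class_prob = Y.mean(axis=0)           # shape (n_classes,)
    feature_total = X.sum(axis=0)         # shape (n_features,)
    expected = np.outer(class_prob, feature_total)
    # Step 4: goodness-of-fit statistic, summed over classes per feature.
    return ((observed - expected) ** 2 / expected).sum(axis=0)

# Toy data (hypothetical): two count features, two classes.
X_demo = np.array([[1, 0], [2, 0], [0, 3], [0, 1]])
y_demo = np.array([0, 0, 1, 1])
stat_demo = chi2_by_hand(X_demo, y_demo)
print(stat_demo)  # one statistic per feature
```

Note that X and y play different roles at every step: y only defines the groups, while X supplies the values being summed. For dense non-negative input this sketch should agree with the statistics returned by sklearn.feature_selection.chi2, though that comparison is not run here.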
2. Your Input Doesn't Match sklearn's chi2 Requirements
Your code passes category labels (like 0-4 for x1_q, 0-14 for x2_q) directly as features to chi2, but this function expects non-negative frequency/count data. When you pass category labels, the calculation Y.T @ X ends up summing those label values per class—this is not a meaningful statistical measure, and it's why swapping X and y gives totally different results.
For example:
- When you run fs.chi2(y=x1_q, X=x2_m), you're calculating how well the sum of x2's category labels can distinguish between x1's classes.
- When you swap to fs.chi2(y=x2_q, X=x1_m), you're calculating how well the sum of x1's category labels can distinguish between x2's classes.
These are two entirely different tests with no reason to produce symmetric results.
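A small sketch makes this concrete. It re-implements the computation from section 1 for a single feature column in NumPy (a stand-in written for this example, not scikit-learn's code), and the label arrays a and b are hypothetical. Besides the swap, it also shifts the category labels by one, which would leave a genuine independence test untouched.

```python
import numpy as np

def chi2_stat(x, y):
    """Feature-selection-style statistic for one column x against labels y.

    Hand-rolled sketch of the computation described in section 1.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    Y = (y[:, None] == classes).astype(float)   # one-hot target
    observed = Y.T @ x                          # per-class sum of x
    expected = Y.mean(axis=0) * x.sum()         # class share of the total
    return ((observed - expected) ** 2 / expected).sum()

# Hypothetical category labels, chosen only for illustration.
a = np.array([0, 0, 1, 1, 2, 2])
b = np.array([0, 1, 1, 2, 2, 2])

forward = chi2_stat(x=b, y=a)        # b as "feature", a as target
backward = chi2_stat(x=a, y=b)       # roles swapped
relabeled = chi2_stat(x=b + 1, y=a)  # same categories, labels shifted by 1

print(forward, backward, relabeled)
```

All three numbers differ. The statistic depends both on which variable is treated as the target and on the arbitrary integers used to name the categories, which is why it cannot be read as an independence test on label-encoded data.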
3. How to Get the Symmetric Chi-Squared Test You Expect
If you want to test the independence of two discrete variables (and get symmetric results), use scipy.stats.chi2_contingency instead. This function works with contingency tables, which is the standard approach for independence tests:
from scipy.stats import chi2_contingency
import numpy as np
import pandas as pd

n = 100
cov = [[1.0, 0.75], [0.75, 1.0]]
np.random.seed(42)
x = np.random.multivariate_normal([0, 0], cov, n)
x1 = x[:, 0]
x2 = x[:, 1]
x1_q = pd.qcut(x1, q=5, labels=False)
x2_q = pd.qcut(x2, q=15, labels=False)

# Build contingency table for x1 vs x2
contingency_table = pd.crosstab(x1_q, x2_q)
chi2_stat, p_val, dof, expected = chi2_contingency(contingency_table)
print("Independence chi2 statistic:", chi2_stat)
print("p-value:", p_val)

# Swap rows and columns: the result is identical
contingency_table_reversed = pd.crosstab(x2_q, x1_q)
chi2_stat_rev, p_val_rev, dof_rev, expected_rev = chi2_contingency(contingency_table_reversed)
print("Reversed table chi2 statistic:", chi2_stat_rev)
print("Reversed table p-value:", p_val_rev)
Running this code shows that the chi-squared statistic and p-value are the same when the variables are swapped, which is the symmetric behavior you expected initially.
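The reason the swap cannot matter follows from the Pearson statistic computed on a contingency table. As a brief sketch in standard notation (the symbols O, E, R, C, and n are introduced here and do not appear in the original answer):

```latex
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad
E_{ij} = \frac{R_i \, C_j}{n}
```

Here O_{ij} is the observed count in row i and column j, R_i and C_j are the row and column totals, and n is the grand total. Transposing the table only swaps the roles of i and j (and of R and C), so every term of the double sum is unchanged, and the degrees of freedom (r - 1)(c - 1) are unchanged as well.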
The question originates from Stack Exchange; it was asked by Peter Tessin.




