Why Is the Chi2 (Chi-Squared) Test in Python sklearn Feature Selection Not Symmetric?

阿华AIGC实验室

2026-5-12

Why Do chi2_1 and chi2_2 Give Different Results?

Great question. The confusion comes from mixing up what scikit-learn's chi2 function is designed to do with the symmetric chi-squared independence test you're expecting. Let's break this down step by step:

1. Scikit-learn's `chi2` Isn't a Symmetric Independence Test

First off: sklearn's chi2 function is built for feature selection, not for testing the statistical independence of two discrete variables. Its core purpose is to measure how well non-negative features (like word counts, boolean flags, or frequency data) can distinguish between different target classes.

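The calculation is not symmetric, because X and y play different roles: y is used only as class labels, while the values in X are treated as counts that get summed: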

It first one-hot encodes the target variable y (for multi-class problems).
It computes the total sum of each feature's values within each class (observed = Y.T @ X).
It calculates expected values assuming the feature's distribution is the same across all classes (each class's share of the samples multiplied by the feature's overall total).
Finally, it computes a goodness-of-fit chi-squared statistic for each feature.
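The four steps above can be sketched in plain NumPy. This is only an illustration of the described logic, not sklearn's actual source; the function name chi2_stat_sketch and the toy data are made up for this example.

```python
import numpy as np

def chi2_stat_sketch(X, y):
    """Per-feature chi-squared statistic, following the four steps above.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    # Step 1: one-hot encode y -> shape (n_samples, n_classes)
    Y = (y[:, None] == classes[None, :]).astype(float)
    # Step 2: sum each feature's values within each class
    observed = Y.T @ X
    # Step 3: expected sums if every class had the same feature distribution
    class_prob = Y.mean(axis=0)      # share of samples in each class
    feature_total = X.sum(axis=0)    # overall total of each feature
    expected = np.outer(class_prob, feature_total)
    # Step 4: goodness-of-fit statistic, one value per feature
    return ((observed - expected) ** 2 / expected).sum(axis=0)

# Toy data: feature 0 has equal sums in both classes,
# feature 1 is concentrated entirely in class 0.
X_toy = np.array([[1, 4], [3, 4], [2, 0], [2, 0]])
y_toy = np.array([0, 0, 1, 1])
toy_stats = chi2_stat_sketch(X_toy, y_toy)
print(toy_stats)
```

Note that y only ever selects rows, while the values of X are added up. That is why the two arguments cannot simply be swapped.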

2. Your Input Doesn't Match sklearn's `chi2` Requirements

Your code passes category labels (like 0-4 for x1_q, 0-14 for x2_q) directly as features to chi2, but this function expects non-negative frequency/count data. When you pass category labels, the calculation Y.T @ X ends up summing those label values per class. This is not a meaningful statistical measure, and it's why swapping X and y gives totally different results.

For example:

When you run fs.chi2(y=x1_q, X=x2_m), you're calculating how well the sum of x2's category labels can distinguish between x1's classes.
When you swap to fs.chi2(y=x2_q, X=x1_m), you're calculating how well the sum of x1's category labels can distinguish between x2's classes.
These are two entirely different tests with no reason to produce symmetric results.
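The asymmetry can be reproduced with a short script. This is a sketch under assumptions: the data setup copies the simulated example used in this answer, and x1_m / x2_m are assumed to be the label arrays reshaped into single-column matrices, since the original definitions are not shown here.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import chi2

# Correlated normal data, binned into 5 and 15 quantile categories
rng = np.random.RandomState(42)
x = rng.multivariate_normal([0, 0], [[1.0, 0.75], [0.75, 1.0]], 100)
x1_q = np.asarray(pd.qcut(x[:, 0], q=5, labels=False))
x2_q = np.asarray(pd.qcut(x[:, 1], q=15, labels=False))

# Assumption: the *_m variables are the labels as (n_samples, 1) matrices
x1_m = x1_q.reshape(-1, 1)
x2_m = x2_q.reshape(-1, 1)

# x2's labels as a "feature", x1's labels as classes
stat_12, p_12 = chi2(X=x2_m, y=x1_q)
# Roles swapped: x1's labels as a "feature", x2's labels as classes
stat_21, p_21 = chi2(X=x1_m, y=x2_q)

print("chi2(X=x2_m, y=x1_q):", stat_12[0], p_12[0])
print("chi2(X=x1_m, y=x2_q):", stat_21[0], p_21[0])
```

The two calls sum different label values over different sets of classes (5 classes in the first call, 15 in the second), so the statistics have no reason to agree.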

3. How to Get the Symmetric Chi-Squared Test You Expect

If you want to test the independence of two discrete variables (and get symmetric results), use scipy.stats.chi2_contingency instead. This function works with contingency tables, which is the standard approach for independence tests:

from scipy.stats import chi2_contingency
import numpy as np
import pandas as pd

n = 100
cov = [[1.0, 0.75], [0.75, 1.0]]
np.random.seed(42)
x = np.random.multivariate_normal([0,0], cov, n)
x1 = x[:,0]
x2 = x[:,1]

x1_q = pd.qcut(x1, q=5, labels=False)
x2_q = pd.qcut(x2, q=15, labels=False)

# Build contingency table for x1 vs x2
contingency_table = pd.crosstab(x1_q, x2_q)
chi2_stat, p_val, dof, expected = chi2_contingency(contingency_table)
print("Independence chi2 statistic:", chi2_stat)
print("p-value:", p_val)

# Swap rows and columns: the result is identical
contingency_table_reversed = pd.crosstab(x2_q, x1_q)
chi2_stat_rev, p_val_rev, dof_rev, expected_rev = chi2_contingency(contingency_table_reversed)
print("Reversed table chi2 statistic:", chi2_stat_rev)
print("Reversed table p-value:", p_val_rev)

Running this code shows that the chi-squared statistic is the same (up to floating-point rounding) when the variables are swapped. This is the symmetric behavior you expected initially.
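If you would still like to use sklearn's chi2 here, one possible approach is to give it the count-like input it expects by one-hot encoding the "feature" variable. In that case the observed matrix is the contingency table itself, so summing the per-column statistics should recover the independence statistic. This is a sketch under that assumption; the variable names are illustrative.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency
from sklearn.feature_selection import chi2

rng = np.random.RandomState(42)
x = rng.multivariate_normal([0, 0], [[1.0, 0.75], [0.75, 1.0]], 100)
x1_q = np.asarray(pd.qcut(x[:, 0], q=5, labels=False)).astype(int)
x2_q = np.asarray(pd.qcut(x[:, 1], q=15, labels=False)).astype(int)

# One-hot encode each variable: one 0/1 indicator column per category
x1_onehot = np.eye(x1_q.max() + 1)[x1_q]
x2_onehot = np.eye(x2_q.max() + 1)[x2_q]

# With indicator columns, Y.T @ X is the contingency table, so the
# per-column statistics add up to the Pearson chi-squared statistic.
sum_12 = chi2(X=x2_onehot, y=x1_q)[0].sum()
sum_21 = chi2(X=x1_onehot, y=x2_q)[0].sum()

# Reference value from the contingency-table test
ref_stat = chi2_contingency(pd.crosstab(x1_q, x2_q))[0]

print("one-hot x2 vs x1:", sum_12)
print("one-hot x1 vs x2:", sum_21)
print("chi2_contingency:", ref_stat)
```

The per-column p-values returned by chi2 are not the p-value of the independence test, so for the test itself chi2_contingency remains the more direct tool.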

The question behind this content comes from Stack Exchange and was asked by Peter Tessin.