How exactly does feature hashing work?

阿华AIGC实验室

2026-5-25

Hey there, let's break down feature hashing (also called the hashing trick) using your exact example, which should make the concept click. It can be confusing at first, but seeing it applied to your own data helps.

First, let's restate your data and partial function clearly:

Your Sample Data

>>> data
   pop      state  year
0  1.5      Ohio  2000
1  1.7      Ohio  2001
2  3.6  New York  2002
3  2.4    Nevada  2001
4  2.9    Nevada  2002
5  1.8    Oregon  2003

Your Partial Hash Function

def hash_col(df, col, N):
    cols = [col + "_" + str(i) for i in range(N)]
    # The rest of the logic handles hashing values and mapping to these columns

What's Feature Hashing, Anyway?

Let's start with the problem it solves. If you one-hot encoded the state column, you'd get 4 new columns (one per unique state: Ohio, New York, Nevada, Oregon). But if the column later grew to 10,000 unique values, you'd end up with 10,000 columns. That runs into the "curse of dimensionality" and hurts both memory use and model performance.

Feature hashing fixes this by mapping high-dimensional categorical features to a fixed, low-dimensional space using a hash function. Instead of 4 columns, you could map to 2 or 3 columns, and that number stays the same no matter how many unique states you add later.
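To make the difference in width concrete, here is a small dependency-free sketch. The helpers one_hot_encode and hashed_encode are made-up names for illustration, the 10,000 categories are synthetic, and zlib.crc32 is just one convenient stable hash rather than anything your function is known to use.

```python
import zlib

def one_hot_encode(value, vocabulary):
    # One-hot needs one slot per distinct category, so width == len(vocabulary).
    vec = [0] * len(vocabulary)
    vec[vocabulary.index(value)] = 1
    return vec

def hashed_encode(value, n_buckets):
    # Feature hashing always produces n_buckets slots, whatever the vocabulary size.
    vec = [0] * n_buckets
    vec[zlib.crc32(value.encode("utf-8")) % n_buckets] = 1
    return vec

states = ["Ohio", "Ohio", "New York", "Nevada", "Nevada", "Oregon"]
many = [f"state_{i}" for i in range(10_000)]  # synthetic categories

small_vocab = sorted(set(states))
large_vocab = sorted(set(many))

print(len(one_hot_encode("Ohio", small_vocab)), len(hashed_encode("Ohio", 8)))    # 4 8
print(len(one_hot_encode(many[0], large_vocab)), len(hashed_encode(many[0], 8)))  # 10000 8
```

The one-hot width follows the vocabulary, while the hashed width is whatever you pass in as n_buckets.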

How It Works With Your Data

Let's finish that hash_col function to make it concrete. Here's the full logic it's likely implementing:

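For each value in the target column (like "Ohio" in state), run it through a hash function to get a large integer. Python's built-in hash() works within a single run, but its output for strings is randomized per process by default, so a stable hash (MurmurHash3, or one from hashlib) is the safer choice if the mapping has to be reproducible across runs.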
Take that integer modulo N (the number of low-dimensional columns you want) to get an index between 0 and N-1.
Create N new columns, and for each row, set the column at the hashed index to 1 (or a weight if you're using weighted features).
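The three steps above can be sketched as three small functions. This is a minimal illustration, not necessarily what your hash_col does: the names stable_hash, bucket_index, and hashed_vector are invented here, and SHA-256 from hashlib is an assumption chosen only because it gives the same result on every run.

```python
import hashlib

def stable_hash(value):
    # Step 1: turn the value into a large integer that is identical across runs.
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def bucket_index(value, n_buckets):
    # Step 2: reduce the integer to an index in the range 0..n_buckets-1.
    return stable_hash(value) % n_buckets

def hashed_vector(value, n_buckets):
    # Step 3: set the slot at the hashed index to 1, all others to 0.
    vec = [0] * n_buckets
    vec[bucket_index(value, n_buckets)] = 1
    return vec

for state in ["Ohio", "New York", "Nevada", "Oregon"]:
    print(state, bucket_index(state, 2), hashed_vector(state, 2))
```

Which state lands in which bucket depends entirely on the hash function, so the indices printed here need not match the made-up numbers used in the walkthrough.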

Let's pick N=2 for your data and walk through the steps:

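The hash values below are made up to keep the arithmetic easy to follow; a real hash function returns different (and much larger) numbers.

For "Ohio": suppose hash("Ohio") returns 12345. 12345 % 2 = 1 → set state_1 to 1, state_0 to 0.
For "New York": suppose hash("New York") returns 67890. 67890 % 2 = 0 → set state_0 to 1, state_1 to 0.
For "Nevada": suppose hash("Nevada") returns 11223. 11223 % 2 = 1 → set state_1 to 1, state_0 to 0.
For "Oregon": suppose hash("Oregon") returns 44556. 44556 % 2 = 0 → set state_0 to 1, state_1 to 0.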

Your transformed data would look like this:

   pop  year  state_0  state_1
0  1.5  2000        0        1
1  1.7  2001        0        1
2  3.6  2002        1        0
3  2.4  2001        0        1
4  2.9  2002        0        1
5  1.8  2003        1        0
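For a runnable version of the whole transformation, here is a sketch that works on plain dictionaries instead of a DataFrame so it has no dependencies. The function name hash_col_rows is hypothetical, and zlib.crc32 stands in for whatever hash your real function uses. Because the hash values in the walkthrough were invented, the columns this prints may not match the table above.

```python
import zlib

def hash_col_rows(rows, col, n_buckets):
    """Return new rows where `col` is replaced by n_buckets indicator columns."""
    names = [f"{col}_{i}" for i in range(n_buckets)]
    out = []
    for row in rows:
        # Copy every field except the one being hashed.
        new_row = {k: v for k, v in row.items() if k != col}
        # Stable hash of the category, reduced to a bucket index.
        idx = zlib.crc32(str(row[col]).encode("utf-8")) % n_buckets
        for i, name in enumerate(names):
            new_row[name] = 1 if i == idx else 0
        out.append(new_row)
    return out

data = [
    {"pop": 1.5, "state": "Ohio", "year": 2000},
    {"pop": 1.7, "state": "Ohio", "year": 2001},
    {"pop": 3.6, "state": "New York", "year": 2002},
    {"pop": 2.4, "state": "Nevada", "year": 2001},
    {"pop": 2.9, "state": "Nevada", "year": 2002},
    {"pop": 1.8, "state": "Oregon", "year": 2003},
]

result = hash_col_rows(data, "state", 2)
for row in result:
    print(row)
```

Whatever the actual bucket assignment turns out to be, rows with the same state always get the same indicator columns, and each row has exactly one 1 among them.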

The Big Trade-off: Hash Collisions

You might notice that "Ohio" and "Nevada" both mapped to state_1, and "New York" and "Oregon" both mapped to state_0. Each of these is a hash collision, where two different values land in the same low-dimensional column. Once that happens, the model can no longer tell the colliding values apart.

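The good news? In many cases, models can still learn meaningful patterns despite some collisions, as long as N is large enough to keep them infrequent. There is no universal rule for N: it is a tuning parameter that trades memory against collision rate. A power of two is a common choice, and for large vocabularies it is often set well above the number of distinct categories (scikit-learn's FeatureHasher, for example, defaults to 2**20 features). Setting N below the number of categories, as with N=2 for four states here, guarantees collisions and is mainly useful for illustration.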
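To get a feel for how N affects collisions, you can count them directly. This is a rough sketch on synthetic category names, using zlib.crc32 as an assumed stable hash; count_collisions is a name invented here, and the exact counts will vary with the hash function and the data.

```python
import zlib

def count_collisions(categories, n_buckets):
    # A collision means a category shares its bucket with at least one other
    # category, so we count distinct categories minus distinct occupied buckets.
    unique = set(categories)
    buckets = {zlib.crc32(c.encode("utf-8")) % n_buckets for c in unique}
    return len(unique) - len(buckets)

categories = [f"category_{i}" for i in range(1000)]  # synthetic data

for n in (64, 1024, 2**14, 2**18):
    print(n, count_collisions(categories, n))
```

With 1,000 categories and only 64 buckets, at least 936 collisions are unavoidable by the pigeonhole principle. Because each N in this list divides the next, the count can only stay the same or drop as N grows.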

Why Your Partial Function Makes Sense

Your line cols = [col + "_" + str(i)... is just creating names for the new low-dimensional columns (like state_0, state_1). The rest of the function would handle hashing each value and populating those columns with 1s (or weights) based on the modulo result.

Hope that clears up the confusion. Feel free to ask if you want to dive deeper into any part!

The question comes from Stack Exchange; it was asked by Stanleyrr.