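How to One-Hot Encode a List-Valued Column with Category Confidences and Feed the Confidence Information into a Decision Tree Classifier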

阿华AIGC实验室

2026-4-30

Efficiently Encode Categories with Confidence Scores for Decision Trees

You don't need to generate a standard one-hot encoding and then manually overwrite its values. pandas can transform your category-with-confidence data directly into the format you need with very little code. Here's one clean approach:

Step 1: Prepare Your Data

First, let's load your sample data into a pandas DataFrame (adjust this to match your actual data source):

import pandas as pd

# Sample input data
data = {
    '行号': [0, 1, 2],
    'categories': [
        [{'a': 0.3}, {'b': 0.4}, {'c': 0.5}],
        [{'c': 0.8}],
        [{'b': 1.0}, {'c': 1.0}, {'e': 0.1}]
    ]
}
df = pd.DataFrame(data)

Step 2: Create a Helper Function to Flatten Category Confidences

We'll write a simple function to merge the list of small dictionaries in each row into a single dictionary mapping categories to their confidence scores:

def flatten_confidence_categories(cat_list):
    # Merge all dictionaries in the list into one key-value map
    return {key: value for item in cat_list for key, value in item.items()}

The nested comprehension unpacks each small dictionary and combines them into one, with no explicit loops. If a category appears more than once in a row, the last value wins.
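As a quick check, here is the helper applied to the first sample row and to a row with a repeated category. The `flatten_keep_max` variant is a hypothetical addition for illustration, not part of the original answer; it assumes you would rather keep the highest confidence when a category repeats.

```python
def flatten_confidence_categories(cat_list):
    # Merge all single-entry dictionaries in the list into one mapping
    return {key: value for item in cat_list for key, value in item.items()}

row = [{'a': 0.3}, {'b': 0.4}, {'c': 0.5}]
flat = flatten_confidence_categories(row)
print(flat)  # {'a': 0.3, 'b': 0.4, 'c': 0.5}

# A repeated category keeps only the value that appears last in the list
dup = flatten_confidence_categories([{'a': 0.2}, {'a': 0.9}, {'a': 0.5}])
print(dup)  # {'a': 0.5}

# Hypothetical variant: keep the highest confidence instead of the last one
def flatten_keep_max(cat_list):
    merged = {}
    for item in cat_list:
        for key, value in item.items():
            merged[key] = max(value, merged.get(key, value))
    return merged

best = flatten_keep_max([{'a': 0.2}, {'a': 0.9}, {'a': 0.5}])
print(best)  # {'a': 0.9}
```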

Step 3: Transform and Encode the Data

Now we'll apply this function to the categories column, expand the resulting dictionaries into separate columns, and fill missing values (categories not present in a row) with 0:

# Apply the helper function and expand to columns
encoded_features = df['categories'].apply(lambda x: pd.Series(flatten_confidence_categories(x))).fillna(0)

# Combine with the original "行号" column to get your final output
final_df = pd.concat([df['行号'], encoded_features], axis=1)

# Optional: Reorder columns to match your desired output structure
final_df = final_df[['行号', 'a', 'b', 'c', 'e']]
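On larger frames, building a `pd.Series` for every row tends to be the slow part of this step. A possible alternative, sketched below under the assumption that every row holds a list of single-entry dictionaries as in the sample data, is to flatten all rows first and hand the resulting list of dictionaries to the `pd.DataFrame` constructor in one call. The names `records`, `encoded_fast`, and `final_fast` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    '行号': [0, 1, 2],
    'categories': [
        [{'a': 0.3}, {'b': 0.4}, {'c': 0.5}],
        [{'c': 0.8}],
        [{'b': 1.0}, {'c': 1.0}, {'e': 0.1}],
    ],
})

def flatten_confidence_categories(cat_list):
    return {key: value for item in cat_list for key, value in item.items()}

# Flatten every row up front, then let the DataFrame constructor align the keys
records = [flatten_confidence_categories(cats) for cats in df['categories']]
encoded_fast = pd.DataFrame(records, index=df.index).fillna(0.0)

# Sort columns so the order does not depend on which row mentions a category first
encoded_fast = encoded_fast.sort_index(axis=1)

final_fast = pd.concat([df['行号'], encoded_fast], axis=1)
print(final_fast)
```

Sorting the columns also removes the need to hard-code the column list when the set of categories is not known in advance.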

Step 4: Check the Result

If you print final_df, you'll get exactly the format you wanted:

行号    a    b    c    e
0    0  0.3  0.4  0.5  0.0
1    1  0.0  0.0  0.8  0.0
2    2  0.0  1.0  1.0  0.1

Why This Works Better

Efficiency: This builds each row's features in one pass, avoiding the extra step of generating a standard one-hot encoding first. On large datasets, though, calling pd.Series once per row can become a bottleneck.
Auto-detection: It automatically identifies all unique categories across your dataset, so there is no need to manually specify which columns to create.
Clean code: The helper function is reusable, and the overall logic is easy to read and maintain.
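One consequence of auto-detection is that the columns depend on whichever categories happen to appear in the data being encoded, so new data may produce a different column set than the training data. A minimal sketch of one way to handle this, assuming you store the training column list and align later data to it with `reindex`; the `encode` helper and the category `'z'` are hypothetical:

```python
import pandas as pd

def encode(cat_lists, columns=None):
    # Hypothetical helper: flatten each row, then optionally force a fixed column set
    records = [{k: v for item in cats for k, v in item.items()} for cats in cat_lists]
    encoded = pd.DataFrame(records).fillna(0.0)
    if columns is not None:
        # Unseen categories are dropped; missing ones become all-zero columns
        encoded = encoded.reindex(columns=columns, fill_value=0.0)
    return encoded

train = encode([[{'a': 0.3}, {'b': 0.4}], [{'c': 0.8}]])
train_columns = sorted(train.columns)
train = train[train_columns]

# New data mentions 'z' (never seen in training) and lacks 'a' and 'c'
new = encode([[{'b': 0.9}, {'z': 0.7}]], columns=train_columns)
print(new)
```

Whether silently dropping an unseen category is acceptable depends on your application; logging it may be the safer choice.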

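This format feeds well into a decision tree classifier: the confidence values are kept as continuous features rather than binary indicators, so the tree can split on confidence thresholds. Keep in mind that a filled-in 0 means "category absent", which the tree cannot distinguish from a reported confidence of 0.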
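To illustrate the last step, here is a minimal sketch of fitting scikit-learn's `DecisionTreeClassifier` on the encoded columns. The labels in `y` are invented for illustration only, since the original question does not show a target column, and the row-number column is left out because it is an identifier rather than a feature.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Encoded confidence features from Step 4, without the row-number column
X = pd.DataFrame({
    'a': [0.3, 0.0, 0.0],
    'b': [0.4, 0.0, 1.0],
    'c': [0.5, 0.8, 1.0],
    'e': [0.0, 0.0, 0.1],
})

# Hypothetical labels, made up for this sketch
y = ['x', 'y', 'x']

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, y)

# Each internal node of the tree tests one confidence column against a threshold
preds = list(clf.predict(X))
print(preds)
```

With only three rows this shows the mechanics, not model quality; on real data you would evaluate on a held-out set.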

The question comes from Stack Exchange; it was asked by john.