How to Add Another Text Feature to Your Bag-of-Words Classification Task in Scikit-learn

阿华AIGC实验室

2026-5-26

Adding an extra text feature is straightforward once you know how to combine the processed features from each text column. Here's a step-by-step guide with code examples that fit right into your existing workflow:

Key Approach

Each text feature needs its own vectorization (CountVectorizer alone, or paired with TfidfTransformer for TF-IDF). The resulting feature matrices are then combined column-wise into a single input for your classifier. We’ll cover two methods: ColumnTransformer (the idiomatic scikit-learn approach) and a manual step-by-step method with hstack.

Method 1: Using ColumnTransformer (Idiomatic Scikit-learn)

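This method wraps all preprocessing and the classifier into a single pipeline. Because the vectorizers are fitted only when you call fit on the training data, this guards against data leakage and keeps your code organized.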

Step 1: Add Required Imports

Include these alongside your existing imports:

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

Step 2: Define Preprocessing for Each Text Feature

Assume your data has two text columns: Extract (your original feature) and Additional_Text (the new feature you want to add). Create a separate preprocessing pipeline for each, so that you can set parameters such as stop words or n-gram range per feature:

# Preprocessing for original text feature
text_preprocessor_1 = Pipeline(steps=[
    ('count', CountVectorizer(stop_words='english')),  # Adjust params as needed
    ('tfidf', TfidfTransformer())
])

# Preprocessing for new text feature
text_preprocessor_2 = Pipeline(steps=[
    ('count', CountVectorizer(ngram_range=(1,2))),  # Use different params here if useful
    ('tfidf', TfidfTransformer())
])
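As an optional shortcut, TfidfVectorizer combines CountVectorizer and TfidfTransformer in one estimator. The sketch below uses made-up sample texts (not from your data) to check that both routes give the same TF-IDF matrix when the parameters match:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)
from sklearn.pipeline import Pipeline

# Hypothetical sample texts, for illustration only
docs = [
    "office supplies invoice",
    "travel expenses for client meeting",
    "office rent invoice",
]

# Two-step route: raw counts first, then TF-IDF weighting
two_step = Pipeline(steps=[
    ('count', CountVectorizer(ngram_range=(1, 2))),
    ('tfidf', TfidfTransformer()),
])

# One-step route with the same n-gram setting
one_step = TfidfVectorizer(ngram_range=(1, 2))

X_two = two_step.fit_transform(docs)
X_one = one_step.fit_transform(docs)

# Compare the two sparse matrices after converting to dense arrays
same = np.allclose(X_two.toarray(), X_one.toarray())
print(X_two.shape, same)
```

Keeping the two steps separate is still useful if you want access to the raw counts; otherwise either form should work inside the ColumnTransformer.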

Step 3: Combine Preprocessors with ColumnTransformer

This applies each preprocessor to its target column and concatenates the results into one feature matrix:

preprocessor = ColumnTransformer(
    transformers=[
        ('text1', text_preprocessor_1, 'Extract'),
        ('text2', text_preprocessor_2, 'Additional_Text')  # Replace with your actual new column name
    ])
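To see what the concatenation produces, here is a minimal sketch on a made-up three-row dataframe. Only the column names come from the example above; the rows and the tfidf_pipeline helper are assumptions for illustration. The combined matrix should have one column per vocabulary entry from each text feature:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    'Extract': ["office supplies invoice",
                "flight to client site",
                "monthly office rent"],
    'Additional_Text': ["paid by card",
                        "paid by bank transfer",
                        "paid by card"],
})

def tfidf_pipeline(**count_params):
    """Hypothetical helper: CountVectorizer followed by TfidfTransformer."""
    return Pipeline(steps=[
        ('count', CountVectorizer(**count_params)),
        ('tfidf', TfidfTransformer()),
    ])

toy_preprocessor = ColumnTransformer(transformers=[
    # A plain string (not a list) selects the column as 1-D,
    # which is the input shape CountVectorizer expects
    ('text1', tfidf_pipeline(stop_words='english'), 'Extract'),
    ('text2', tfidf_pipeline(ngram_range=(1, 2)), 'Additional_Text'),
])

X_toy = toy_preprocessor.fit_transform(toy)

# Vocabulary sizes learned for each text column
n_text1 = len(toy_preprocessor.named_transformers_['text1']
              .named_steps['count'].vocabulary_)
n_text2 = len(toy_preprocessor.named_transformers_['text2']
              .named_steps['count'].vocabulary_)
print(X_toy.shape, n_text1, n_text2)
```

Note that passing ['Extract'] (a list) instead of 'Extract' would hand the vectorizer a 2-D dataframe, which text vectorizers do not handle as intended.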

Step 4: Build Full Pipeline with Classifier

Combine the preprocessor with your chosen classifier (e.g., MultinomialNB or RandomForestClassifier):

# Example with MultinomialNB
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', MultinomialNB())])

# Split your data (note: we pass the dataframe subset with both text columns now)
X_train, X_test, y_train, y_test = train_test_split(
    data[['Extract', 'Additional_Text']], 
    data['Expense Account code Description'], 
    random_state=42
)

# Train and evaluate
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
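Because the vectorizers live inside the pipeline, the same object can also be passed to cross-validation, where it is refitted on each training fold. The following is a sketch on made-up data; the rows, the label column name, and the tfidf_pipeline helper are assumptions, not part of your dataset:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 8 rows, 2 classes
toy = pd.DataFrame({
    'Extract': ["office supplies invoice", "flight to client site",
                "printer paper and toner", "hotel booking for conference",
                "monthly office rent", "taxi fare to airport",
                "desk chairs for office", "train ticket to client"],
    'Additional_Text': ["paid by card", "paid by bank transfer",
                        "paid by card", "reimbursed to employee",
                        "paid by bank transfer", "reimbursed to employee",
                        "paid by card", "reimbursed to employee"],
    'label': ["office", "travel", "office", "travel",
              "office", "travel", "office", "travel"],
})

def tfidf_pipeline(**count_params):
    """Hypothetical helper: CountVectorizer followed by TfidfTransformer."""
    return Pipeline(steps=[
        ('count', CountVectorizer(**count_params)),
        ('tfidf', TfidfTransformer()),
    ])

toy_clf = Pipeline(steps=[
    ('preprocessor', ColumnTransformer(transformers=[
        ('text1', tfidf_pipeline(stop_words='english'), 'Extract'),
        ('text2', tfidf_pipeline(ngram_range=(1, 2)), 'Additional_Text'),
    ])),
    ('classifier', MultinomialNB()),
])

# Each fold refits the vectorizers on that fold's training rows only
scores = cross_val_score(
    toy_clf,
    toy[['Extract', 'Additional_Text']],
    toy['label'],
    cv=2,
    scoring='accuracy',
)
print(scores.mean())
```

On real data you would likely use more folds (for example cv=5); cv=2 is used here only because the toy set is tiny.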

Method 2: Manual Feature Combination (Using hstack)

If you prefer a more explicit, step-by-step approach:

Step 1: Process Each Text Feature Separately

Critical: Fit vectorizers and transformers only on the training data to avoid data leakage. The code below reuses X_train and X_test from the train_test_split call in Method 1:

# Original feature processing
count_vec1 = CountVectorizer(stop_words='english')
X_train_count1 = count_vec1.fit_transform(X_train['Extract'])
X_test_count1 = count_vec1.transform(X_test['Extract'])

tfidf_transformer1 = TfidfTransformer()
X_train_tfidf1 = tfidf_transformer1.fit_transform(X_train_count1)
X_test_tfidf1 = tfidf_transformer1.transform(X_test_count1)

# New feature processing
count_vec2 = CountVectorizer(ngram_range=(1,2))
X_train_count2 = count_vec2.fit_transform(X_train['Additional_Text'])
X_test_count2 = count_vec2.transform(X_test['Additional_Text'])

tfidf_transformer2 = TfidfTransformer()
X_train_tfidf2 = tfidf_transformer2.fit_transform(X_train_count2)
X_test_tfidf2 = tfidf_transformer2.transform(X_test_count2)

Step 2: Combine Feature Matrices

Use scipy.sparse.hstack to efficiently concatenate the sparse text feature matrices:

from scipy.sparse import hstack

X_train_combined = hstack([X_train_tfidf1, X_train_tfidf2])
X_test_combined = hstack([X_test_tfidf1, X_test_tfidf2])
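A quick way to sanity-check the combination is to compare shapes. The sketch below uses made-up texts and plain counts for brevity; it shows that the combined matrix stays sparse and that its width is the sum of the two vocabulary sizes:

```python
from scipy.sparse import hstack, issparse
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sample texts, one list per text column
extract_train = ["office supplies invoice", "flight to client site"]
extra_train = ["paid by card", "paid by bank transfer"]

X1 = CountVectorizer().fit_transform(extract_train)
X2 = CountVectorizer().fit_transform(extra_train)

# hstack joins the matrices column-wise and keeps the result sparse;
# tocsr() converts it to a format that supports row slicing
X_combined = hstack([X1, X2]).tocsr()

print(X1.shape, X2.shape, X_combined.shape)
```

The row counts must match, so both matrices have to come from the same rows in the same order.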

Step 3: Train and Evaluate

Now use the combined matrices with your classifier:

clf = MultinomialNB()
clf.fit(X_train_combined, y_train)
y_pred = clf.predict(X_test_combined)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")

Important Tips

Avoid Data Leakage: Never fit preprocessing tools on your full dataset. Fit on the training set only, then transform both the train and test sets. The pipeline in Method 1 handles this automatically as long as you call fit on the training set only; predict then reuses the fitted vocabularies and IDF weights.
Customize Vectorizer Params: Tune parameters such as stop_words, ngram_range, or max_features for each text feature independently to improve model performance.
Sparse Matrices: Text vectorizers produce SciPy sparse matrices, so combine them with scipy.sparse.hstack. numpy’s hstack does not concatenate sparse matrices correctly, and converting them to dense arrays first can exhaust memory with large vocabularies.
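One way to tune per-feature parameters with the Method 1 pipeline is a grid search, addressing each nested step with double-underscore names. This is a sketch under assumptions: the toy rows, the label column, the tfidf_pipeline helper, and the parameter grid are all illustrative, and the step names match those used in Method 1:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 8 rows, 2 classes
toy = pd.DataFrame({
    'Extract': ["office supplies invoice", "flight to client site",
                "printer paper and toner", "hotel booking for conference",
                "monthly office rent", "taxi fare to airport",
                "desk chairs for office", "train ticket to client"],
    'Additional_Text': ["paid by card", "paid by bank transfer",
                        "paid by card", "reimbursed to employee",
                        "paid by bank transfer", "reimbursed to employee",
                        "paid by card", "reimbursed to employee"],
    'label': ["office", "travel", "office", "travel",
              "office", "travel", "office", "travel"],
})

def tfidf_pipeline():
    """Hypothetical helper: CountVectorizer followed by TfidfTransformer."""
    return Pipeline(steps=[
        ('count', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
    ])

search_clf = Pipeline(steps=[
    ('preprocessor', ColumnTransformer(transformers=[
        ('text1', tfidf_pipeline(), 'Extract'),
        ('text2', tfidf_pipeline(), 'Additional_Text'),
    ])),
    ('classifier', MultinomialNB()),
])

# Name pattern: <pipeline step>__<transformer name>__<inner step>__<parameter>
param_grid = {
    'preprocessor__text1__count__stop_words': [None, 'english'],
    'preprocessor__text2__count__ngram_range': [(1, 1), (1, 2)],
}

search = GridSearchCV(search_clf, param_grid, cv=2, scoring='accuracy')
search.fit(toy[['Extract', 'Additional_Text']], toy['label'])
print(search.best_params_)
```

Since the search refits the whole pipeline on each training fold, the tuned vectorizers never see the held-out rows.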

The question comes from Stack Exchange; it was asked by Ayush Agrawal.