
How to Add Another Text Feature to Your Bag-of-Words Classification Task in Scikit-learn

Adding an extra text feature is straightforward once you know how to combine the processed features from each text column. Here's a step-by-step guide with code examples that fit right into your existing workflow:

Key Approach

Each text feature needs its own vectorization (whether using CountVectorizer alone or paired with TfidfTransformer for TF-IDF), then we combine these feature matrices into a single input for your classifier. We’ll cover two reliable methods: using ColumnTransformer (scikit-learn’s recommended, clean approach) and a manual step-by-step method with hstack.


Method 1: Using ColumnTransformer (Idiomatic Scikit-learn)

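This method wraps all preprocessing into a pipeline, which keeps your code organized and helps prevent data leakage: calling fit on the pipeline learns the vocabularies and IDF weights only from the data you pass in, so they come from the training set alone as long as you fit on the training split.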

Step 1: Add Required Imports

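Include these alongside your existing imports. The later steps assume you already import CountVectorizer and TfidfTransformer from sklearn.feature_extraction.text, MultinomialNB from sklearn.naive_bayes, train_test_split from sklearn.model_selection, and accuracy_score from sklearn.metrics: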

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

Step 2: Define Preprocessing for Each Text Feature

Assume your data has two text columns: Extract (your original feature) and Additional_Text (the new feature you want to add). We’ll create a separate preprocessing pipeline for each, so you can customize parameters like stop words or n-grams per feature:

# Preprocessing for original text feature
text_preprocessor_1 = Pipeline(steps=[
    ('count', CountVectorizer(stop_words='english')),  # Adjust params as needed
    ('tfidf', TfidfTransformer())
])

# Preprocessing for new text feature
text_preprocessor_2 = Pipeline(steps=[
    ('count', CountVectorizer(ngram_range=(1,2))),  # Use different params here if useful
    ('tfidf', TfidfTransformer())
])
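
As a side note, the CountVectorizer plus TfidfTransformer pair used above can also be written as a single TfidfVectorizer, which scikit-learn documents as equivalent. The sketch below checks this on three made-up sentences (the docs list is illustrative only and not taken from your data):

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)
from sklearn.pipeline import Pipeline

# Made-up documents, for illustration only
docs = ["taxi fare to the airport", "office paper and pens", "airport parking fee"]

# Two-step version, as in the preprocessing pipelines above
two_step = Pipeline(steps=[
    ('count', CountVectorizer(stop_words='english')),
    ('tfidf', TfidfTransformer()),
])

# One-step version with the same vectorizer settings
one_step = TfidfVectorizer(stop_words='english')

X_two = two_step.fit_transform(docs)
X_one = one_step.fit_transform(docs)

# Both produce the same TF-IDF matrix (same vocabulary order, same weights)
same = np.allclose(X_two.toarray(), X_one.toarray())
print(same)
```

Either form works inside ColumnTransformer; the two-step version is kept in this guide because it matches the original workflow.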

Step 3: Combine Preprocessors with ColumnTransformer

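This applies each preprocessor to its target column and concatenates the results into one feature matrix. Pass each column name as a plain string (not a one-element list) so the vectorizer receives a 1-D column of documents, which is what CountVectorizer expects: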

preprocessor = ColumnTransformer(
    transformers=[
        ('text1', text_preprocessor_1, 'Extract'),
        ('text2', text_preprocessor_2, 'Additional_Text')  # Replace with your actual new column name
    ])
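
To see what the concatenation produces, here is a minimal sketch on a made-up three-row DataFrame (the toy data is an assumption for illustration; plain CountVectorizer is used to keep it short). The combined matrix should have one column per vocabulary term from each text column:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer

# Made-up rows, for illustration only
toy = pd.DataFrame({
    'Extract': ["taxi to airport", "printer paper", "airport parking"],
    'Additional_Text': ["travel", "office supplies", "travel"],
})

ct = ColumnTransformer(transformers=[
    ('text1', CountVectorizer(), 'Extract'),          # scalar string -> 1-D input
    ('text2', CountVectorizer(), 'Additional_Text'),
])
X = ct.fit_transform(toy)

# Vocabulary sizes learned independently for each column
n1 = len(ct.named_transformers_['text1'].vocabulary_)
n2 = len(ct.named_transformers_['text2'].vocabulary_)

# Number of columns in X equals n1 + n2
print(X.shape, n1, n2)
```

Each column gets its own vocabulary, so a word that appears in both columns yields two separate features.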

Step 4: Build Full Pipeline with Classifier

Combine the preprocessor with your chosen classifier (e.g., MultinomialNB or RandomForestClassifier):

# Example with MultinomialNB
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', MultinomialNB())])

# Split your data (note: we pass the dataframe subset with both text columns now)
X_train, X_test, y_train, y_test = train_test_split(
    data[['Extract', 'Additional_Text']], 
    data['Expense Account code Description'], 
    random_state=42
)

# Train and evaluate
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
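
If you want to try the whole of Method 1 without your own dataset, the following self-contained sketch may help. The rows, the label column name, and the make_text_pipeline helper are all hypothetical and exist only for this example:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up rows; 'label' stands in for your real target column
data = pd.DataFrame({
    'Extract': ["taxi to airport", "flight to london", "printer paper",
                "box of pens", "train ticket", "stapler and staples"],
    'Additional_Text': ["trip", "trip", "desk", "desk", "trip", "desk"],
    'label': ["Travel", "Travel", "Office", "Office", "Travel", "Office"],
})

def make_text_pipeline(**count_kwargs):
    """Hypothetical helper: CountVectorizer followed by TfidfTransformer."""
    return Pipeline(steps=[
        ('count', CountVectorizer(**count_kwargs)),
        ('tfidf', TfidfTransformer()),
    ])

preprocessor = ColumnTransformer(transformers=[
    ('text1', make_text_pipeline(stop_words='english'), 'Extract'),
    ('text2', make_text_pipeline(ngram_range=(1, 2)), 'Additional_Text'),
])

clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', MultinomialNB())])
clf.fit(data[['Extract', 'Additional_Text']], data['label'])

# Predict on an unseen row; it must carry the same two text columns
new_rows = pd.DataFrame({'Extract': ["bus to airport"],
                         'Additional_Text': ["trip"]})
pred = clf.predict(new_rows)
print(pred)
```

Because the fitted pipeline stores both vectorizers, new data only needs to be a DataFrame with the same column names; no manual transform calls are required.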

Method 2: Manual Feature Combination (Using hstack)

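If you prefer a more explicit, step-by-step approach, you can vectorize each column yourself. The code below reuses the X_train, X_test, y_train, and y_test split from Method 1: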

Step 1: Process Each Text Feature Separately

Critical: Fit vectorizers and transformers only on the training data to avoid data leakage:

# Original feature processing
count_vec1 = CountVectorizer(stop_words='english')
X_train_count1 = count_vec1.fit_transform(X_train['Extract'])
X_test_count1 = count_vec1.transform(X_test['Extract'])

tfidf_transformer1 = TfidfTransformer()
X_train_tfidf1 = tfidf_transformer1.fit_transform(X_train_count1)
X_test_tfidf1 = tfidf_transformer1.transform(X_test_count1)

# New feature processing
count_vec2 = CountVectorizer(ngram_range=(1,2))
X_train_count2 = count_vec2.fit_transform(X_train['Additional_Text'])
X_test_count2 = count_vec2.transform(X_test['Additional_Text'])

tfidf_transformer2 = TfidfTransformer()
X_train_tfidf2 = tfidf_transformer2.fit_transform(X_train_count2)
X_test_tfidf2 = tfidf_transformer2.transform(X_test_count2)

Step 2: Combine Feature Matrices

Use scipy.sparse.hstack to efficiently concatenate the sparse text feature matrices:

from scipy.sparse import hstack

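# format='csr' returns CSR matrices, which support row slicing (the default result is COO)
X_train_combined = hstack([X_train_tfidf1, X_train_tfidf2], format='csr')
X_test_combined = hstack([X_test_tfidf1, X_test_tfidf2], format='csr')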
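
As a quick sanity check of what hstack does, this small sketch stacks two count matrices built from made-up text (the col_a and col_b lists are illustrative assumptions). The result keeps the same number of rows, and its column count is the sum of the two inputs:

```python
from scipy.sparse import hstack, issparse
from sklearn.feature_extraction.text import CountVectorizer

# Made-up text for two columns, two rows each
col_a = ["taxi to airport", "printer paper"]
col_b = ["travel", "office supplies"]

A = CountVectorizer().fit_transform(col_a)
B = CountVectorizer().fit_transform(col_b)

# Stack side by side; rows must line up one-to-one between A and B
combined = hstack([A, B], format='csr')

print(A.shape, B.shape, combined.shape, issparse(combined))
```

The row order of both matrices has to match, so vectorize both columns from the same X_train (or X_test) without shuffling in between.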

Step 3: Train and Evaluate

Now use the combined matrices with your classifier:

clf = MultinomialNB()
clf.fit(X_train_combined, y_train)
y_pred = clf.predict(X_test_combined)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")

Important Tips

  • Avoid Data Leakage: Never fit preprocessing tools on your full dataset. Fit them on the training set only, then use them to transform both the train and test sets. The ColumnTransformer pipeline does this for you, provided you call fit on the training split only.
  • Customize Vectorizer Params: Tweak parameters like stop_words, ngram_range, or max_features for each text feature independently to optimize model performance.
  • Sparse Matrices: Text vectorizers produce sparse matrices, so combine them with scipy.sparse.hstack rather than numpy.hstack, which is designed for dense arrays and does not handle sparse matrices properly. This keeps the combined matrix sparse and memory-efficient.
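
The first two tips can be combined: because Method 1 is a single pipeline, you can tune each column's vectorizer with GridSearchCV, and every cross-validation fold refits the vectorizers on that fold's training part only. The sketch below uses made-up rows and a hypothetical label column; the parameter names follow scikit-learn's step__substep__param convention for the step names used in this guide:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up rows; 'label' stands in for your real target column
data = pd.DataFrame({
    'Extract': ["taxi to airport", "flight to london", "train ticket to paris",
                "hotel booking", "airport parking fee", "bus fare",
                "printer paper", "box of pens", "stapler and staples",
                "desk lamp", "ink cartridge", "paper clips"],
    'Additional_Text': ["business trip"] * 6 + ["desk supplies"] * 6,
    'label': ["Travel"] * 6 + ["Office"] * 6,
})

def text_steps():
    # Same two-step preprocessing as in Method 1
    return Pipeline(steps=[('count', CountVectorizer()),
                           ('tfidf', TfidfTransformer())])

pipe = Pipeline(steps=[
    ('preprocessor', ColumnTransformer(transformers=[
        ('text1', text_steps(), 'Extract'),
        ('text2', text_steps(), 'Additional_Text'),
    ])),
    ('classifier', MultinomialNB()),
])

# Tune each column's vectorizer independently
param_grid = {
    'preprocessor__text1__count__stop_words': [None, 'english'],
    'preprocessor__text2__count__ngram_range': [(1, 1), (1, 2)],
}

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(data[['Extract', 'Additional_Text']], data['label'])
print(search.best_params_)
```

On a dataset this small the chosen parameters are not meaningful; the point is only the parameter naming and the leak-free fitting inside each fold.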

The question originally comes from Stack Exchange; asked by Ayush Agrawal.
