How do I add a new text feature to a bag-of-words classification task in scikit-learn?
Adding an extra text feature is straightforward once you know how to combine the processed features from each text column. Here's a step-by-step guide with code examples that fit right into your existing workflow:
Key Approach
Each text feature needs its own vectorization (whether using CountVectorizer alone or paired with TfidfTransformer for TF-IDF), then we combine these feature matrices into a single input for your classifier. We’ll cover two reliable methods: using ColumnTransformer (scikit-learn’s recommended, clean approach) and a manual step-by-step method with hstack.
Method 1: Using ColumnTransformer (Idiomatic Scikit-learn)
This method wraps all preprocessing into a pipeline, which helps prevent data leakage (the vectorizers are fitted only on the data passed to fit) and keeps your code organized.
Step 1: Add Required Imports
Include these alongside your existing imports:
    from sklearn.compose import ColumnTransformer
    from sklearn.pipeline import Pipeline
Step 2: Define Preprocessing for Each Text Feature
Assume your data has two text columns: Extract (your original feature) and Additional_Text (the new feature you want to add). We’ll create a separate preprocessing pipeline for each, so you can customize parameters like stop words or n-grams per feature:

    # Preprocessing for original text feature
    text_preprocessor_1 = Pipeline(steps=[
        ('count', CountVectorizer(stop_words='english')),  # Adjust params as needed
        ('tfidf', TfidfTransformer())
    ])

    # Preprocessing for new text feature
    text_preprocessor_2 = Pipeline(steps=[
        ('count', CountVectorizer(ngram_range=(1, 2))),  # Use different params here if useful
        ('tfidf', TfidfTransformer())
    ])
Step 3: Combine Preprocessors with ColumnTransformer
This applies each preprocessor to its target column and concatenates the results into one feature matrix:
    preprocessor = ColumnTransformer(
        transformers=[
            ('text1', text_preprocessor_1, 'Extract'),
            ('text2', text_preprocessor_2, 'Additional_Text')  # Replace with your actual new column name
        ])
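To see what the concatenation does, here is a minimal sketch on made-up rows (only the column names come from the example above; the names toy and toy_preprocessor are illustrative). It uses plain CountVectorizer for brevity and checks that the combined matrix is as wide as both vocabularies together. Note that each column is selected with a plain string, not a list, so the vectorizer receives a 1-D sequence of documents.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer

# Made-up rows; only the column names mirror the example above.
toy = pd.DataFrame({
    "Extract": ["office chairs", "flight to london", "printer paper"],
    "Additional_Text": ["furniture order", "travel booking", "stationery order"],
})

toy_preprocessor = ColumnTransformer(transformers=[
    ("text1", CountVectorizer(), "Extract"),          # string selector -> 1-D input
    ("text2", CountVectorizer(), "Additional_Text"),
])

features = toy_preprocessor.fit_transform(toy)

# Each fitted vectorizer is reachable by the name given in `transformers`.
n_text1 = len(toy_preprocessor.named_transformers_["text1"].vocabulary_)
n_text2 = len(toy_preprocessor.named_transformers_["text2"].vocabulary_)

# The combined width is the sum of the two vocabulary sizes.
print(features.shape, n_text1, n_text2)
```

The columns from text1 come first, followed by those from text2, in the order the transformers are listed.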
Step 4: Build Full Pipeline with Classifier
Combine the preprocessor with your chosen classifier (e.g., MultinomialNB or RandomForestClassifier):

    # Example with MultinomialNB
    clf = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('classifier', MultinomialNB())
    ])

    # Split your data (note: we pass the dataframe subset with both text columns now)
    X_train, X_test, y_train, y_test = train_test_split(
        data[['Extract', 'Additional_Text']],
        data['Expense Account code Description'],
        random_state=42
    )

    # Train and evaluate
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
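Because the vectorizers live inside the pipeline, the whole thing can also be cross-validated, and the preprocessing is refitted on the training part of every fold. The following is a rough sketch on made-up rows and labels, not your real data. It swaps in TfidfVectorizer, which is equivalent to CountVectorizer followed by TfidfTransformer, purely to keep the example short.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up rows and labels; only the column names come from the example above.
toy = pd.DataFrame({
    "Extract": ["office chairs", "desk lamp", "flight to london",
                "train to paris", "office desk", "flight to rome"],
    "Additional_Text": ["furniture order", "furniture order", "travel booking",
                        "travel booking", "furniture order", "travel booking"],
})
labels = ["Furniture", "Furniture", "Travel", "Travel", "Furniture", "Travel"]

toy_clf = Pipeline(steps=[
    ("preprocessor", ColumnTransformer(transformers=[
        ("text1", TfidfVectorizer(), "Extract"),
        ("text2", TfidfVectorizer(), "Additional_Text"),
    ])),
    ("classifier", MultinomialNB()),
])

# Each of the 3 folds fits the vectorizers on its own training rows only.
scores = cross_val_score(toy_clf, toy, labels, cv=3)
print(scores.mean())
```

On a real dataset you would pass data[['Extract', 'Additional_Text']] and your label column instead of the toy frame.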
Method 2: Manual Feature Combination (Using hstack)
If you prefer a more explicit, step-by-step approach:
Step 1: Process Each Text Feature Separately
Critical: Fit vectorizers and transformers only on the training data to avoid data leakage:
    # Original feature processing
    count_vec1 = CountVectorizer(stop_words='english')
    X_train_count1 = count_vec1.fit_transform(X_train['Extract'])
    X_test_count1 = count_vec1.transform(X_test['Extract'])

    tfidf_transformer1 = TfidfTransformer()
    X_train_tfidf1 = tfidf_transformer1.fit_transform(X_train_count1)
    X_test_tfidf1 = tfidf_transformer1.transform(X_test_count1)

    # New feature processing
    count_vec2 = CountVectorizer(ngram_range=(1, 2))
    X_train_count2 = count_vec2.fit_transform(X_train['Additional_Text'])
    X_test_count2 = count_vec2.transform(X_test['Additional_Text'])

    tfidf_transformer2 = TfidfTransformer()
    X_train_tfidf2 = tfidf_transformer2.fit_transform(X_train_count2)
    X_test_tfidf2 = tfidf_transformer2.transform(X_test_count2)
Step 2: Combine Feature Matrices
Use scipy.sparse.hstack to efficiently concatenate the sparse text feature matrices:
    from scipy.sparse import hstack

    X_train_combined = hstack([X_train_tfidf1, X_train_tfidf2])
    X_test_combined = hstack([X_test_tfidf1, X_test_tfidf2])
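As a small illustration of what hstack does, the sketch below stacks two hand-made sparse blocks (stand-ins for the two TF-IDF matrices; the values are invented). Both blocks must have the same number of rows, and the columns of the second block are appended after those of the first. Passing format="csr" is optional; it requests a CSR result, which supports row slicing.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Two small sparse blocks with the same number of rows (made-up values).
block_a = csr_matrix(np.array([[1.0, 0.0],
                               [0.0, 2.0]]))
block_b = csr_matrix(np.array([[0.0, 3.0, 0.0],
                               [4.0, 0.0, 0.0]]))

# Columns of block_b are appended to the right of block_a.
combined = hstack([block_a, block_b], format="csr")

print(combined.shape)
print(combined.toarray())
```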
Step 3: Train and Evaluate
Now use the combined matrices with your classifier:
    clf = MultinomialNB()
    clf.fit(X_train_combined, y_train)
    y_pred = clf.predict(X_test_combined)
    print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
Important Tips
- Avoid Data Leakage: Never fit preprocessing tools on your full dataset. Fit them on the training set only, then use them to transform both the train and test sets. The Pipeline in Method 1 does this for you, provided you call fit on the training split only.
- Customize Vectorizer Params: Tweak parameters like stop_words, ngram_range, or max_features for each text feature independently to optimize model performance.
- Sparse Matrices: Text features produce sparse matrices, so combine them with scipy.sparse.hstack. NumPy’s regular hstack expects dense arrays and does not concatenate SciPy sparse matrices correctly.
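With the Method 1 layout, per-feature parameters can also be changed after the pipeline is built, using scikit-learn's double-underscore parameter names. This is a minimal sketch, assuming the step names preprocessor, text1, text2, and count from the example above; make_text_pipeline and tunable_clf are illustrative names, and the max_features value is arbitrary.

```python
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def make_text_pipeline():
    """Hypothetical helper: one count + TF-IDF pipeline per text column."""
    return Pipeline(steps=[("count", CountVectorizer()),
                           ("tfidf", TfidfTransformer())])


tunable_clf = Pipeline(steps=[
    ("preprocessor", ColumnTransformer(transformers=[
        ("text1", make_text_pipeline(), "Extract"),
        ("text2", make_text_pipeline(), "Additional_Text"),
    ])),
    ("classifier", MultinomialNB()),
])

# Parameter path: <pipeline step>__<transformer name>__<inner step>__<parameter>
tunable_clf.set_params(
    preprocessor__text1__count__stop_words="english",
    preprocessor__text2__count__ngram_range=(1, 2),
    preprocessor__text2__count__max_features=5000,  # arbitrary illustrative cap
)

print(tunable_clf.get_params()["preprocessor__text2__count__ngram_range"])
```

The same parameter names can be used as keys in a GridSearchCV parameter grid if you want to search over settings for each text column separately.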
The question comes from Stack Exchange; it was asked by Ayush Agrawal.




