Using TfidfVectorizer to Extract Character-Level Features for Neural Network Name Inputs

阿华AIGC实验室

2026-5-25

Extracting Character-Level TF Features for Names with TfidfVectorizer

You want to extract character-level 2-gram TF (term frequency) features from your name samples, convert them into an array for classification, and use TfidfVectorizer to do the heavy lifting. Here's a step-by-step solution with code examples:

Step 1: Set Up Dependencies and Data

First, import the necessary libraries and define your text dataset:

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

# Your original name dataset
text = ['James Jackson Jammy', 'Steve Smith Something', 'Chamak Chalo Chanta', 'Polo Rolo Colo']

Step 2: Configure the TF Vectorizer

We need to configure TfidfVectorizer to work at the character level instead of the default word level, and to return only TF values (not TF-IDF):

# Configure vectorizer for character 2-grams, TF-only (no IDF or normalization)
tf_vectorizer = TfidfVectorizer(
    analyzer='char',          # Process text at the character level
    ngram_range=(2, 2),       # Extract only 2-character sequences (ja, am, etc.)
    use_idf=False,            # Disable IDF calculation to get pure TF values
    norm=None                 # Skip normalization so values are raw term counts
)
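
To make concrete what this configuration counts, here is a minimal pure-Python sketch of character 2-gram term frequencies for a single name. It only illustrates the idea and is not scikit-learn's implementation; the helper name char_bigram_counts is hypothetical, and it assumes the vectorizer's default lowercasing.

```python
from collections import Counter


def char_bigram_counts(name):
    """Count overlapping 2-character sequences in a lowercased name (illustrative helper)."""
    lowered = name.lower()
    # Slide a window of width 2 across the string; spaces count as characters too
    bigrams = [lowered[i:i + 2] for i in range(len(lowered) - 1)]
    return Counter(bigrams)


counts = char_bigram_counts('Polo Rolo Colo')
print(counts)
```

Note that 2-grams spanning a space (such as 'o ' and ' r') are counted as well, because analyzer='char' treats the whole string as one character sequence.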

Step 3: Generate TF Features and Convert to Array

Fit the vectorizer to your dataset and transform the text into a numerical feature array:

# Fit to the corpus and convert text to TF feature matrix
tf_features_matrix = tf_vectorizer.fit_transform(text)

# Convert sparse matrix to a dense numpy array for classification use
tf_features_array = tf_features_matrix.toarray()
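
At prediction time, new names need to be mapped into the same feature space. The following is a minimal sketch, assuming the same configuration as above. The name 'Jack Smith' is a made-up example, and 2-grams that were never seen during fitting are simply ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

text = ['James Jackson Jammy', 'Steve Smith Something', 'Chamak Chalo Chanta', 'Polo Rolo Colo']

tf_vectorizer = TfidfVectorizer(analyzer='char', ngram_range=(2, 2), use_idf=False, norm=None)
train_array = tf_vectorizer.fit_transform(text).toarray()

# transform() (not fit_transform()) reuses the vocabulary learned from the training names
new_array = tf_vectorizer.transform(['Jack Smith']).toarray()

print(train_array.shape)
print(new_array.shape)
```

Both arrays have the same number of columns, so a classifier trained on the first can score the second.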

Step 4: Inspect the Results (Optional)

To verify what features were extracted and their values, you can print the feature names and corresponding TF values:

# Get all unique character 2-gram features
feature_names = tf_vectorizer.get_feature_names_out()
print("Extracted Character 2-gram Features:")
print(feature_names)

# Print TF values for each sample
print("\nTF Feature Array:")
print(tf_features_array)

Bonus: Extract TF Features for Individual Words (First, Middle, and Last Name)

If you need to process each name component separately (instead of the full name string), split the text into individual words first:

# Split all names into individual words (first, middle, last)
all_individual_words = [word for full_name in text for word in full_name.split()]

# Create a separate vectorizer so the vocabulary is learned from the individual words
tf_word_vectorizer = TfidfVectorizer(analyzer='char', ngram_range=(2, 2), use_idf=False, norm=None)
tf_word_features = tf_word_vectorizer.fit_transform(all_individual_words)
tf_word_features_array = tf_word_features.toarray()

# Feature names must come from the word-level vectorizer; its vocabulary differs from tf_vectorizer's
word_feature_names = tf_word_vectorizer.get_feature_names_out()

# Print results for each word
print("\nTF Features for Individual Words:")
for word, features in zip(all_individual_words, tf_word_features_array):
    print(f"\nWord: {word}")
    # Only show 2-grams that actually appear in the word
    active_features = {word_feature_names[i]: features[i] for i in range(len(features)) if features[i] > 0}
    print("Character 2-gram TF Values:", active_features)
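
If the classifier expects one row per person rather than one row per word, the per-word vectors can be regrouped. The sketch below assumes every name has exactly three parts, as in the sample data; names with a different number of parts would need padding or another scheme. The variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

text = ['James Jackson Jammy', 'Steve Smith Something', 'Chamak Chalo Chanta', 'Polo Rolo Colo']
parts_per_name = 3  # Assumption: first, middle, and last name are always present

words = [word for full_name in text for word in full_name.split()]

vectorizer = TfidfVectorizer(analyzer='char', ngram_range=(2, 2), use_idf=False, norm=None)
word_array = vectorizer.fit_transform(words).toarray()

# Rows are ordered first, middle, last for each name, so a row-major reshape
# concatenates the three per-word vectors into a single row per name
n_features = word_array.shape[1]
per_name_array = word_array.reshape(len(text), parts_per_name * n_features)

print(per_name_array.shape)
```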

Save the Feature Array

To save the array for later use in classification:

# Save full name features
np.save('full_name_tf_features.npy', tf_features_array)

# Save individual word features (if needed)
np.save('individual_word_tf_features.npy', tf_word_features_array)
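
Loading the array back is the mirror image of saving it. Here is a small round-trip sketch that uses a temporary directory and a stand-in array; the values are placeholders, not real features.

```python
import os
import tempfile

import numpy as np

# Placeholder values standing in for a real TF feature array
stand_in_features = np.array([[1.0, 0.0, 2.0], [0.0, 3.0, 1.0]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'full_name_tf_features.npy')
    np.save(path, stand_in_features)
    # np.load reads the whole array into memory, so it stays usable after the directory is removed
    loaded_features = np.load(path)

print(loaded_features.shape)
```

The saved array does not carry the vocabulary, so keep the fitted vectorizer (or at least its feature names) if new names must be transformed consistently later.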

Key Notes:

analyzer='char' is critical here: it tells the vectorizer to look at characters instead of whole words. Text is lowercased by default, so 'Ja' is counted as 'ja'.
ngram_range=(2,2) restricts us to exactly 2-character sequences. Adjust this if you want longer/shorter character groups (e.g., (1,3) for 1, 2, and 3-character n-grams).
Setting use_idf=False and norm=None ensures you get raw term counts (TF values) instead of scaled TF-IDF scores.
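
As a sanity check on the last note, the configuration can be compared against CountVectorizer, which produces raw counts directly. This sketch assumes default settings for everything not shown (in particular sublinear_tf=False).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

text = ['James Jackson Jammy', 'Steve Smith Something', 'Chamak Chalo Chanta', 'Polo Rolo Colo']

tf_array = TfidfVectorizer(
    analyzer='char', ngram_range=(2, 2), use_idf=False, norm=None
).fit_transform(text).toarray()

count_array = CountVectorizer(
    analyzer='char', ngram_range=(2, 2)
).fit_transform(text).toarray()

# The values should match; only the dtype differs (float vs. integer)
same_values = np.array_equal(tf_array, count_array)
print(same_values)
```

If raw counts are all that is needed, CountVectorizer may be the simpler choice. TfidfVectorizer remains convenient when you might later turn IDF weighting or normalization back on.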

The question in this post comes from Stack Exchange; it was asked by Raady.