How do I replicate R's text cosine similarity in Python and get consistent hierarchical clustering?
Hey there! It's super common to hit discrepancies between R and Python when working with text clustering: small differences in preprocessing, matrix setup, or clustering parameters can throw the results off entirely. Let's break down the likely culprits and fix this step by step.
First, Align Your Text Preprocessing
You mentioned you’ve done basic tokenization, but R’s tm package and Python’s libraries (like NLTK/Scikit-learn) have subtle defaults that can diverge. Let’s replicate R’s standard preprocessing in Python to eliminate this variable:
R’s Typical Preprocessing (for reference)
    library(tm)

    corpus <- VCorpus(VectorSource(df$CDESCR))
    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, removeNumbers)
    corpus <- tm_map(corpus, removeWords, stopwords("english"))
    corpus <- tm_map(corpus, stemDocument)  # Stemming is delegated to the SnowballC package

    # Despite the variable name, this is a document-term matrix (documents as rows)
    tdm <- DocumentTermMatrix(corpus)
Python Equivalent Preprocessing
    import string

    import nltk
    import pandas as pd
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer
    from nltk.tokenize import word_tokenize

    # Download NLTK resources if you haven't already
    # (recent NLTK releases may also ask for 'punkt_tab')
    nltk.download('punkt')
    nltk.download('stopwords')

    # Load your dataset (replace with your file path)
    df = pd.read_csv("your_dataset.csv")

    # Build these once instead of on every call
    # Note: NLTK's English stopword list is not identical to tm's stopwords("english")
    stop_words = set(stopwords.words('english'))
    # tm's stemDocument goes through SnowballC; if stems differ from R's,
    # try nltk.stem.SnowballStemmer("english") instead
    stemmer = PorterStemmer()

    def preprocess_text(text):
        # Match R's tolower()
        text = text.lower()
        # Remove punctuation (matches removePunctuation for ASCII punctuation)
        text = text.translate(str.maketrans('', '', string.punctuation))
        # Remove numbers (matches removeNumbers)
        text = ''.join(char for char in text if not char.isdigit())
        # Tokenize (approximates tm's internal tokenizer)
        tokens = word_tokenize(text)
        # Remove stopwords (mirrors removeWords(stopwords("english")))
        tokens = [tok for tok in tokens if tok not in stop_words]
        # Stemming (mirrors stemDocument)
        tokens = [stemmer.stem(tok) for tok in tokens]
        return ' '.join(tokens)

    # Apply to your CDESCR column (missing values become empty strings)
    df["processed_text"] = df["CDESCR"].fillna("").apply(preprocess_text)
Match Your Term-Document Matrix (TDM) Setup
Next, make sure your document-term matrix in Python matches exactly what R generates:
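- If R uses raw word counts (the default in DocumentTermMatrix), use Python's CountVectorizer.
- If R uses TF-IDF, switch to TfidfVectorizer. The two libraries' default TF-IDF formulas are not the same, so expect the weights to differ unless you match the formula as well.
- If you filtered sparse terms in R (e.g., removeSparseTerms(tdm, 0.98)), mirror this with min_df in Python. A sparsity threshold of 0.98 keeps terms that appear in more than 2% of documents, so the closest setting is a proportion such as min_df=0.02, not an absolute count like min_df=2.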
    from sklearn.feature_extraction.text import CountVectorizer

    # Create DTM (matches R's DocumentTermMatrix)
    vectorizer = CountVectorizer()
    dtm = vectorizer.fit_transform(df["processed_text"])
    dtm_array = dtm.toarray()
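Two vectorizer settings are worth pinning down explicitly. Here is a minimal sketch on a made-up four-document corpus (the documents are hypothetical, not your data). It assumes tm's default of dropping terms shorter than three characters, and it pairs removeSparseTerms(tdm, 0.7) with min_df=0.3. As far as I can tell, R's comparison is strict while min_df is inclusive, so double-check any term that sits exactly on the threshold.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for df["processed_text"] (hypothetical data)
docs = [
    "engin stall highway",
    "engin nois ok",
    "brake fail highway",
    "brake nois at stop",
]

vectorizer = CountVectorizer(
    # Keep only tokens of 3+ characters, mimicking tm's default wordLengths = c(3, Inf)
    token_pattern=r"(?u)\b\w{3,}\b",
    # Keep terms found in at least 30% of documents (here: 2 of 4 documents)
    min_df=0.3,
)
dtm = vectorizer.fit_transform(docs)

terms = sorted(vectorizer.vocabulary_)
print(terms)      # ['brake', 'engin', 'highway', 'nois']
print(dtm.shape)  # (4, 4)
```

Comparing this term list against Terms(tdm) from R is a quick way to confirm the two matrices have the same columns before you look at any distances.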
Fix Cosine Distance Calculation
This is a frequent source of mismatch! R and Python handle similarity/distance matrices differently:
- R's cosine() (from the lsa package) calculates similarity between the columns of a matrix. With documents as rows, you need to transpose first: 1 - cosine(t(as.matrix(tdm))) gives a document-to-document distance matrix.
- Python's cosine_similarity() calculates similarity between rows, so it compares documents directly (our DTM has documents as rows). Convert this to a distance matrix by subtracting from 1.
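    import numpy as np
    from scipy.spatial.distance import squareform
    from sklearn.metrics.pairwise import cosine_similarity

    # Calculate document-to-document cosine similarity
    cos_sim = cosine_similarity(dtm_array)

    # Convert to distance matrix (1 - similarity)
    cos_dist = 1 - cos_sim

    # Rounding can leave tiny negative or asymmetric values; clean them up
    # so squareform's symmetry and zero-diagonal checks pass
    cos_dist = np.clip((cos_dist + cos_dist.T) / 2, 0, None)
    np.fill_diagonal(cos_dist, 0)

    # For scipy's clustering, convert to condensed distance matrix (matches R's as.dist())
    condensed_dist = squareform(cos_dist)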
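If you want a quick sanity check on the distance step, scipy's pdist can produce the condensed cosine distances directly, and the two routes should agree. This is only a sketch on a hand-made count matrix (the numbers are invented for illustration): rows 0 and 1 point in the same direction, and row 2 shares no terms with either.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical 3-document, 4-term count matrix (documents as rows)
counts = np.array([
    [2, 1, 0, 0],
    [4, 2, 0, 0],   # same direction as row 0, so cosine distance ~0
    [0, 0, 3, 1],   # no shared terms with rows 0 and 1, so distance 1
], dtype=float)

# Route 1: condensed distances straight from scipy
condensed = pdist(counts, metric="cosine")

# Route 2: 1 - similarity from scikit-learn, as a square matrix
square = 1.0 - cosine_similarity(counts)

# Both routes should give the same document-to-document distances
print(np.allclose(squareform(condensed), square))
```

One thing to watch: a document that ends up empty after preprocessing is an all-zero row. scikit-learn reports similarity 0 for it, while pdist returns NaN, so it is safest to drop such rows on both the R and Python side.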
Align Hierarchical Clustering Parameters
The biggest mismatch often comes from clustering method defaults:
- R's hclust() uses method="complete" by default; if you used method="ward.D2" (common for text), the scipy equivalent is method="ward".
- Sklearn's AgglomerativeClustering uses linkage="ward" by default, but Ward linkage there only works on raw feature vectors with Euclidean distance. To pass a precomputed distance matrix you have to choose another linkage, such as "complete" or "average".
Option 1: Use Scipy to Mirror R’s hclust
Given the same distance matrix and method, this should give you a dendrogram that matches R's, apart from possible differences in leaf order and tie-breaking:
    import matplotlib.pyplot as plt
    import scipy.cluster.hierarchy as sch

    # Perform clustering (ward.D2 in R = "ward" in scipy)
    hc = sch.linkage(condensed_dist, method="ward")

    # Plot dendrogram to compare with R's output
    sch.dendrogram(hc)
    plt.show()
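If what you really want to compare is flat cluster labels, you can also cut the scipy tree directly, which plays roughly the role of R's cutree(hc, k = 2). A minimal sketch on four made-up points, using complete linkage (R's hclust default); the data and the choice of two clusters are assumptions for illustration only.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Hypothetical vectors: rows 0-1 point one way, rows 2-3 another
points = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.1, 0.9],
])

# Condensed cosine distances, then complete-linkage clustering
condensed = pdist(points, metric="cosine")
tree = linkage(condensed, method="complete")

# Cut the tree into at most 2 flat clusters
labels = fcluster(tree, t=2, criterion="maxclust")
print(labels)
```

The label numbers themselves are arbitrary, so compare which documents are grouped together, not the raw IDs.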
Option 2: Use Sklearn for Cluster Labels
If you need cluster assignments:
    from sklearn.cluster import AgglomerativeClustering

    # Ward linkage cannot be combined with a precomputed distance matrix in sklearn,
    # so this uses complete linkage (R's hclust default)
    clusterer = AgglomerativeClustering(
        n_clusters=5,  # Adjust to your desired number of clusters
        metric="precomputed",  # This parameter was named affinity= before scikit-learn 1.2
        linkage="complete",
    )
    cluster_labels = clusterer.fit_predict(cos_dist)

    # Add labels to your dataframe
    df["cluster"] = cluster_labels
Final Check List
To ensure full alignment:
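- Confirm you're using the same stemming/lemmatization on both sides (R's stemDocument relies on SnowballC, so compare a sample of stems against NLTK's output)
- Verify you're filtering the same low-frequency terms (match min_df to R's removeSparseTerms)
- Check if R normalized the TDM (e.g., with scale()). If so, apply the same transform explicitly in Python, since scikit-learn's vectorizers have no normalize argument. StandardScaler is the closest match to scale(), but it divides by the population standard deviation where R uses the sample standard deviation
- Ensure clustering method matches exactly (e.g., "ward.D2" → "ward", "complete" → "complete")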
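Once everything is aligned, remember that cluster IDs are arbitrary: R's cluster 1 may well be Python's cluster 2. One way to check agreement without worrying about numbering is the adjusted Rand index. A small sketch, where both label lists are hypothetical stand-ins for labels exported from R's cutree() and produced in Python:

```python
from sklearn.metrics import adjusted_rand_score

# Hypothetical labels for the same six documents
r_labels = [1, 1, 2, 2, 3, 3]    # e.g. exported from R
py_labels = [2, 2, 0, 0, 1, 1]   # e.g. produced in Python

# 1.0 means the two partitions are identical up to renaming of the clusters
score = adjusted_rand_score(r_labels, py_labels)
print(score)  # 1.0
```

A score well below 1.0 tells you the groupings genuinely differ, at which point it is worth walking back through the checklist above, starting with the vocabulary.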
This question comes from Stack Exchange; asked by phoenixio.




