How to Replicate R’s Text Cosine Similarity in Python and Get Consistent Hierarchical Clustering

Ahua AIGC Lab

2026-5-19

Fixing Inconsistent Hierarchical Clustering Results Between R and Python for Text Data

Hey there! It’s super common to hit discrepancies between R and Python when clustering text: small differences in preprocessing, matrix setup, or clustering parameters can change the result entirely. Let’s break down the likely culprits and fix this step by step.

First, Align Your Text Preprocessing

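You mentioned you’ve done basic tokenization, but R’s tm package and Python’s libraries (like NLTK/Scikit-learn) have defaults that diverge in subtle ways: the stopword lists are not identical, the stemmers differ, and tm drops terms shorter than three characters when it builds the matrix. Let’s replicate R’s standard preprocessing in Python as closely as we can to eliminate this variable: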

R’s Typical Preprocessing (for reference)

library(tm)
corpus <- VCorpus(VectorSource(df$CDESCR))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument) # Stems via SnowballC (Snowball "english" stemmer)
tdm <- DocumentTermMatrix(corpus)

Python Equivalent Preprocessing

import string

import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

# Download the NLTK stopword list if you haven't already
nltk.download('stopwords')

# Load your dataset (replace with your file path)
df = pd.read_csv("your_dataset.csv")

# Build these once instead of on every call.
# Note: NLTK's English stopword list is not identical to tm's stopwords("english");
# for an exact match, export the list from R and load it here instead.
stop_words = set(stopwords.words('english'))
# tm's stemDocument calls SnowballC with language "english", so NLTK's
# SnowballStemmer is likely a closer match than PorterStemmer
stemmer = SnowballStemmer('english')

def preprocess_text(text):
    # Match R's tolower()
    text = text.lower()
    # Remove punctuation (ASCII punctuation, compare removePunctuation)
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Remove numbers (matches removeNumbers)
    text = ''.join(char for char in text if not char.isdigit())
    # Split on whitespace; word_tokenize also splits contractions, which tm does not
    tokens = text.split()
    # Remove stopwords (compare removeWords(stopwords("english")))
    tokens = [tok for tok in tokens if tok not in stop_words]
    # Stem each token (compare stemDocument)
    tokens = [stemmer.stem(tok) for tok in tokens]
    return ' '.join(tokens)

# Apply to your CDESCR column (missing values become empty strings)
df["processed_text"] = df["CDESCR"].fillna("").apply(preprocess_text)
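A quick way to confirm that the preprocessing really lines up is to compare the vocabularies from both sides. The sketch below assumes you have exported R's terms yourself (for example with writeLines(Terms(tdm), "r_terms.txt"), where the file name is hypothetical) and collected the Python terms separately. compare_vocab is an illustrative helper, not part of any library, and the term lists are made-up stand-ins.

```python
def compare_vocab(r_terms, py_terms):
    """Return the terms found on only one side, sorted for easy reading."""
    r_set, py_set = set(r_terms), set(py_terms)
    return sorted(r_set - py_set), sorted(py_set - r_set)


# Toy vocabularies standing in for the real exports (made-up values)
r_terms = ["brake", "engin", "fail", "vehicl"]
py_terms = ["brake", "engine", "fail", "vehicl", "go"]

only_in_r, only_in_py = compare_vocab(r_terms, py_terms)
print("Only in R:", only_in_r)
print("Only in Python:", only_in_py)
```

Terms that show up on one side only usually point at a stemmer, stopword, or minimum-word-length difference.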

Match Your Document-Term Matrix (DTM) Setup

Next, make sure your document-term matrix in Python matches exactly what R generates:

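If R uses raw word counts (the default in DocumentTermMatrix), use Python’s CountVectorizer. Keep in mind that tm drops terms shorter than three characters by default (wordLengths = c(3, Inf)), while CountVectorizer keeps two-character tokens.
If R uses TF-IDF, switch to TfidfVectorizer, but expect small numeric differences: tm’s weightTfIdf and sklearn use different IDF formulas and normalization defaults.
If you filtered sparse terms in R (e.g., removeSparseTerms(tdm, 0.98)), mirror this with min_df in Python. The 0.98 is a sparsity proportion, so the closest setting is a float such as min_df=0.02 (terms appearing in roughly 2% of documents or more), not an absolute count like min_df=2.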

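from sklearn.feature_extraction.text import CountVectorizer

# Create DTM (compare with R's DocumentTermMatrix)
# token_pattern keeps only tokens of 3+ characters, like tm's default wordLengths = c(3, Inf)
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w{3,}\b")
dtm = vectorizer.fit_transform(df["processed_text"])
dtm_array = dtm.toarray()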
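To see what a sparsity threshold does to the vocabulary, here is a minimal pure-Python sketch of the filtering rule. It assumes removeSparseTerms keeps a term when its document frequency is strictly greater than n_docs * (1 - sparse); that is my reading of tm's behavior, so check the boundary against your own R output. The function name and the toy documents are made up for illustration.

```python
from collections import Counter


def keep_terms_like_remove_sparse(tokenized_docs, sparse):
    """Keep terms whose document frequency exceeds n_docs * (1 - sparse).

    Assumption: this mirrors tm's removeSparseTerms rule; verify the exact
    boundary against R before relying on it.
    """
    n_docs = len(tokenized_docs)
    # Count each term once per document, regardless of repeats
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    threshold = n_docs * (1 - sparse)
    return sorted(term for term, df in doc_freq.items() if df > threshold)


# Four toy documents, already tokenized and stemmed (made-up values)
docs = [
    ["brake", "fail"],
    ["brake", "nois"],
    ["engin", "fail"],
    ["brake", "leak"],
]
kept = keep_terms_like_remove_sparse(docs, sparse=0.6)
print(kept)
```

In sklearn, a float min_df drops terms whose document proportion is strictly below the value, so min_df=1 - sparse is close but can differ for terms sitting exactly on the boundary.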

Fix Cosine Distance Calculation

This is a frequent source of mismatch! R and Python orient the matrix differently when computing similarity:

R’s cosine() (from the lsa package) computes similarity between the columns of a matrix. With a document-term matrix (documents as rows), transpose first so that documents become columns: 1 - cosine(t(as.matrix(tdm))), then wrap the result in as.dist() to get a distance object.
Python’s cosine_similarity() computes similarity between rows, so it compares documents directly (our DTM has documents as rows). Convert this to a distance matrix by subtracting from 1.

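import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from scipy.spatial.distance import squareform

# Calculate document-to-document cosine similarity
cos_sim = cosine_similarity(dtm_array)
# Convert to distance matrix (1 - similarity); clip tiny negatives caused by rounding
cos_dist = np.clip(1 - cos_sim, 0, None)
# Rounding can leave the diagonal slightly off zero, which squareform rejects
np.fill_diagonal(cos_dist, 0)

# For scipy's clustering, convert to condensed distance matrix (matches R's as.dist())
# checks=False skips the exact-symmetry test, which floating-point noise can fail
condensed_dist = squareform(cos_dist, checks=False)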
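Before moving on to clustering, it helps to confirm the distance matrix itself on a few documents. The sketch below, using a made-up 3 x 4 count matrix, computes cosine distance by hand and checks it against scipy's pdist with metric="cosine". The same handful of rows can be run through 1 - cosine(t(...)) in R for a side-by-side comparison.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Toy document-term counts: 3 documents x 4 terms (made-up values)
dtm_toy = np.array([
    [2, 0, 1, 0],
    [1, 0, 1, 0],
    [0, 3, 0, 1],
], dtype=float)

# Cosine distance by hand: 1 - (a . b) / (|a| |b|)
norms = np.linalg.norm(dtm_toy, axis=1)
manual_dist = 1 - (dtm_toy @ dtm_toy.T) / np.outer(norms, norms)
np.fill_diagonal(manual_dist, 0)  # remove rounding noise on the diagonal

# scipy computes the same quantity directly in condensed form,
# ordered as pairs (0,1), (0,2), (1,2)
condensed = pdist(dtm_toy, metric="cosine")

print(np.round(squareform(condensed), 4))
print(np.allclose(squareform(condensed), manual_dist))
```

One caveat: a document that ends up with no terms after preprocessing has a zero vector, and the libraries treat that case differently (NaN in some, similarity 0 in others), so it is worth dropping empty documents on both sides.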

Align Hierarchical Clustering Parameters

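The biggest mismatch often comes from clustering method defaults:

R’s hclust() uses method="complete" by default; if you used method="ward.D2" (common for text), the equivalent in scipy is method="ward". R’s older method="ward.D" is a different algorithm and will not match. Ward assumes Euclidean distances, so with cosine distances both tools will run, but treat the tree as a heuristic.
Sklearn’s AgglomerativeClustering also defaults to linkage="ward", but there ward only works on raw feature vectors with Euclidean distance and rejects a precomputed distance matrix. With precomputed cosine distances, use linkage="complete" or linkage="average" instead.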

Option 1: Use Scipy to Mirror R’s `hclust`

This should reproduce the merge heights from R’s hclust, so the two dendrograms are easy to compare (leaf order in the plot and the handling of ties can still differ):

import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as sch

# Perform clustering (ward.D2 in R = "ward" in scipy)
hc = sch.linkage(condensed_dist, method="ward")

# Plot dendrogram to compare with R's output
sch.dendrogram(hc)
plt.show()
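Comparing plots by eye is error-prone, so a numeric check is often easier: the third column of scipy's linkage matrix holds the merge heights, which correspond to hc$height in R. The sketch below uses four made-up points on a line with Euclidean distances, only to keep the arithmetic easy to follow; with your data you would pass the condensed cosine distances instead.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

# Four toy points on a line, standing in for documents (made-up values)
points = np.array([[0.0], [1.0], [5.0], [6.0]])
condensed = pdist(points)  # Euclidean here, purely for readable numbers

Z = linkage(condensed, method="ward")

# Column 2 of the linkage matrix holds the merge heights;
# the corresponding vector in R is hc$height
heights = Z[:, 2]
print(np.round(heights, 4))
```

If the sorted heights agree between R and Python to several decimal places, the distance matrix and linkage method are aligned, and any remaining difference is in how the tree is cut or drawn.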

Option 2: Use Sklearn for Cluster Labels

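If you need cluster assignments from sklearn, note that linkage="ward" is rejected when the distances are precomputed, so pick a linkage that accepts them:

from sklearn.cluster import AgglomerativeClustering

# Ward in sklearn needs raw features with Euclidean distance, so use
# complete linkage (R's hclust default) with the precomputed cosine distances.
# To match a ward.D2 tree from R, cut the scipy linkage from Option 1 with
# scipy.cluster.hierarchy.fcluster instead.
clusterer = AgglomerativeClustering(
    n_clusters=5, # Adjust to your desired number of clusters
    metric="precomputed", # named affinity= in scikit-learn versions before 1.2
    linkage="complete"
)
cluster_labels = clusterer.fit_predict(cos_dist)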

# Add labels to your dataframe
df["cluster"] = cluster_labels
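If the goal is to match labels from cutree() in R, cutting the scipy tree is usually the more direct route, since it reuses the same linkage as the dendrogram. A minimal sketch on made-up points, assuming a ward linkage and two clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Four toy points on a line, standing in for documents (made-up values)
points = np.array([[0.0], [1.0], [5.0], [6.0]])
Z = linkage(pdist(points), method="ward")

# criterion="maxclust" asks for at most t flat clusters,
# similar in spirit to cutree(hc, k = 2) in R
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

The label numbers themselves are arbitrary on both sides, so compare which documents are grouped together rather than the raw IDs.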

Final Checklist

To ensure full alignment:

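Confirm you’re using the same stemmer on both sides (tm’s stemDocument goes through SnowballC, so compare the stems of a few sample words against your Python stemmer)
Verify you’re filtering the same low-frequency terms (match min_df to R’s removeSparseTerms, as a proportion)
Check if R normalized or scaled the matrix (e.g., scale()); CountVectorizer has no normalization option, so apply the same step in Python afterwards (e.g., sklearn.preprocessing.normalize or StandardScaler)
Ensure the clustering method matches exactly (e.g., "ward.D2" → "ward", "complete" → "complete")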
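Once everything on the list checks out, the final comparison is between the cluster assignments themselves. Because cluster IDs are arbitrary (R numbers from 1, Python often from 0, and the order can differ), a direct equality test is misleading. The helper below is an illustrative sketch, not a library function, and the label vectors are made up: it reports whether two labelings describe the same grouping regardless of the names used.

```python
def same_partition(labels_a, labels_b):
    """True if two labelings group the items identically, ignoring label names."""
    if len(labels_a) != len(labels_b):
        return False
    a_to_b, b_to_a = {}, {}
    for a, b in zip(labels_a, labels_b):
        # Each label on one side must map to exactly one label on the other
        if a_to_b.setdefault(a, b) != b or b_to_a.setdefault(b, a) != a:
            return False
    return True


r_labels = [1, 1, 2, 2, 3]   # e.g. exported from cutree() in R (made-up values)
py_labels = [0, 0, 2, 2, 1]  # e.g. produced in Python (made-up values)
print(same_partition(r_labels, py_labels))
```

For partial agreement, a score such as sklearn.metrics.adjusted_rand_score gives a graded answer instead of a yes/no.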

The question in this post comes from Stack Exchange; asked by phoenixio.