
How can I build on existing PCA results to further reduce dimensions and apply them to cluster analysis?

How to Further Reduce PCA Dimensions for Cluster Analysis

Hey there! Let's walk through exactly how to build on the PCA work you've already completed to get a smaller set of PCs for your clustering task. I'll break this into actionable steps with R code examples tailored to your workflow.

Step 1: Extract the 6 PCs with Eigenvalues > 1

First, let's formalize extracting those 6 PCs you identified. We'll validate the eigenvalues (just to confirm the eigenvalue > 1 rule picks out the same 6 components) and then pull the corresponding PC scores:

# Calculate eigenvalues from your initial PCA
eigen_values <- Mypca$sdev^2
print("Eigenvalues of initial PCA:")
print(eigen_values)

# Get indices of PCs with eigenvalue > 1 (matches your 6 PCs)
selected_pc_indices <- which(eigen_values > 1)

# Extract the scores for these 6 PCs
pc_scores_6 <- Mypca$x[, selected_pc_indices]

Now you have a dataset with 6 columns, each representing one of your selected PCs.

Step 2: Further Reduce Dimensions Using Loadings

Next, we'll use the loadings matrix to identify redundancy or overlapping information among these 6 PCs, so we can trim them down even more. Here are a few practical approaches:

Option 1: Check for Correlated PCs

One caveat first: scores from a single prcomp run are orthogonal by construction, so their pairwise correlations will be essentially zero and this check will keep all six. It becomes useful when your PC columns don't come straight from one PCA—for example, scores projected onto new data, or columns you've replaced with combined scores (see Option 2). If two columns are highly correlated, they're carrying similar information, and you can keep the one with higher variance:

# Calculate correlation matrix for the 6 PCs
pc_correlation <- cor(pc_scores_6)
print("Correlation matrix of the 6 PCs:")
print(pc_correlation)

# Drop the later member of any pair with |correlation| > 0.7;
# PCs are ordered by decreasing variance, so the earlier (higher-variance) one is kept
keep_pcs <- seq_len(ncol(pc_scores_6))
for (i in 1:(ncol(pc_scores_6) - 1)) {
  for (j in (i + 1):ncol(pc_scores_6)) {
    if (i %in% keep_pcs && j %in% keep_pcs && abs(pc_correlation[i, j]) > 0.7) {
      keep_pcs <- setdiff(keep_pcs, j)
    }
  }
}
trimmed_pc_scores <- pc_scores_6[, keep_pcs]

Option 2: Analyze Loadings for Interpretability

The loadings matrix shows how each original variable contributes to a PC. If multiple PCs load heavily on the same set of original variables, they're capturing the same underlying pattern—we can consolidate them:

# Extract loadings for the 6 PCs
loadings_6 <- Mypca$rotation[, selected_pc_indices]
print("Loadings for the 6 PCs:")
print(loadings_6)

# Visualize loadings to spot patterns (install ggplot2/reshape2 if needed)
library(ggplot2)
library(reshape2)

loadings_melted <- melt(loadings_6)
colnames(loadings_melted) <- c("Original_Variable", "PC", "Loading")

ggplot(loadings_melted, aes(x = Original_Variable, y = Loading, fill = PC)) +
  geom_col(position = "dodge") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Variable Loadings for Selected PCs")

For example, if PC3 and PC5 both have high positive loadings on variables related to "customer spending", one pragmatic option is to merge them into a single score weighted by the share of variance each explains:

# Calculate variance explanation for each of the 6 PCs
var_explained <- eigen_values[selected_pc_indices] / sum(eigen_values)

# Combine PC3 and PC5 into a single score
combined_spending_pc <- (pc_scores_6[,3] * var_explained[3]) + (pc_scores_6[,5] * var_explained[5])

# Replace PC3 and PC5 with the combined score
trimmed_pc_scores <- cbind(pc_scores_6[,c(1,2,4,6)], combined_spending_pc)

Option 3: Secondary PCA (For Aggressive Dimension Reduction)

If you want a more automated way to compress the 6 PCs further, run a second PCA on them. One important caveat: if pc_scores_6 comes straight from a single prcomp, the scores are already uncorrelated, so a secondary PCA with scale. = TRUE will return eigenvalues of ~1 across the board and won't compress anything. This step pays off only after you've modified the columns—for example, after substituting combined scores as in Option 2:

# Run PCA on the 6 selected PCs
second_pca <- prcomp(pc_scores_6, center = TRUE, scale. = TRUE)

# Check eigenvalues for the secondary PCA
second_eigen <- second_pca$sdev^2
print("Eigenvalues of secondary PCA:")
print(second_eigen)

# Select PCs with eigenvalue >1 (or target cumulative variance, e.g., 80%)
final_pc_indices <- which(second_eigen > 1)
final_pc_scores <- second_pca$x[, final_pc_indices]

# Optional: Plot scree plot to visualize eigenvalue drop-off
plot(second_eigen, type = "b", xlab = "Principal Component", ylab = "Eigenvalue", main = "Scree Plot for Secondary PCA")
abline(h = 1, col = "red", lty = 2)

Note: Secondary PCA reduces interpretability (the new PCs are combinations of your original 6), but it's great for maximizing variance retention with fewer dimensions.
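You can see the caveat above in action with simulated data (a minimal sketch; the matrix and dimensions here are made up for illustration): scores from one prcomp are exactly uncorrelated, so a secondary PCA on the scaled scores finds nothing left to compress.

```r
# Sketch with simulated data: PC scores from a single prcomp are
# exactly uncorrelated, so a secondary PCA on the scaled scores
# yields eigenvalues of ~1 and offers no further compression.
set.seed(42)
X <- matrix(rnorm(100 * 8), nrow = 100)
scores <- prcomp(X)$x[, 1:6]

round(cor(scores), 10)   # identity matrix: off-diagonals are ~0

second <- prcomp(scores, center = TRUE, scale. = TRUE)
round(second$sdev^2, 10) # all six eigenvalues equal 1
```

This is why the secondary PCA only helps once the columns have been altered (e.g., by the combined scores from Option 2), which reintroduces correlations for it to exploit.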

Step 3: Use the Trimmed PCs for Cluster Analysis

Once you have your final set of PCs (say, 3-4), you can feed them directly into your clustering algorithm. Here's an example with k-means:

# Set seed for reproducibility
set.seed(123)

# Run k-means clustering (adjust centers to your needs;
# nstart > 1 reruns from multiple starting points to avoid poor local optima)
cluster_results <- kmeans(final_pc_scores, centers = 3, nstart = 25)

# View cluster assignments and statistics
print(cluster_results)

# Add cluster labels back to your original data (optional)
Pca_for_R$cluster <- cluster_results$cluster
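If you're unsure how many clusters to request, a quick elbow plot of the total within-cluster sum of squares can guide the choice. This is a sketch that assumes final_pc_scores already exists in your session (the range 1:8 is an arbitrary choice):

```r
# Sketch: elbow method for choosing k, assuming final_pc_scores
# holds your reduced PC scores. Look for the k where the curve
# stops dropping sharply.
set.seed(123)
wss <- sapply(1:8, function(k) {
  kmeans(final_pc_scores, centers = k, nstart = 25)$tot.withinss
})
plot(1:8, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares",
     main = "Elbow Plot")
```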

Key Tips

  • Prioritize interpretability if you need to explain your clustering results—go with loading-based consolidation over secondary PCA.
  • Aim to retain at least 70-80% of the total variance from the initial 6 PCs to ensure your clustering is robust.
  • Test different numbers of final PCs (e.g., 3 vs 4) to see which gives the most meaningful clusters.
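As a quick check for the second tip, you can compute how much of the 6-PC variance each prefix of components retains. A minimal sketch, assuming Mypca is the prcomp object from your initial run:

```r
# Sketch: cumulative share of the selected PCs' variance retained by
# the first k of them (assumes Mypca from your initial prcomp call)
eigen_values <- Mypca$sdev^2
top_eigen <- eigen_values[eigen_values > 1]
cum_share <- cumsum(top_eigen) / sum(top_eigen)
print(round(cum_share, 3))
# Pick the smallest k whose cumulative share reaches your 0.7-0.8 target
```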

This question comes from Stack Exchange; asked by Jannet Philip.
