How can I build on my existing PCA results to reduce dimensions further and use them for cluster analysis?
Hey there! Let's walk through exactly how to build on the PCA work you've already completed to get a smaller set of PCs for your clustering task. I'll break this into actionable steps with R code examples tailored to your workflow.
Step 1: Extract the 6 PCs with Eigenvalues > 1
First, let's formalize extracting those 6 PCs you identified. We'll start by validating the eigenvalues (just to confirm) and then pull the corresponding PC scores:
```r
# Calculate eigenvalues from your initial PCA
eigen_values <- Mypca$sdev^2
print("Eigenvalues of initial PCA:")
print(eigen_values)

# Get indices of PCs with eigenvalue > 1 (matches your 6 PCs)
selected_pc_indices <- which(eigen_values > 1)

# Extract the scores for these 6 PCs
pc_scores_6 <- Mypca$x[, selected_pc_indices]
```
Now you have a dataset with 6 columns, each representing one of your selected PCs.
Step 2: Further Reduce Dimensions Using Loadings
Next, we'll use the loadings matrix to identify redundancy or overlapping information among these 6 PCs, so we can trim them down even more. Here are a few practical approaches:
Option 1: Check for Correlated PCs
One caveat: the scores from a single PCA are uncorrelated by construction, so on the untouched scores this correlation matrix should be essentially the identity — treat it as a sanity check. It becomes genuinely informative once the scores have been modified (e.g., consolidated as in Option 2) or computed on new data. If two columns do turn out highly correlated, they're explaining similar variance, and you can keep the one with higher variance:
```r
# Calculate correlation matrix for the 6 PCs
pc_correlation <- cor(pc_scores_6)
print("Correlation matrix of the 6 PCs:")
print(pc_correlation)

# Example: keep PCs whose correlation with every other PC is below 0.7
keep_pcs <- c()
for (i in 1:ncol(pc_scores_6)) {
  if (all(abs(pc_correlation[i, -i]) < 0.7)) {
    keep_pcs <- c(keep_pcs, i)
  }
}
trimmed_pc_scores <- pc_scores_6[, keep_pcs]
```
Option 2: Analyze Loadings for Interpretability
The loadings matrix shows how each original variable contributes to a PC. If multiple PCs load heavily on the same set of original variables, they're capturing the same underlying pattern—we can consolidate them:
```r
# Extract loadings for the 6 PCs
loadings_6 <- Mypca$rotation[, selected_pc_indices]
print("Loadings for the 6 PCs:")
print(loadings_6)

# Visualize loadings to spot patterns (install ggplot2/reshape2 if needed)
library(ggplot2)
library(reshape2)

loadings_melted <- melt(loadings_6)
colnames(loadings_melted) <- c("Original_Variable", "PC", "Loading")

ggplot(loadings_melted, aes(x = Original_Variable, y = Loading, fill = PC)) +
  geom_col(position = "dodge") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Variable Loadings for Selected PCs")
```
For example, if PC3 and PC5 both have high positive loadings on variables related to "customer spending", you can create a combined score weighted by their variance explanation:
```r
# Variance explained by each of the 6 PCs (as a share of total variance)
var_explained <- eigen_values[selected_pc_indices] / sum(eigen_values)

# Combine PC3 and PC5 into a single weighted score
combined_spending_pc <- (pc_scores_6[, 3] * var_explained[3]) +
  (pc_scores_6[, 5] * var_explained[5])

# Replace PC3 and PC5 with the combined score
trimmed_pc_scores <- cbind(pc_scores_6[, c(1, 2, 4, 6)], combined_spending_pc)
```
Option 3: Secondary PCA (For Aggressive Dimension Reduction)
If you want a more automated way to compress things further, run a second PCA. One caveat: a second PCA on the untouched pc_scores_6 is a no-op — its columns are already uncorrelated, so with scale. = TRUE every eigenvalue comes out as exactly 1 and the eigenvalue > 1 rule would select nothing. It pays off once the columns are correlated, for example on the consolidated trimmed_pc_scores from Option 2, where it creates new orthogonal PCs that capture the maximum remaining variance:
```r
# Run PCA on the consolidated scores (e.g., trimmed_pc_scores from Option 2);
# on the raw, uncorrelated pc_scores_6 this step would change nothing
second_pca <- prcomp(trimmed_pc_scores, center = TRUE, scale. = TRUE)

# Check eigenvalues for the secondary PCA
second_eigen <- second_pca$sdev^2
print("Eigenvalues of secondary PCA:")
print(second_eigen)

# Select PCs with eigenvalue > 1 (or a target cumulative variance, e.g., 80%)
final_pc_indices <- which(second_eigen > 1)
final_pc_scores <- second_pca$x[, final_pc_indices, drop = FALSE]

# Optional: scree plot to visualize the eigenvalue drop-off
plot(second_eigen, type = "b",
     xlab = "Principal Component", ylab = "Eigenvalue",
     main = "Scree Plot for Secondary PCA")
abline(h = 1, col = "red", lty = 2)
```
Note: Secondary PCA reduces interpretability (the new PCs are combinations of your original 6), but it's great for maximizing variance retention with fewer dimensions.
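If you'd rather select by cumulative variance than the eigenvalue > 1 rule, that 80% threshold can be made explicit. A minimal sketch — the simulated input below is just a stand-in so the snippet runs on its own (in your session, reuse your own second_pca and second_eigen), and the 0.80 cutoff is an illustrative choice:

```r
# Self-contained demo: simulate a secondary PCA like the one above
# (in your session, reuse your own second_pca / second_eigen instead)
set.seed(123)
demo_scores <- matrix(rnorm(600), ncol = 6)
second_pca <- prcomp(demo_scores, center = TRUE, scale. = TRUE)
second_eigen <- second_pca$sdev^2

# Keep the fewest secondary PCs whose cumulative variance reaches 80%
cum_var2 <- cumsum(second_eigen) / sum(second_eigen)
n_keep <- which(cum_var2 >= 0.80)[1]

final_pc_scores <- second_pca$x[, 1:n_keep, drop = FALSE]
cat("Keeping", n_keep, "secondary PCs;",
    round(100 * cum_var2[n_keep], 1), "% of variance retained\n")
```

Unlike the eigenvalue rule, this always returns at least one PC, which makes it a safer default when the eigenvalues hover around 1.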
Step 3: Use the Trimmed PCs for Cluster Analysis
Once you have your final set of PCs (say, 3-4), you can feed them directly into your clustering algorithm. Here's an example with k-means:
```r
# Set seed for reproducibility
set.seed(123)

# Run k-means clustering (adjust centers to your needs)
cluster_results <- kmeans(final_pc_scores, centers = 3)

# View cluster assignments and statistics
print(cluster_results)

# Add cluster labels back to your original data (optional)
Pca_for_R$cluster <- cluster_results$cluster
```
Key Tips
- Prioritize interpretability if you need to explain your clustering results—go with loading-based consolidation over secondary PCA.
- Aim to retain at least 70-80% of the total variance from the initial 6 PCs to ensure your clustering is robust.
- Test different numbers of final PCs (e.g., 3 vs 4) to see which gives the most meaningful clusters.
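To put the last two tips into practice, here's a quick check of both: the cumulative variance retained as you keep more PCs, and a silhouette comparison of clustering on 3 vs 4 PCs. The demo data and var_explained values below are illustrative stand-ins so the snippet runs on its own — substitute your own pc_scores_6 and var_explained from the earlier steps:

```r
library(cluster)  # ships with R; provides silhouette()

# Illustrative stand-ins for your pc_scores_6 / var_explained
set.seed(123)
pc_scores_6 <- matrix(rnorm(600), ncol = 6)
var_explained <- c(0.25, 0.18, 0.12, 0.10, 0.08, 0.07)

# Tip 2: cumulative variance retained if you keep the first k of the 6 PCs
cum_var <- cumsum(var_explained)
print(round(cum_var, 3))  # look for where this crosses ~0.70-0.80

# Tip 3: compare average silhouette width when clustering on 3 vs 4 PCs
sil_means <- c()
for (k_pcs in c(3, 4)) {
  scores <- pc_scores_6[, 1:k_pcs]
  km <- kmeans(scores, centers = 3, nstart = 25)
  sil <- silhouette(km$cluster, dist(scores))
  sil_means[as.character(k_pcs)] <- mean(sil[, 3])
}
print(round(sil_means, 3))  # higher mean silhouette = better-separated clusters
```

The option with the higher mean silhouette width gives tighter, better-separated clusters, which is a reasonable tiebreaker between 3 and 4 PCs.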
The question comes from Stack Exchange, asked by Jannet Philip.