How can I build on my existing PCA results to reduce dimensions further and use them for cluster analysis?
Hey there! Let's walk through exactly how to build on the PCA work you've already completed to get a smaller set of PCs for your clustering task. I'll break this into actionable steps with R code examples tailored to your workflow.
Step 1: Extract the 6 PCs with Eigenvalues > 1
First, let's formalize extracting those 6 PCs you identified. We'll start by validating the eigenvalues (just to confirm) and then pull the corresponding PC scores:
```r
# Calculate eigenvalues from your initial PCA
eigen_values <- Mypca$sdev^2
print("Eigenvalues of initial PCA:")
print(eigen_values)

# Get indices of PCs with eigenvalue > 1 (matches your 6 PCs)
selected_pc_indices <- which(eigen_values > 1)

# Extract the scores for these 6 PCs
pc_scores_6 <- Mypca$x[, selected_pc_indices]
```
Now you have a dataset with 6 columns, each representing one of your selected PCs.
Step 2: Further Reduce Dimensions Using Loadings
Next, we'll use the loadings matrix to identify redundancy or overlapping information among these 6 PCs, so we can trim them down even more. Here are a few practical approaches:
Option 1: Check for Correlated PCs
One caveat: the scores from a single PCA are uncorrelated by construction, so on the untouched scores this correlation matrix should be essentially the identity — treat it as a sanity check. It becomes genuinely informative once the scores have been modified (e.g., consolidated as in Option 2) or computed on new data. If two columns do turn out highly correlated, they're explaining similar variance, and you can keep the one with higher variance:
```r
# Calculate correlation matrix for the 6 PCs
pc_correlation <- cor(pc_scores_6)
print("Correlation matrix of the 6 PCs:")
print(pc_correlation)

# Example: keep PCs whose correlation with every other PC is below 0.7
keep_pcs <- c()
for (i in 1:ncol(pc_scores_6)) {
  if (all(abs(pc_correlation[i, -i]) < 0.7)) {
    keep_pcs <- c(keep_pcs, i)
  }
}
trimmed_pc_scores <- pc_scores_6[, keep_pcs]
```
Option 2: Analyze Loadings for Interpretability
The loadings matrix shows how each original variable contributes to a PC. If multiple PCs load heavily on the same set of original variables, they're capturing the same underlying pattern—we can consolidate them:
```r
# Extract loadings for the 6 PCs
loadings_6 <- Mypca$rotation[, selected_pc_indices]
print("Loadings for the 6 PCs:")
print(loadings_6)

# Visualize loadings to spot patterns (install ggplot2/reshape2 if needed)
library(ggplot2)
library(reshape2)

loadings_melted <- melt(loadings_6)
colnames(loadings_melted) <- c("Original_Variable", "PC", "Loading")

ggplot(loadings_melted, aes(x = Original_Variable, y = Loading, fill = PC)) +
  geom_col(position = "dodge") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Variable Loadings for Selected PCs")
```
For example, if PC3 and PC5 both have high positive loadings on variables related to "customer spending", you can create a combined score weighted by their variance explanation:
```r
# Variance explained by each of the 6 PCs (as a share of total variance)
var_explained <- eigen_values[selected_pc_indices] / sum(eigen_values)

# Combine PC3 and PC5 into a single weighted score
combined_spending_pc <- (pc_scores_6[, 3] * var_explained[3]) +
  (pc_scores_6[, 5] * var_explained[5])

# Replace PC3 and PC5 with the combined score
trimmed_pc_scores <- cbind(pc_scores_6[, c(1, 2, 4, 6)], combined_spending_pc)
```
Option 3: Secondary PCA (For Aggressive Dimension Reduction)
If you want a more automated way to compress things further, run a second PCA. One caveat: a second PCA on the untouched pc_scores_6 is a no-op — its columns are already uncorrelated, so with scale. = TRUE every eigenvalue comes out as exactly 1 and the eigenvalue > 1 rule would select nothing. It pays off once the columns are correlated, for example on the consolidated trimmed_pc_scores from Option 2, where it creates new orthogonal PCs that capture the maximum remaining variance:
```r
# Run PCA on the consolidated scores (e.g., trimmed_pc_scores from Option 2);
# on the raw, uncorrelated pc_scores_6 this step would change nothing
second_pca <- prcomp(trimmed_pc_scores, center = TRUE, scale. = TRUE)

# Check eigenvalues for the secondary PCA
second_eigen <- second_pca$sdev^2
print("Eigenvalues of secondary PCA:")
print(second_eigen)

# Select PCs with eigenvalue > 1 (or a target cumulative variance, e.g., 80%)
final_pc_indices <- which(second_eigen > 1)
final_pc_scores <- second_pca$x[, final_pc_indices, drop = FALSE]

# Optional: scree plot to visualize the eigenvalue drop-off
plot(second_eigen, type = "b",
     xlab = "Principal Component", ylab = "Eigenvalue",
     main = "Scree Plot for Secondary PCA")
abline(h = 1, col = "red", lty = 2)
```
Note: Secondary PCA reduces interpretability (the new PCs are combinations of your original 6), but it's great for maximizing variance retention with fewer dimensions.
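If you'd rather select by cumulative variance than the eigenvalue > 1 rule, that 80% threshold can be made explicit. A minimal sketch — the simulated input below is just a stand-in so the snippet runs on its own (in your session, reuse your own second_pca and second_eigen), and the 0.80 cutoff is an illustrative choice:

```r
# Self-contained demo: simulate a secondary PCA like the one above
# (in your session, reuse your own second_pca / second_eigen instead)
set.seed(123)
demo_scores <- matrix(rnorm(600), ncol = 6)
second_pca <- prcomp(demo_scores, center = TRUE, scale. = TRUE)
second_eigen <- second_pca$sdev^2

# Keep the fewest secondary PCs whose cumulative variance reaches 80%
cum_var2 <- cumsum(second_eigen) / sum(second_eigen)
n_keep <- which(cum_var2 >= 0.80)[1]

final_pc_scores <- second_pca$x[, 1:n_keep, drop = FALSE]
cat("Keeping", n_keep, "secondary PCs;",
    round(100 * cum_var2[n_keep], 1), "% of variance retained\n")
```

Unlike the eigenvalue rule, this always returns at least one PC, which makes it a safer default when the eigenvalues hover around 1.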
Step 3: Use the Trimmed PCs for Cluster Analysis
Once you have your final set of PCs (say, 3-4), you can feed them directly into your clustering algorithm. Here's an example with k-means:
```r
# Set seed for reproducibility
set.seed(123)

# Run k-means clustering (adjust centers to your needs)
cluster_results <- kmeans(final_pc_scores, centers = 3)

# View cluster assignments and statistics
print(cluster_results)

# Add cluster labels back to your original data (optional)
Pca_for_R$cluster <- cluster_results$cluster
```
Key Tips
- Prioritize interpretability if you need to explain your clustering results—go with loading-based consolidation over secondary PCA.
- Aim to retain at least 70-80% of the total variance from the initial 6 PCs to ensure your clustering is robust.
- Test different numbers of final PCs (e.g., 3 vs 4) to see which gives the most meaningful clusters.
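To put the last two tips into practice, here's a quick check of both: the cumulative variance retained as you keep more PCs, and a silhouette comparison of clustering on 3 vs 4 PCs. The demo data and var_explained values below are illustrative stand-ins so the snippet runs on its own — substitute your own pc_scores_6 and var_explained from the earlier steps:

```r
library(cluster)  # ships with R; provides silhouette()

# Illustrative stand-ins for your pc_scores_6 / var_explained
set.seed(123)
pc_scores_6 <- matrix(rnorm(600), ncol = 6)
var_explained <- c(0.25, 0.18, 0.12, 0.10, 0.08, 0.07)

# Tip 2: cumulative variance retained if you keep the first k of the 6 PCs
cum_var <- cumsum(var_explained)
print(round(cum_var, 3))  # look for where this crosses ~0.70-0.80

# Tip 3: compare average silhouette width when clustering on 3 vs 4 PCs
sil_means <- c()
for (k_pcs in c(3, 4)) {
  scores <- pc_scores_6[, 1:k_pcs]
  km <- kmeans(scores, centers = 3, nstart = 25)
  sil <- silhouette(km$cluster, dist(scores))
  sil_means[as.character(k_pcs)] <- mean(sil[, 3])
}
print(round(sil_means, 3))  # higher mean silhouette = better-separated clusters
```

The option with the higher mean silhouette width gives tighter, better-separated clusters, which is a reasonable tiebreaker between 3 and 4 PCs.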
The question comes from Stack Exchange, asked by Jannet Philip.