
Measuring semantic similarity with image features from pre-trained models (e.g., VGG, ResNet), and how to improve VGG-16 feature clustering

Great question! Let's break this down step by step:

Can VGG-16 extract semantic information from images?

Absolutely. VGG-16 (and most pre-trained CNNs) learns hierarchical feature representations that capture semantic meaning at different levels:

  • Lower layers (early convolutional blocks) focus on low-level visual details like edges, textures, and basic shapes.
  • Higher layers (later conv blocks and early classifier layers) encode mid-to-high level semantic signals, such as object parts (e.g., "car wheel", "cat ear") and abstract category-related concepts.

So yes, it’s fully capable of extracting features that carry meaningful semantic information — which makes it a valid choice for unlabeled image clustering.

Why your clustering isn’t working well, and actionable fixes

Your core approach is on the right track, but there are several tweaks to refine your workflow. Here are concrete suggestions to improve clustering performance:

  • Pick the right feature layer for clustering
    You’re using the output of a fine-tuned classifier head, but this might not be optimal. Fine-tuning the classifier (even just adjusting its output dimension) can introduce task-specific biases that dilute general semantic features. Instead:

    • Try using the raw flattened output of the vgg.features block (25088 dimensions) — this is the unmodified hierarchical feature representation from the pre-trained model.
    • Or use the output of the first four modules of the original pre-trained classifier (Linear → ReLU → Dropout → Linear, i.e. everything before the final classification head), which yields 4096-dim features:
      # Freeze pre-trained weights to avoid unintended modifications
      for param in vgg.parameters():
          param.requires_grad = False
      
      # Switch to eval mode so the Dropout layers in the classifier are
      # disabled, then extract the 4096-dim output of classifier modules 0-3
      vgg.eval()
      with torch.no_grad():
          features = vgg.features(X).view(X.shape[0], -1)
          features = vgg.classifier[:4](features)
      

    These pre-trained intermediate features are more general and better suited for capturing semantic similarity across unlabeled data.

  • Fix input preprocessing to match VGG’s training setup
    VGG-16 was trained with specific input normalization — skipping this leads to inconsistent feature values that hurt clustering. Apply this preprocessing to your resized CIFAR-10 images:

    from torchvision import transforms
    
    preprocess = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])
    
  • Normalize features before clustering
    Magnitude differences between feature vectors can dominate Euclidean-based metrics (the kind K-Means relies on). Apply L2 normalization so every vector lies on the unit sphere, which also makes Euclidean distance agree with cosine similarity:

    import torch.nn.functional as F
    
    features = F.normalize(features, p=2, dim=1)
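To see why this matters for K-Means: on unit vectors, squared Euclidean distance and cosine similarity are related by a fixed identity, so the Euclidean objective orders pairs exactly as cosine would. A small check:

```python
import torch
import torch.nn.functional as F

a = torch.randn(6, 128)
b = torch.randn(6, 128)
an, bn = F.normalize(a, p=2, dim=1), F.normalize(b, p=2, dim=1)

# On the unit sphere: ||a - b||^2 = 2 - 2 * cos(a, b)
cos = (an * bn).sum(dim=1)
sq_dist = ((an - bn) ** 2).sum(dim=1)
print(torch.allclose(sq_dist, 2 - 2 * cos, atol=1e-5))  # True
```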
    
  • Reduce feature dimensionality
    4096-dimensional features suffer from the curse of dimensionality, which makes clustering algorithms less effective. Use PCA to trim features to a manageable size (e.g., 256 or 512 dimensions):

    from sklearn.decomposition import PCA
    
    pca = PCA(n_components=256)
    features_pca = pca.fit_transform(features.detach().numpy())
    

    Use the reduced-dimension features for clustering afterward.
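It is also worth checking how much variance the chosen number of components retains; if the fraction is low, increase n_components. A self-contained sketch on random stand-in features (real VGG features will typically retain more variance than this noise does):

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy stand-in for 500 extracted 4096-dim feature vectors
rng = np.random.default_rng(0)
feats = rng.standard_normal((500, 4096)).astype(np.float32)

pca = PCA(n_components=256)
feats_pca = pca.fit_transform(feats)

# Fraction of total variance kept by the 256 components
retained = pca.explained_variance_ratio_.sum()
print(feats_pca.shape, float(retained))
```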

  • Tune your clustering algorithm

    • If using K-Means, set n_clusters=10 (to match CIFAR-10’s class count) and use init='k-means++' (the default in scikit-learn) for better cluster initialization.
    • Be deliberate about distance metrics: scikit-learn’s K-Means only supports Euclidean distance, but on L2-normalized features its objective ranks pairs the same way cosine similarity does.
    • For datasets with unclear cluster counts, try density-based algorithms like DBSCAN — though K-Means is typically more effective for CIFAR-10’s distinct class clusters.
  • Skip unnecessary fine-tuning
    Since you’re working with unlabeled data, fine-tuning the classifier head doesn’t help and may introduce noise. Stick to pre-trained weights for feature extraction unless you plan to use self-supervised learning to pre-tune on the unlabeled dataset first.


The question above comes from Stack Exchange, asked by user Karl.
