Evaluation metrics for a multi-class (3-class) image-classification CNN: what other options are there beyond the ones I'm already using?
Great question! When you're working on 3-class image classification with CNNs, accuracy, confusion matrices, and loss plots are solid foundational tools, but there are tons of other metrics that can give you a much deeper understanding of how your model is performing—especially if your dataset is imbalanced, or if you care more about specific classes than overall correctness.
Let’s break down the most valuable ones, tailored to multi-class scenarios:
Per-Class Performance Metrics
These help you dig into how well your model handles each individual class, which accuracy alone can hide:
- Precision & Recall: For each class, precision answers: "When my model predicts this class, how often is it actually correct?" Recall asks: "How many actual instances of this class does my model correctly identify?" For 3 classes, you can compute these per class and then aggregate them (all of these show up in the scikit-learn sketch after this list):
- Macro average: Unweighted mean of the per-class scores, so every class counts equally regardless of its size (useful when minority classes matter as much as the majority).
- Micro average: Pools true positives, false positives, and false negatives across all classes before computing the metric; for single-label multi-class problems it equals overall accuracy, so it reflects overall correctness and is dominated by the larger classes.
- Weighted average: Mean of the per-class scores weighted by each class's number of samples, giving a single summary number that accounts for class imbalance.
- F1-Score: The harmonic mean of precision and recall, giving you a single number that balances both metrics. Like precision/recall, you can compute per-class F1 scores plus macro/micro/weighted averages—this is perfect for concise model comparisons when you need to balance false positives and false negatives.
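If you're in Python, scikit-learn computes all of the above in a couple of calls. A minimal sketch, assuming `y_true` and `y_pred` are 1-D arrays of integer class labels for your 3 classes (the label values and class names below are placeholders):

```python
# Per-class and averaged precision / recall / F1 with scikit-learn.
import numpy as np
from sklearn.metrics import classification_report, precision_recall_fscore_support

y_true = np.array([0, 0, 1, 1, 2, 2, 2, 1, 0, 2])   # placeholder ground-truth labels
y_pred = np.array([0, 1, 1, 1, 2, 0, 2, 1, 0, 2])   # placeholder model predictions

# Per-class precision, recall, F1, and the number of true samples per class.
prec, rec, f1, support = precision_recall_fscore_support(y_true, y_pred, labels=[0, 1, 2])

# Macro / micro / weighted aggregates of the same scores.
for avg in ("macro", "micro", "weighted"):
    p, r, f, _ = precision_recall_fscore_support(y_true, y_pred, average=avg)
    print(f"{avg:>8}: precision={p:.3f}  recall={r:.3f}  f1={f:.3f}")

# One call that prints all of the above as a readable table.
print(classification_report(y_true, y_pred, target_names=["class_0", "class_1", "class_2"]))
```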
Multi-Class-Specific Aggregate Metrics
These metrics account for the full complexity of 3-class predictions, avoiding the pitfalls of accuracy:
- Cohen's Kappa: This measures agreement between your model's predictions and true labels, adjusting for chance agreement (something accuracy ignores). It’s invaluable for imbalanced datasets where a model could get high accuracy just by guessing the majority class. Kappa ranges from -1 (complete disagreement) to 1 (perfect agreement), with 0 meaning no better than random.
- Matthews Correlation Coefficient (MCC): Similar to Kappa but often preferred for multi-class tasks. It uses all elements of the confusion matrix to produce a score between -1 and 1: 1 = perfect prediction, 0 = no better than random guessing, -1 = complete disagreement. It's notably robust to class imbalance, making it a go-to for many real-world multi-class problems (both Kappa and MCC are one-liners in scikit-learn; see the sketch after this list).
- Top-K Accuracy: If your model outputs class probabilities, top-k accuracy checks how often the true label falls within the top k most likely predictions. For example, top-2 accuracy counts a prediction as correct if the true class is either the first or second highest-probability output. This is great if your use case allows for near-misses (like image search or recommendation systems).
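A minimal sketch of all three, again with scikit-learn (top_k_accuracy_score needs a reasonably recent version), assuming `y_true` holds integer labels and `y_proba` holds your CNN's softmax outputs with shape (n_samples, 3); the values below are placeholders:

```python
# Chance-corrected agreement and probability-ranked accuracy with scikit-learn.
import numpy as np
from sklearn.metrics import cohen_kappa_score, matthews_corrcoef, top_k_accuracy_score

y_true = np.array([0, 1, 2, 2, 1, 0])                 # placeholder ground-truth labels
y_proba = np.array([                                   # placeholder softmax outputs
    [0.7, 0.2, 0.1],
    [0.1, 0.5, 0.4],
    [0.2, 0.3, 0.5],
    [0.4, 0.3, 0.3],
    [0.3, 0.6, 0.1],
    [0.5, 0.3, 0.2],
])
y_pred = y_proba.argmax(axis=1)                        # hard labels for Kappa / MCC

print("Cohen's kappa: ", cohen_kappa_score(y_true, y_pred))
print("MCC:           ", matthews_corrcoef(y_true, y_pred))
print("Top-2 accuracy:", top_k_accuracy_score(y_true, y_proba, k=2, labels=[0, 1, 2]))
```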
Probability-Focused Metrics
These metrics leverage the model's predicted probabilities (not just hard class labels) to assess confidence and calibration:
- Log Loss (Cross-Entropy Loss): You’re already plotting training/validation loss, but log loss specifically quantifies how well the model’s probability estimates match the true labels. It penalizes confident wrong predictions more heavily, so it can reveal if your model is overconfident when it’s wrong—something accuracy can’t tell you.
- Brier Score: The mean squared difference between predicted probabilities and the one-hot encoded true labels. It's easier to interpret than log loss because it's bounded: 0 is a perfect prediction, and the worst possible value is 1 if you average over classes (or 2 in the classic multi-class formulation that sums over classes). It tells you how close your model's probability estimates are to the actual outcomes, which is critical if you need reliable confidence scores. Both metrics are shown in the sketch below.
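A minimal sketch for both, assuming the same `y_true` / `y_proba` layout as above; `log_loss` comes from scikit-learn, while the multi-class Brier score is a one-liner against the one-hot labels (averaged over classes here so it stays in [0, 1]):

```python
# Probability-quality metrics: log loss and a hand-rolled multi-class Brier score.
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([0, 1, 2, 2])                        # placeholder labels
y_proba = np.array([                                   # placeholder softmax outputs
    [0.8, 0.1, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.2, 0.7],
    [0.5, 0.4, 0.1],
])

print("log loss:", log_loss(y_true, y_proba, labels=[0, 1, 2]))

one_hot = np.eye(3)[y_true]                            # one-hot encode the true labels
brier = np.mean((y_proba - one_hot) ** 2)              # mean over samples and classes
print("Brier score (class-averaged):", brier)
```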
Visualization Tools (Beyond Confusion Matrices)
Visuals can reveal patterns numerical metrics miss:
- One-vs-Rest ROC Curves & AUC-ROC: For multi-class, plot ROC curves by treating each class as the "positive" class against the other two. AUC-ROC for each class measures how well the model can distinguish that class from the rest. You can also compute macro/micro-average AUC for an overall performance summary.
- Precision-Recall Curves & AUC-PR: These are more informative than ROC curves for imbalanced datasets. For each class, plot precision against recall at different probability thresholds, then calculate AUC-PR (average precision) to summarize the curve; a higher AUC-PR means better performance for that class. Both kinds of AUC are computed in the sketch after this list.
- Class Activation Maps (CAMs): While not a numerical metric, CAMs visualize which parts of the image the model uses to make predictions. This helps you verify if your model is learning meaningful features (e.g., focusing on a dog’s face instead of the grass) or relying on spurious patterns.
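For the curve-based summaries, a minimal scikit-learn sketch, assuming the same `y_true` / `y_proba` layout as before (the probabilities here are randomly generated placeholders):

```python
# One-vs-rest ROC AUC and PR AUC (average precision) per class, plus a macro average.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.preprocessing import label_binarize

y_true = np.array([0, 1, 2, 2, 1, 0, 0, 2])            # placeholder labels
y_proba = np.random.default_rng(0).dirichlet(np.ones(3), size=8)  # placeholder probabilities

y_bin = label_binarize(y_true, classes=[0, 1, 2])       # one-hot matrix, shape (n_samples, 3)

for c in range(3):                                       # treat each class as "positive" vs. the rest
    roc = roc_auc_score(y_bin[:, c], y_proba[:, c])
    pr = average_precision_score(y_bin[:, c], y_proba[:, c])
    print(f"class {c}: ROC AUC={roc:.3f}  PR AUC={pr:.3f}")

# Macro-averaged one-vs-rest ROC AUC in a single call.
print("macro OvR ROC AUC:", roc_auc_score(y_true, y_proba, multi_class="ovr", average="macro"))
```

And for the activation maps, a minimal Grad-CAM-style sketch (I'm assuming PyTorch here since you didn't say which framework you use; the torchvision ResNet-18 and its `layer4` block are just stand-ins for your own CNN and its last convolutional layer):

```python
# Grad-CAM-style heatmap: weight the last conv block's activations by the
# spatially averaged gradients of the target class score.
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet18(num_classes=3)                  # stand-in for your 3-class CNN
model.eval()

store = {}

def save_activations(module, inputs, output):
    store["acts"] = output                               # feature maps of the hooked block
    output.register_hook(lambda grad: store.update(grads=grad))  # grads w.r.t. those maps

model.layer4.register_forward_hook(save_activations)    # last conv block of ResNet-18

def grad_cam(image, target_class):
    """image: (1, 3, H, W) tensor; returns a heatmap over the conv feature map."""
    logits = model(image)
    model.zero_grad()
    logits[0, target_class].backward()
    acts, grads = store["acts"][0], store["grads"][0]    # each has shape (C, h, w)
    weights = grads.mean(dim=(1, 2))                     # global-average-pool the gradients
    cam = F.relu((weights[:, None, None] * acts).sum(dim=0))
    return (cam / (cam.max() + 1e-8)).detach()           # normalize to [0, 1]

heatmap = grad_cam(torch.randn(1, 3, 224, 224), target_class=0)
print(heatmap.shape)                                     # e.g. 7x7 for a 224x224 input
```

Upsample the heatmap to the input resolution (e.g. with F.interpolate) and overlay it on the image to see which regions the model is actually attending to.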
Quick Practical Tip
Always align your metrics with your use case:
- If you’re detecting a rare class (like a rare disease), prioritize recall, F1-score, and MCC.
- If you need your model to be confidently correct, use log loss or Brier score.
- If your application allows for top-2 guesses, add top-k accuracy to your evaluation.
This question was originally asked on Stack Exchange by Mamen.




