How to Calculate Convolution Output Depth and Determine the Number of Output Feature Maps (Including a Question About a Formula)

阿华AIGC实验室

2026-5-15

Convolution Output Depth Questions Answered

Hey there! Let's break down your questions about convolutional neural network output depth clearly and straightforwardly:

1. How to calculate the depth of a convolution output?

The depth of your convolution output (also called the number of feature maps) depends on the type of convolution you're using:

Standard Convolution: This is the most common type. The output depth is exactly equal to the number of filters (kernels) you define for the convolution layer. For example, if you apply 64 filters of size 3x3 to a 3-channel RGB image, the output depth is 64. Each filter spans all 3 input channels (its full shape is 3x3x3) and sums across them into a single feature map, so the input depth does not affect the output depth.
Special Convolution Variants:
- Grouped Convolution: If you split the input depth D into G groups (D must be divisible by G) and use K filters per group, each filter sees only the D / G channels of its own group, and the output depth becomes G * K.
- Depthwise Separable Convolution: The first (depthwise) step outputs a depth equal to the input depth D; the second (pointwise) step's output depth equals the number of pointwise filters you use.
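
The rules above can be written down as a few small helpers. This is a minimal, framework-free sketch: the function names are made up for illustration and are not part of PyTorch or TensorFlow.

```python
from typing import Tuple


def standard_conv_depth(num_filters: int) -> int:
    """Standard convolution: one feature map per filter; input depth is irrelevant."""
    return num_filters


def grouped_conv_depth(input_depth: int, groups: int, filters_per_group: int) -> int:
    """Grouped convolution: G groups with K filters each give G * K feature maps."""
    if input_depth % groups != 0:
        raise ValueError("input depth D must be divisible by the number of groups G")
    return groups * filters_per_group


def depthwise_separable_depths(input_depth: int, pointwise_filters: int) -> Tuple[int, int]:
    """Return (depth after the depthwise step, depth after the pointwise step)."""
    return input_depth, pointwise_filters


print(standard_conv_depth(64))              # 64 filters applied to an RGB image
print(grouped_conv_depth(32, 4, 16))        # D = 32, G = 4, K = 16
print(depthwise_separable_depths(32, 128))  # D = 32, 128 pointwise filters
```

Note that `standard_conv_depth` does not even take the input depth as an argument, which mirrors the point made for standard convolutions.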

2. How to get the number of feature maps (i.e., depth H) from a convolution output?

The number of feature maps is just the output depth value, and you can get it in two ways:

In Code: If you're using frameworks like PyTorch or TensorFlow, check the shape of the output tensor. For PyTorch (which uses [batch_size, channels, height, width]), you'd use output.shape[1] to get the feature map count. For TensorFlow (default [batch_size, height, width, channels]), it's output.shape[-1].
Manual Calculation: Follow the rules from the first question—for standard convolutions, it's just the number of filters; for special types, use their respective formulas.
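
To make the indexing concrete without depending on either framework, here is a small sketch in which plain tuples stand in for `output.shape`. The helper `feature_map_count` and the example shapes are hypothetical; only the two layouts come from the text above.

```python
def feature_map_count(shape, layout="NCHW"):
    """Return the channel count of a 4-D shape for a layout such as NCHW or NHWC."""
    if len(shape) != 4 or sorted(layout) != sorted("NCHW"):
        raise ValueError("expected a 4-D shape and a layout made of N, C, H, W")
    # The position of "C" in the layout string is the channel axis.
    return shape[layout.index("C")]


pytorch_style_shape = (8, 64, 32, 32)     # [batch_size, channels, height, width]
tensorflow_style_shape = (8, 32, 32, 64)  # [batch_size, height, width, channels]

print(feature_map_count(pytorch_style_shape, "NCHW"))     # same value as shape[1]
print(feature_map_count(tensorflow_style_shape, "NHWC"))  # same value as shape[-1]
```

Both calls report the same number of feature maps; only the axis on which it is stored differs.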

About the formula `H = D * number of filters`

First off, this formula doesn't hold for standard convolutions—as we covered, standard convolution output depth only depends on the number of filters, not the input depth D.
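
A quick way to convince yourself of this is a naive convolution written with nested lists. The sketch below assumes stride 1, no padding, and all-ones inputs and weights; `conv2d_naive` and `make_volume` are illustrative helpers, not framework functions. It runs the same 4 filters over inputs of depth 1, 3, and 16.

```python
def conv2d_naive(image, filters):
    """Stride-1, no-padding convolution (cross-correlation) on nested lists.

    image:   [D][H][W]
    filters: [F][D][k][k]
    returns: [F][H - k + 1][W - k + 1]
    """
    depth = len(image)
    height, width = len(image[0]), len(image[0][0])
    output = []
    for filt in filters:
        if len(filt) != depth:
            raise ValueError("each filter must span all input channels")
        k = len(filt[0])
        feature_map = []
        for i in range(height - k + 1):
            row = []
            for j in range(width - k + 1):
                total = 0.0
                # Sum over every input channel: D channels collapse into one value.
                for c in range(depth):
                    for di in range(k):
                        for dj in range(k):
                            total += image[c][i + di][j + dj] * filt[c][di][dj]
                row.append(total)
            feature_map.append(row)
        output.append(feature_map)  # one feature map per filter
    return output


def make_volume(depth, size, value=1.0):
    """Build a depth x size x size volume filled with a constant."""
    return [[[value] * size for _ in range(size)] for _ in range(depth)]


for input_depth in (1, 3, 16):
    image = make_volume(input_depth, 5)
    filters = [make_volume(input_depth, 3) for _ in range(4)]
    out = conv2d_naive(image, filters)
    print("input depth", input_depth, "-> output depth", len(out))
```

The output depth stays at 4 in every run because the loop over `c` folds all input channels into a single number per output position.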

That said, this formula might come from a specific context or a typo in a learning resource:

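It could refer to a per-channel (depthwise-style) convolution in which each of the D input channels gets its own set of kernels. If "number of filters" means filters per input channel, the output depth really is D * number of filters. This is grouped convolution with G = D: TensorFlow's tf.nn.depthwise_conv2d exposes it through a channel multiplier, and PyTorch's Conv2d documentation describes it as groups equal to in_channels with out_channels = K * in_channels.
Alternatively, it might be counting something other than output depth. A standard layer over D input channels holds D * number of filters two-dimensional kernel slices, yet it still produces only one feature map per filter.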
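
One reading under which the formula does hold follows directly from the grouped-convolution rule in question 1: set G = D, so that every input channel forms its own group with K filters. The sketch below checks that case. It assumes that "number of filters" in the formula means filters per input channel, which the source image may or may not intend; the function names are hypothetical.

```python
def grouped_conv_depth(input_depth, groups, filters_per_group):
    """Output depth of a grouped convolution: G * K."""
    if input_depth % groups != 0:
        raise ValueError("input depth D must be divisible by the number of groups G")
    return groups * filters_per_group


def per_channel_conv_depth(input_depth, filters_per_channel):
    """Special case G = D: each input channel is its own group."""
    return grouped_conv_depth(
        input_depth, groups=input_depth, filters_per_group=filters_per_channel
    )


D, K = 3, 4  # assumed example: RGB input, 4 filters per channel
H = per_channel_conv_depth(D, K)
print(H, H == D * K)
```

Under this assumption H = D * K, whereas a standard convolution with the same K filters would give H = K.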

Since the formula comes from an image I can't see, it's hard to pin down its exact origin. Most likely it describes one specific convolution setup rather than the general case.

The question discussed here comes from Stack Exchange and was asked by Sakhri Houssem.