Why Is MobileNet V3 Faster Than V2? A Technical Question About Using It as an SSD Backbone

阿华AIGC实验室 (Ahua AIGC Lab)

2026-5-13

Why Is MobileNet V3 Faster Than V2 Even With h-swish and SE Modules?

Great question. This is a common point of confusion: it’s easy to fixate on individual modules like h-swish and SE, but MobileNet V3’s speedup comes from changes to the architecture as a whole, not from bolting those components onto V2. Let’s break this down:

Targeted pruning & reshaping of V2’s core blocks
MobileNet V3 didn’t just add new modules to V2. Its block layout came out of a hardware-aware architecture search, after which the authors hand-tuned the layers that were costing the most latency. The result keeps V2’s inverted residual block as the basic unit but adjusts channel counts, expansion sizes, and kernel sizes block by block to balance compute against feature quality, and it slims down the expensive first and last layers. For your SSD use case (where you’re using the backbone’s lower, feature-rich layers), the savings in the early and middle blocks are the ones that matter, since they directly reduce the number of operations needed to generate the features you care about.
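To see why channel counts matter so much, here is a rough multiply-add (MAdd) count for a standard convolution and for an inverted residual block. This is only a back-of-the-envelope sketch: the function names are mine, biases, activations and SE are ignored, and the 32-vs-16 stem filters and the block sizes are illustrative numbers, not a reproduction of the paper’s measurements.

```python
def conv_madds(in_size, in_ch, out_ch, kernel, stride):
    """Multiply-adds of a standard 'same'-padded conv on a square input."""
    out_size = -(-in_size // stride)  # ceiling division
    return out_size * out_size * kernel * kernel * in_ch * out_ch


def inverted_residual_madds(size, in_ch, exp_ch, out_ch, kernel, stride):
    """Multiply-adds of expand (1x1) -> depthwise (k x k) -> project (1x1)."""
    out_size = -(-size // stride)
    expand = size * size * in_ch * exp_ch                      # 1x1 at input resolution
    depthwise = out_size * out_size * kernel * kernel * exp_ch  # one filter per channel
    project = out_size * out_size * exp_ch * out_ch             # 1x1 at output resolution
    return expand + depthwise + project


# Stem conv on a 224x224 RGB image, 3x3 kernel, stride 2 (illustrative sizes).
stem_32 = conv_madds(224, 3, 32, 3, 2)
stem_16 = conv_madds(224, 3, 16, 3, 2)
print("stem with 32 filters:", stem_32)
print("stem with 16 filters:", stem_16)

# The same block at 14x14 resolution with two different expansion sizes.
wide = inverted_residual_madds(14, 80, 480, 112, 3, 1)
narrow = inverted_residual_madds(14, 80, 240, 112, 3, 1)
print("block, expansion 480:", wide)
print("block, expansion 240:", narrow)
```

Every term scales linearly with the expansion width, so halving it halves the block’s cost. That is why per-block tuning of channel counts buys more speed than the activation or SE choice costs.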
h-swish is built for mobile hardware, not just "faster than swish"
You’re right that h-swish looks more complex than ReLU on paper, but it’s designed to play nice with mobile GPUs/NPUs:
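- It’s defined as x * ReLU6(x + 3) / 6. (ReLU6(x + 3) / 6 on its own is the hard sigmoid; h-swish multiplies that by x.) ReLU6 is a heavily optimized operator on most mobile chips, so the whole activation is a clamp, a multiply, and a constant scale, which inference runtimes can often fold into the preceding convolution rather than run as a separate pass.
- Unlike swish (which uses a sigmoid, expensive on mobile), h-swish avoids transcendental functions entirely. The network also applies it only in the deeper layers, where feature maps are small, so its runtime overhead is tiny and often negligible compared to the accuracy boost it gives.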
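For reference, here are the two activations side by side in plain Python. It’s a minimal scalar sketch to show the arithmetic involved, not how a mobile runtime implements them, and the helper names are my own.

```python
import math


def relu6(x):
    """Clamp x to the range [0, 6]."""
    return min(max(x, 0.0), 6.0)


def hard_sigmoid(x):
    """Piecewise-linear stand-in for the sigmoid: ReLU6(x + 3) / 6."""
    return relu6(x + 3.0) / 6.0


def h_swish(x):
    """h-swish: x * ReLU6(x + 3) / 6. Only add, clamp, multiply."""
    return x * hard_sigmoid(x)


def swish(x):
    """swish: x * sigmoid(x). Needs an exponential."""
    return x / (1.0 + math.exp(-x))


for x in (-4.0, -3.0, -1.0, 0.0, 1.0, 3.0, 4.0):
    print(f"x={x:+.1f}  h-swish={h_swish(x):+.4f}  swish={swish(x):+.4f}")
```

The two curves stay close across the whole range, which is the point: h-swish keeps most of swish’s behavior while using only operations that are cheap on mobile hardware.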
SE modules are stripped-down and mobile-optimized
The SE blocks in V3 aren’t the full-fat versions you might see in other networks:
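- The bottleneck inside each SE block is kept narrow (a reduction ratio of 4, i.e. a quarter of the channels it gates), the gate uses a hard sigmoid instead of a real sigmoid, and SE is added only to some blocks rather than all of them. That keeps parameter count and computation minimal.
- The channel-wise pooling and scaling steps are highly parallelizable on mobile hardware, and the two fully connected layers operate on a single vector per image rather than on a full feature map, so they add little latency. Plus, the accuracy gain from SE lets the network use slightly smaller channel counts elsewhere, offsetting the minor overhead.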
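Here’s what such a lightweight SE block computes, as a small pure-Python sketch. The tensor layout, the random weights, and names like `squeeze_excite` and `reduction` are my own choices for illustration; a real implementation would use a tensor library.

```python
import random


def relu(x):
    return max(x, 0.0)


def hard_sigmoid(x):
    return min(max(x + 3.0, 0.0), 6.0) / 6.0


def dense(vec, weights, bias):
    """Fully connected layer; weights is laid out as [out][in]."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def squeeze_excite(fmap, w1, b1, w2, b2):
    """fmap is [C][H][W] nested lists; returns the channel-rescaled map."""
    # Squeeze: global average pool reduces each channel to one scalar.
    pooled = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in fmap]
    # Excite: narrow FC -> ReLU -> FC -> hard-sigmoid gate in [0, 1].
    hidden = [relu(h) for h in dense(pooled, w1, b1)]
    gate = [hard_sigmoid(g) for g in dense(hidden, w2, b2)]
    # Scale: multiply every value in a channel by that channel's gate.
    return [[[v * g for v in row] for row in ch] for ch, g in zip(fmap, gate)]


random.seed(0)
C, H, W, reduction = 8, 4, 4, 4   # illustrative sizes
R = C // reduction                # width of the SE bottleneck
fmap = [[[random.uniform(-1, 1) for _ in range(W)] for _ in range(H)]
        for _ in range(C)]
w1 = [[random.uniform(-1, 1) for _ in range(C)] for _ in range(R)]
w2 = [[random.uniform(-1, 1) for _ in range(R)] for _ in range(C)]
out = squeeze_excite(fmap, w1, [0.0] * R, w2, [0.0] * C)

se_fc_madds = 2 * C * R           # two small FC layers, independent of H and W
print("output shape:", len(out), len(out[0]), len(out[0][0]))
print("FC multiply-adds in the SE block:", se_fc_madds)
```

Note that the fully connected cost doesn’t depend on the spatial size at all, whereas every convolution in the block is multiplied by H * W. That’s the main reason SE stays cheap.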
Detection-specific optimizations in the paper’s benchmarks
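The detection comparison in the paper doesn’t use the ImageNet classifier as-is. V3 is plugged in as the feature extractor for SSDLite with the classification top layers removed (which you’re already doing), and the paper also reports a variant in which the channel counts of the last backbone blocks are halved, on the reasoning that a network sized for 1000 ImageNet classes is wider than a detection dataset with far fewer classes needs. That makes the backbone more efficient at generating the multi-scale features SSD needs, something MobileNet V2 wasn’t explicitly optimized for.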
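If you’re wiring this up yourself, the practical question is which backbone layers to tap. A common SSD-style choice is the last layer at output stride 16 and the last layer at output stride 32. The sketch below finds those indices from a list of per-layer strides; the stride list is a made-up toy backbone, not the real MobileNet V3 layer table, and the function names are hypothetical.

```python
def output_strides(layer_strides):
    """Cumulative output stride after each layer."""
    total, result = 1, []
    for s in layer_strides:
        total *= s
        result.append(total)
    return result


def last_layer_at_stride(cumulative, target):
    """Index of the last layer whose output stride equals target."""
    matches = [i for i, s in enumerate(cumulative) if s == target]
    if not matches:
        raise ValueError(f"no layer with output stride {target}")
    return matches[-1]


# Toy backbone: the per-layer strides below are an assumption for illustration.
toy_backbone = [2, 1, 2, 1, 2, 1, 1, 2, 1, 1, 2, 1]
cumulative = output_strides(toy_backbone)

tap_16 = last_layer_at_stride(cumulative, 16)
tap_32 = last_layer_at_stride(cumulative, 32)
print("cumulative strides:", cumulative)
print("tap for stride 16: layer", tap_16)
print("tap for stride 32: layer", tap_32)
```

Everything after the stride-32 tap (pooling, the final classifier convolutions) can be dropped for detection, and only the blocks up to those two taps count toward the latency you actually pay.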

In short, you weren’t missing a single trick. You just needed to look beyond individual modules to the full architectural overhaul. The mix of trimmed blocks, hardware-aware operator design, and task-specific tuning makes V3 faster than V2 even with those "accuracy-focused" modules.

The question in this post comes from Stack Exchange, where it was asked by Jefferson Chiu.