While techniques like pruning and quantization modify existing large models to make them smaller and faster, another approach is to design network architectures that are efficient from the ground up. These architectures incorporate building blocks specifically engineered to minimize computational cost (measured in Floating Point Operations or FLOPs) and parameter count, making them suitable for deployment on mobile devices, embedded systems, and other environments with limited resources.
Core Principles of Efficient Architecture Design
The central idea behind many efficient architectures is to replace computationally expensive operations, such as standard convolutions, with cheaper alternatives that approximate their function effectively. Three fundamental techniques stand out:
- Depthwise Separable Convolutions: This is perhaps the most influential innovation in efficient CNN design, popularized by MobileNets. A standard 3×3 convolution operates across all input channels simultaneously to produce each output feature map. A depthwise separable convolution breaks this into two distinct steps:
  - Depthwise Convolution: A spatial convolution is applied independently to each input channel. With $C_{in}$ input channels, you use $C_{in}$ separate $K \times K$ filters (e.g., 3×3). This step filters spatial information within each channel but doesn't combine information across channels.
  - Pointwise Convolution: A 1×1 convolution is applied across the channels. This step takes the output of the depthwise convolution and combines the information across channels to produce the final $C_{out}$ output channels. It uses $C_{out}$ filters of size $1 \times 1 \times C_{in}$.
The computational savings are significant. For a standard convolution with $C_{in}$ input channels, $C_{out}$ output channels, filter size $K \times K$, and feature map size $H \times W$, the cost is approximately:

$$\text{Cost}_{\text{Standard}} \approx H \times W \times K \times K \times C_{in} \times C_{out}$$
For a depthwise separable convolution:
$$\text{Cost}_{\text{Depthwise}} \approx H \times W \times K \times K \times C_{in} \quad \text{(depthwise step)}$$

$$\text{Cost}_{\text{Pointwise}} \approx H \times W \times 1 \times 1 \times C_{in} \times C_{out} \quad \text{(pointwise step)}$$

$$\text{Cost}_{\text{Separable}} = \text{Cost}_{\text{Depthwise}} + \text{Cost}_{\text{Pointwise}} \approx H \times W \times C_{in} \times (K \times K + C_{out})$$
The cost relative to a standard convolution is roughly:

$$\frac{\text{Cost}_{\text{Separable}}}{\text{Cost}_{\text{Standard}}} \approx \frac{H \times W \times C_{in} \times (K \times K + C_{out})}{H \times W \times K \times K \times C_{in} \times C_{out}} = \frac{1}{C_{out}} + \frac{1}{K^2}$$

For typical values such as $K = 3$ and large $C_{out}$, this ratio is close to 1/9, i.e. an 8–9× reduction in cost.
Figure: Comparison between a standard convolution and a depthwise separable convolution. The separable version breaks the process into two computationally cheaper steps.
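To make the two-step structure concrete, here is a minimal PyTorch sketch of a depthwise separable block, plus a quick numeric check of the cost formulas above. The class name, channel counts, and feature-map size are illustrative choices, not taken from any particular library.

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """A 3x3 depthwise convolution followed by a 1x1 pointwise convolution."""
    def __init__(self, c_in, c_out, stride=1):
        super().__init__()
        # Depthwise step: groups=c_in gives one 3x3 filter per input channel.
        self.depthwise = nn.Conv2d(c_in, c_in, kernel_size=3, stride=stride,
                                   padding=1, groups=c_in, bias=False)
        # Pointwise step: 1x1 convolution mixes information across channels.
        self.pointwise = nn.Conv2d(c_in, c_out, kernel_size=1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

# Quick check of the cost formulas (illustrative sizes: 56x56 feature map).
H = W = 56
C_in, C_out, K = 128, 256, 3
cost_standard = H * W * K * K * C_in * C_out
cost_separable = H * W * C_in * (K * K + C_out)
print(f"standard:  {cost_standard:,} multiply-adds")
print(f"separable: {cost_separable:,} multiply-adds")
print(f"reduction: {cost_standard / cost_separable:.1f}x")  # ~8.7x here
```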
- Group Convolutions: Introduced initially in AlexNet to work around memory limitations, group convolutions divide the input channels into several groups. Convolutions are then performed independently within each group. If you divide $C_{in}$ channels into $G$ groups, each convolution operates on $C_{in}/G$ input channels to produce $C_{out}/G$ output channels; the group outputs are then concatenated. This reduces the parameter count and computation by a factor of $G$. Depthwise convolution is the extreme case where the number of groups equals the number of input channels ($G = C_{in}$). A combined sketch of group convolution and channel shuffling follows this list.
- Channel Shuffling: Used notably in ShuffleNets, this operation helps information flow between channel groups when using group convolutions. After a group convolution, the channels in the output feature map are "shuffled" or rearranged before being fed into the next group convolution. This ensures that the next layer can process information originating from different groups in the previous layer, mitigating the potential isolation caused by group convolutions.
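A compact PyTorch sketch of how group convolution and channel shuffling fit together; the group count, channel sizes, and the `channel_shuffle` helper are illustrative:

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups):
    """Rearrange channels so the next grouped convolution mixes all groups."""
    n, c, h, w = x.shape
    # (N, C, H, W) -> (N, G, C/G, H, W) -> swap group/channel axes -> flatten.
    x = x.view(n, groups, c // groups, h, w).transpose(1, 2).contiguous()
    return x.view(n, c, h, w)

groups = 4
# Pointwise group convolution: each group maps 64/4 = 16 input channels to
# 128/4 = 32 output channels, cutting parameters and FLOPs by a factor of 4.
group_conv = nn.Conv2d(64, 128, kernel_size=1, groups=groups, bias=False)

x = torch.randn(1, 64, 28, 28)
y = channel_shuffle(group_conv(x), groups)  # shape: (1, 128, 28, 28)
```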
Notable Efficient Architectures
MobileNets (V1, V2, V3)
The MobileNet family pioneered the large-scale use of depthwise separable convolutions.
- MobileNetV1: Directly replaced most standard convolutions with depthwise separable ones. Introduced width (channel count) and resolution multipliers as hyperparameters to easily trade off accuracy and latency/size.
- MobileNetV2: Introduced the inverted residual block with linear bottlenecks (sketched after this list). Residual connections, similar to those in ResNet, help gradient flow in deep networks. The "inverted" structure means the block first uses a 1×1 convolution to expand the channel dimension, applies the lightweight 3×3 depthwise convolution in the expanded space, and then uses a 1×1 linear convolution (without ReLU) to project back down to a narrow output. Keeping this projection linear was found to prevent information loss in the narrow layers.
- MobileNetV3: Incorporated ideas from Neural Architecture Search (NAS), Squeeze-and-Excitation (SE) modules (a form of channel attention), and an updated block structure (using h-swish activation) to further improve accuracy and efficiency. It comes in "Large" and "Small" versions targeting different resource constraints.
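A simplified sketch of the MobileNetV2-style inverted residual block described above. The expansion ratio, ReLU6 activations, and BatchNorm placement follow the general V2 recipe, while the Squeeze-and-Excitation modules and h-swish of V3 are omitted; exact layer details here are illustrative.

```python
import torch.nn as nn

class InvertedResidual(nn.Module):
    """Expand (1x1) -> depthwise (3x3) -> linear project (1x1), with a skip
    connection when input and output shapes match."""
    def __init__(self, c_in, c_out, stride=1, expand_ratio=6):
        super().__init__()
        hidden = c_in * expand_ratio
        self.use_skip = (stride == 1 and c_in == c_out)
        self.block = nn.Sequential(
            # 1x1 expansion into a wider representation.
            nn.Conv2d(c_in, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # Lightweight 3x3 depthwise convolution in the expanded space.
            nn.Conv2d(hidden, hidden, 3, stride=stride, padding=1,
                      groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # Linear 1x1 projection back down: no activation after this layer.
            nn.Conv2d(hidden, c_out, 1, bias=False),
            nn.BatchNorm2d(c_out),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_skip else out
```

The skip connection is applied only when the stride is 1 and the input and output channel counts match, so the shapes of `x` and `out` line up.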
ShuffleNets (V1, V2)
ShuffleNets focus on practical efficiency, taking into account factors beyond FLOPs, such as memory access cost (MAC).
- ShuffleNetV1: Utilized pointwise group convolutions and channel shuffling to reduce computational cost while maintaining information flow across channel groups.
- ShuffleNetV2: Proposed guidelines for practical efficient network design, arguing that minimizing FLOPs alone isn't sufficient. It suggested balancing channel widths, avoiding excessive group convolutions (due to increased MAC), and minimizing element-wise operations. The resulting architecture uses a channel split mechanism and carefully balances operations to achieve better speed in practice.
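The channel split idea can be sketched roughly as follows (stride-1 unit only, with BatchNorm and the separate downsampling unit omitted; an even channel count is assumed):

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups):
    """Same helper as in the group-convolution sketch above."""
    n, c, h, w = x.shape
    return (x.view(n, groups, c // groups, h, w)
             .transpose(1, 2).contiguous().view(n, c, h, w))

class ShuffleV2Unit(nn.Module):
    """Stride-1 ShuffleNetV2-style unit: split channels, transform one half
    with cheap convolutions, concatenate, then shuffle."""
    def __init__(self, channels):
        super().__init__()
        half = channels // 2
        self.branch = nn.Sequential(
            nn.Conv2d(half, half, 1, bias=False),                          # pointwise
            nn.ReLU(inplace=True),
            nn.Conv2d(half, half, 3, padding=1, groups=half, bias=False),  # depthwise
            nn.Conv2d(half, half, 1, bias=False),                          # pointwise
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        left, right = x.chunk(2, dim=1)        # channel split: one half passes through
        out = torch.cat([left, self.branch(right)], dim=1)
        return channel_shuffle(out, groups=2)  # mix the two halves for the next unit
```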
EfficientNet
EfficientNet introduced compound scaling. Instead of scaling network dimensions (depth, width, resolution) independently, it proposed a principled method to scale them jointly using a compound coefficient $\phi$. Starting from a good baseline architecture (EfficientNet-B0, found via NAS), it scales depth ($\alpha^{\phi}$), width ($\beta^{\phi}$), and resolution ($\gamma^{\phi}$) together, where $\alpha, \beta, \gamma$ are constants found via grid search such that $\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$. This balanced scaling allows EfficientNets to achieve state-of-the-art accuracy with significantly fewer parameters and FLOPs compared to previous models across a range of computational budgets (B0 to B7).
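As a small numeric illustration of compound scaling, using the $\alpha = 1.2$, $\beta = 1.1$, $\gamma = 1.15$ values reported in the EfficientNet paper; note that the official B1–B7 configurations round the resulting depths, widths, and resolutions somewhat differently:

```python
# Compound scaling: a single coefficient phi scales depth, width, and
# resolution jointly, with alpha * beta**2 * gamma**2 ≈ 2.
alpha, beta, gamma = 1.2, 1.1, 1.15

def compound_scale(phi, base_resolution=224):
    depth_mult = alpha ** phi                            # more layers
    width_mult = beta ** phi                             # more channels per layer
    resolution = round(base_resolution * gamma ** phi)   # larger input images
    return depth_mult, width_mult, resolution

for phi in range(4):
    d, w, r = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution ~{r}px")
```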
Design Approaches
When choosing or designing an efficient architecture, consider:
- Target Platform: CPU, GPU, mobile GPU, DSP, or specialized hardware (like TPUs, NPUs) have different performance characteristics. An architecture optimized for one might not be optimal for another (e.g., MAC might be more limiting than FLOPs on certain mobile hardware).
- Latency vs. Throughput: Is real-time inference speed critical (latency), or is processing large batches efficiently more important (throughput)?
- Accuracy Requirements: How much accuracy can be traded for efficiency? MobileNetV3-Small vs. MobileNetV3-Large or EfficientNet-B0 vs. EfficientNet-B7 represent different points on this trade-off curve.
"* Memory Bandwidth: Operations like channel shuffling or concatenation can be memory-intensive, impacting speed."
Designing efficient architectures is an active area of research. By understanding the building blocks like depthwise separable convolutions, group convolutions, and design principles like those used in MobileNets, ShuffleNets, and EfficientNet, you can better select or adapt models for deployment in resource-constrained scenarios, complementing the model compression techniques discussed earlier in this chapter.