While techniques like pruning and quantization modify existing large models to make them smaller and faster, another approach is to design network architectures that are efficient from the ground up. These architectures incorporate building blocks specifically engineered to minimize computational cost (measured in Floating Point Operations or FLOPs) and parameter count, making them suitable for deployment on mobile devices, embedded systems, and other environments with limited resources.
Core Principles of Efficient Architecture Design
The central idea behind many efficient architectures is to replace computationally expensive operations, such as standard convolutions, with cheaper alternatives that approximate their function effectively. Three fundamental techniques stand out:
- Depthwise Separable Convolutions: This is perhaps the most influential innovation in efficient CNN design, popularized by MobileNets. A standard 3×3 convolution operates across all input channels simultaneously to produce each output feature map. A depthwise separable convolution breaks this into two distinct steps:
  - Depthwise Convolution: A spatial convolution is applied independently to each input channel. With $C_{in}$ input channels, you use $C_{in}$ separate $K \times K$ filters (e.g., 3×3). This step filters spatial information within each channel but doesn't combine information across channels.
  - Pointwise Convolution: A 1×1 convolution is applied across the channels. This step takes the output of the depthwise convolution and combines the information across channels to produce the final $C_{out}$ output channels. It uses $C_{out}$ filters of size $1 \times 1 \times C_{in}$.
The computational savings are significant. For a standard convolution with $C_{in}$ input channels, $C_{out}$ output channels, filter size $K \times K$, and feature map size $H \times W$, the cost is approximately:

$$\text{Cost}_{\text{Standard}} \approx H \times W \times K \times K \times C_{in} \times C_{out}$$
For a depthwise separable convolution:
$$\text{Cost}_{\text{Depthwise}} \approx H \times W \times K \times K \times C_{in} \quad \text{(depthwise step)}$$

$$\text{Cost}_{\text{Pointwise}} \approx H \times W \times 1 \times 1 \times C_{in} \times C_{out} \quad \text{(pointwise step)}$$

$$\text{Cost}_{\text{Separable}} = \text{Cost}_{\text{Depthwise}} + \text{Cost}_{\text{Pointwise}} \approx H \times W \times C_{in} \times (K \times K + C_{out})$$
The cost relative to a standard convolution is roughly:

$$\frac{\text{Cost}_{\text{Separable}}}{\text{Cost}_{\text{Standard}}} \approx \frac{H \times W \times C_{in} \times (K \times K + C_{out})}{H \times W \times K \times K \times C_{in} \times C_{out}} = \frac{1}{C_{out}} + \frac{1}{K^2}$$

For typical values such as $K = 3$ and large $C_{out}$, this ratio is close to 1/9, i.e. an 8–9× reduction in cost.
Figure: Comparison between a standard convolution and a depthwise separable convolution. The separable version breaks the process into two computationally cheaper steps.
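To make the two-step structure concrete, here is a minimal PyTorch sketch of a depthwise separable block, plus a quick numeric check of the cost formulas above. The class name, channel counts, and feature-map size are illustrative choices, not taken from any particular library.

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """A 3x3 depthwise convolution followed by a 1x1 pointwise convolution."""
    def __init__(self, c_in, c_out, stride=1):
        super().__init__()
        # Depthwise step: groups=c_in gives one 3x3 filter per input channel.
        self.depthwise = nn.Conv2d(c_in, c_in, kernel_size=3, stride=stride,
                                   padding=1, groups=c_in, bias=False)
        # Pointwise step: 1x1 convolution mixes information across channels.
        self.pointwise = nn.Conv2d(c_in, c_out, kernel_size=1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

# Quick check of the cost formulas (illustrative sizes: 56x56 feature map).
H = W = 56
C_in, C_out, K = 128, 256, 3
cost_standard = H * W * K * K * C_in * C_out
cost_separable = H * W * C_in * (K * K + C_out)
print(f"standard:  {cost_standard:,} multiply-adds")
print(f"separable: {cost_separable:,} multiply-adds")
print(f"reduction: {cost_standard / cost_separable:.1f}x")  # ~8.7x here
```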
- Group Convolutions: Introduced initially in AlexNet to work around memory limitations, group convolutions divide the input channels into several groups. Convolutions are then performed independently within each group. If you divide $C_{in}$ channels into $G$ groups, each convolution operates on $C_{in}/G$ input channels to produce $C_{out}/G$ output channels; the group outputs are then concatenated. This reduces the parameter count and computation by a factor of $G$. Depthwise convolution is the extreme case where the number of groups equals the number of input channels ($G = C_{in}$). A combined sketch of group convolution and channel shuffling follows this list.
- Channel Shuffling: Used notably in ShuffleNets, this operation helps information flow between channel groups when using group convolutions. After a group convolution, the channels in the output feature map are "shuffled" or rearranged before being fed into the next group convolution. This ensures that the next layer can process information originating from different groups in the previous layer, mitigating the potential isolation caused by group convolutions.
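A compact PyTorch sketch of how group convolution and channel shuffling fit together; the group count, channel sizes, and the `channel_shuffle` helper are illustrative:

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups):
    """Rearrange channels so the next grouped convolution mixes all groups."""
    n, c, h, w = x.shape
    # (N, C, H, W) -> (N, G, C/G, H, W) -> swap group/channel axes -> flatten.
    x = x.view(n, groups, c // groups, h, w).transpose(1, 2).contiguous()
    return x.view(n, c, h, w)

groups = 4
# Pointwise group convolution: each group maps 64/4 = 16 input channels to
# 128/4 = 32 output channels, cutting parameters and FLOPs by a factor of 4.
group_conv = nn.Conv2d(64, 128, kernel_size=1, groups=groups, bias=False)

x = torch.randn(1, 64, 28, 28)
y = channel_shuffle(group_conv(x), groups)  # shape: (1, 128, 28, 28)
```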
Notable Efficient Architectures
MobileNets (V1, V2, V3)
The MobileNet family pioneered the large-scale use of depthwise separable convolutions.
- MobileNetV1: Directly replaced most standard convolutions with depthwise separable ones. Introduced width (channel count) and resolution multipliers as hyperparameters to easily trade off accuracy and latency/size.
- MobileNetV2: Introduced the inverted residual block with linear bottlenecks (sketched after this list). Residual connections, similar to those in ResNet, help gradient flow in deep networks. The "inverted" structure means the block first uses a 1×1 convolution to expand the channel dimension, applies the lightweight 3×3 depthwise convolution in the expanded space, and then uses a 1×1 linear convolution (without ReLU) to project back down to a narrow output. Keeping this projection linear was found to prevent information loss in the narrow layers.
- MobileNetV3: Incorporated ideas from Neural Architecture Search (NAS), Squeeze-and-Excitation (SE) modules (a form of channel attention), and an updated block structure (using h-swish activation) to further improve accuracy and efficiency. It comes in "Large" and "Small" versions targeting different resource constraints.
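A simplified sketch of the MobileNetV2-style inverted residual block described above. The expansion ratio, ReLU6 activations, and BatchNorm placement follow the general V2 recipe, while the Squeeze-and-Excitation modules and h-swish of V3 are omitted; exact layer details here are illustrative.

```python
import torch.nn as nn

class InvertedResidual(nn.Module):
    """Expand (1x1) -> depthwise (3x3) -> linear project (1x1), with a skip
    connection when input and output shapes match."""
    def __init__(self, c_in, c_out, stride=1, expand_ratio=6):
        super().__init__()
        hidden = c_in * expand_ratio
        self.use_skip = (stride == 1 and c_in == c_out)
        self.block = nn.Sequential(
            # 1x1 expansion into a wider representation.
            nn.Conv2d(c_in, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # Lightweight 3x3 depthwise convolution in the expanded space.
            nn.Conv2d(hidden, hidden, 3, stride=stride, padding=1,
                      groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # Linear 1x1 projection back down: no activation after this layer.
            nn.Conv2d(hidden, c_out, 1, bias=False),
            nn.BatchNorm2d(c_out),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_skip else out
```

The skip connection is applied only when the stride is 1 and the input and output channel counts match, so the shapes of `x` and `out` line up.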
ShuffleNets (V1, V2)
ShuffleNets focus on practical efficiency, taking into account factors beyond FLOPs, such as memory access cost (MAC).
- ShuffleNetV1: Utilized pointwise group convolutions and channel shuffling to reduce computational cost while maintaining information flow across channel groups.
- ShuffleNetV2: Proposed guidelines for practical efficient network design, arguing that minimizing FLOPs alone isn't sufficient. It suggested balancing channel widths, avoiding excessive group convolutions (due to increased MAC), and minimizing element-wise operations. The resulting architecture uses a channel split mechanism and carefully balances operations to achieve better speed in practice.
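The channel split idea can be sketched roughly as follows (stride-1 unit only, with BatchNorm and the separate downsampling unit omitted; an even channel count is assumed):

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups):
    """Same helper as in the group-convolution sketch above."""
    n, c, h, w = x.shape
    return (x.view(n, groups, c // groups, h, w)
             .transpose(1, 2).contiguous().view(n, c, h, w))

class ShuffleV2Unit(nn.Module):
    """Stride-1 ShuffleNetV2-style unit: split channels, transform one half
    with cheap convolutions, concatenate, then shuffle."""
    def __init__(self, channels):
        super().__init__()
        half = channels // 2
        self.branch = nn.Sequential(
            nn.Conv2d(half, half, 1, bias=False),                          # pointwise
            nn.ReLU(inplace=True),
            nn.Conv2d(half, half, 3, padding=1, groups=half, bias=False),  # depthwise
            nn.Conv2d(half, half, 1, bias=False),                          # pointwise
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        left, right = x.chunk(2, dim=1)        # channel split: one half passes through
        out = torch.cat([left, self.branch(right)], dim=1)
        return channel_shuffle(out, groups=2)  # mix the two halves for the next unit
```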
EfficientNet
EfficientNet introduced compound scaling. Instead of scaling network dimensions (depth, width, resolution) independently, it proposed a principled method to scale them jointly using a compound coefficient $\phi$. Starting from a good baseline architecture (EfficientNet-B0, found via NAS), it scales depth ($\alpha^{\phi}$), width ($\beta^{\phi}$), and resolution ($\gamma^{\phi}$) together, where $\alpha, \beta, \gamma$ are constants found via grid search such that $\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$. This balanced scaling allows EfficientNets to achieve state-of-the-art accuracy with significantly fewer parameters and FLOPs compared to previous models across a range of computational budgets (B0 to B7).
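As a small numeric illustration of compound scaling, using the $\alpha = 1.2$, $\beta = 1.1$, $\gamma = 1.15$ values reported in the EfficientNet paper; note that the official B1–B7 configurations round the resulting depths, widths, and resolutions somewhat differently:

```python
# Compound scaling: a single coefficient phi scales depth, width, and
# resolution jointly, with alpha * beta**2 * gamma**2 ≈ 2.
alpha, beta, gamma = 1.2, 1.1, 1.15

def compound_scale(phi, base_resolution=224):
    depth_mult = alpha ** phi                            # more layers
    width_mult = beta ** phi                             # more channels per layer
    resolution = round(base_resolution * gamma ** phi)   # larger input images
    return depth_mult, width_mult, resolution

for phi in range(4):
    d, w, r = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution ~{r}px")
```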
Design Approaches
When choosing or designing an efficient architecture, consider:
- Target Platform: CPU, GPU, mobile GPU, DSP, or specialized hardware (like TPUs, NPUs) have different performance characteristics. An architecture optimized for one might not be optimal for another (e.g., MAC might be more limiting than FLOPs on certain mobile hardware).
- Latency vs. Throughput: Is real-time inference speed critical (latency), or is processing large batches efficiently more important (throughput)?
- Accuracy Requirements: How much accuracy can be traded for efficiency? MobileNetV3-Small vs. MobileNetV3-Large or EfficientNet-B0 vs. EfficientNet-B7 represent different points on this trade-off curve.
"* Memory Bandwidth: Operations like channel shuffling or concatenation can be memory-intensive, impacting speed."
Designing efficient architectures is an active area of research. By understanding the building blocks like depthwise separable convolutions, group convolutions, and design principles like those used in MobileNets, ShuffleNets, and EfficientNet, you can better select or adapt models for deployment in resource-constrained scenarios, complementing the model compression techniques discussed earlier in this chapter.