Previous sections discussed architectures like ResNet and DenseNet, which achieved significant performance gains primarily by increasing network depth and improving gradient flow. However, simply making networks deeper or wider doesn't always lead to the best performance or efficiency. The question arose: Is there a more principled way to scale up Convolutional Neural Networks (CNNs) to achieve better accuracy and efficiency? EfficientNet provides a compelling answer through a strategy called compound scaling.
Traditionally, CNNs have been scaled up in one of three dimensions:

- **Depth:** adding more layers (as in ResNet).
- **Width:** adding more channels per layer (as in Wide ResNet variants).
- **Resolution:** feeding the network larger input images.
Scaling only one of these dimensions tends to give diminishing returns. For instance, an excessively deep network with unchanged width and input resolution may struggle to capture fine spatial detail. Similarly, widening a shallow network does not let it build the complex feature hierarchies that depth provides. The authors of EfficientNet observed that these dimensions are interdependent and should be balanced for the best results.
The central idea behind EfficientNet is compound scaling. Instead of arbitrarily scaling one dimension, EfficientNet proposes scaling network depth (d), width (w), and image resolution (r) simultaneously using a fixed set of scaling coefficients.
The scaling is controlled by a single compound coefficient, ϕ. Given a baseline network (EfficientNet-B0), scaling up involves increasing ϕ. Depth, width, and resolution are then scaled according to these rules:

- depth: d = α^ϕ
- width: w = β^ϕ
- resolution: r = γ^ϕ
Subject to the constraint:

α · β² · γ² ≈ 2, with α ≥ 1, β ≥ 1, γ ≥ 1

Here, α, β, and γ are constants determined through a small grid search on the baseline model; they govern how strongly each dimension is scaled. The constraint α · β² · γ² ≈ 2 ensures that each unit increase in ϕ roughly doubles the total floating-point operations (FLOPS): the cost of a convolutional network grows linearly with depth but quadratically with width and with resolution, so total FLOPS grow as (α · β² · γ²)^ϕ ≈ 2^ϕ.
The intuition is that if the input image resolution (r) is increased, the network needs more layers (depth d) to increase the receptive field and more channels (width w) to capture finer patterns in the larger input image. Compound scaling provides a systematic way to balance these factors.
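The scaling rules above can be sketched in a few lines of Python. The constants α = 1.2, β = 1.1, γ = 1.15 are the grid-search values reported in the EfficientNet paper; rounding the resulting multipliers to usable layer counts and image sizes is omitted here.

```python
# Grid-search constants from the EfficientNet paper (for the B0 baseline).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def compound_scale(phi):
    """Return depth/width/resolution multipliers for compound coefficient phi."""
    return {
        "depth": ALPHA ** phi,        # d = alpha^phi
        "width": BETA ** phi,         # w = beta^phi
        "resolution": GAMMA ** phi,   # r = gamma^phi
    }

def flops_factor(phi):
    """Total FLOPS grow roughly as (alpha * beta^2 * gamma^2)^phi ≈ 2^phi."""
    return (ALPHA * BETA ** 2 * GAMMA ** 2) ** phi

# The constraint alpha * beta^2 * gamma^2 ≈ 2 holds for these constants:
print(ALPHA * BETA ** 2 * GAMMA ** 2)  # ≈ 1.92, close to 2
```

Note that β and γ appear squared in the FLOPS factor, matching the constraint: widening every layer or enlarging the input multiplies cost quadratically, while adding layers multiplies it linearly.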
The effectiveness of compound scaling relies on having a strong, efficient baseline architecture. EfficientNet uses a baseline network (EfficientNet-B0) found through Neural Architecture Search (NAS). This search optimized for both accuracy and FLOPS.
The core building block of EfficientNet-B0 is the Mobile Inverted Bottleneck Convolution (MBConv), similar to the one used in MobileNetV2, enhanced with Squeeze-and-Excitation (SE) optimization. An MBConv block typically includes:

- a 1×1 expansion convolution that increases the number of channels (by an expansion factor, commonly 6)
- a depthwise 3×3 or 5×5 convolution operating on the expanded channels
- a Squeeze-and-Excitation module that adaptively reweights channels
- a 1×1 projection convolution that reduces the channels back to the output count
- a residual (skip) connection when the stride is 1 and input and output shapes match
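A minimal MBConv sketch in PyTorch illustrates this structure. This is a simplified illustration, not the official implementation: details such as drop-connect, BatchNorm momentum settings, and same-padding conventions are omitted.

```python
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    """Reweight channels using globally pooled statistics."""
    def __init__(self, channels, reduced):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.gate = nn.Sequential(
            nn.Conv2d(channels, reduced, 1), nn.SiLU(),
            nn.Conv2d(reduced, channels, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.gate(self.pool(x))

class MBConv(nn.Module):
    """Simplified Mobile Inverted Bottleneck block with SE."""
    def __init__(self, in_ch, out_ch, expand_ratio=6, kernel=3, stride=1):
        super().__init__()
        mid = in_ch * expand_ratio
        self.use_residual = stride == 1 and in_ch == out_ch
        layers = []
        if expand_ratio != 1:  # 1x1 expansion
            layers += [nn.Conv2d(in_ch, mid, 1, bias=False),
                       nn.BatchNorm2d(mid), nn.SiLU()]
        layers += [
            # depthwise convolution (groups == channels)
            nn.Conv2d(mid, mid, kernel, stride, kernel // 2,
                      groups=mid, bias=False),
            nn.BatchNorm2d(mid), nn.SiLU(),
            SqueezeExcite(mid, max(1, in_ch // 4)),
            # 1x1 projection back down to the output channel count
            nn.Conv2d(mid, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        ]
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_residual else out
```

The inverted-bottleneck name comes from expanding channels first and projecting back down at the end, the reverse of a classic ResNet bottleneck.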
By starting with EfficientNet-B0 and applying the compound scaling rules with increasing values of the compound coefficient ϕ (typically integer values starting from 1), a family of models (EfficientNet-B1, B2, ..., B7, etc.) is generated. Each subsequent model uses approximately twice the FLOPS of the previous one but aims for higher accuracy while maintaining high parameter efficiency.
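When generating these models, the scaled width and depth must be rounded to usable integers. Reference implementations commonly round channel counts to a multiple of 8 and take the ceiling of scaled repeat counts; the helpers below are a sketch modeled on that convention (the function names are illustrative).

```python
import math

def round_filters(filters, width_coefficient, divisor=8):
    """Scale a channel count by the width coefficient, rounding to a
    multiple of `divisor` without dropping more than 10% of the target."""
    scaled = filters * width_coefficient
    new = max(divisor, int(scaled + divisor / 2) // divisor * divisor)
    if new < 0.9 * scaled:  # avoid rounding down too aggressively
        new += divisor
    return int(new)

def round_repeats(repeats, depth_coefficient):
    """Scale a layer-repeat count by the depth coefficient, rounding up."""
    return int(math.ceil(depth_coefficient * repeats))

# Example: a 32-channel layer under a width coefficient of 1.4
# becomes 48 channels (44.8 rounded to the nearest multiple of 8).
```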
Comparison showing how EfficientNet models (blue) typically achieve higher accuracy for a given amount of computation compared to models scaled conventionally (gray).
EfficientNets demonstrated state-of-the-art performance on ImageNet and several other transfer learning benchmarks upon their release, often achieving comparable accuracy to much larger models with significantly fewer parameters and lower computational cost (FLOPS).
Some considerations include:

- Depthwise convolutions have low arithmetic intensity and often achieve poor hardware utilization, so FLOPS reductions do not always translate into proportional wall-clock speedups on GPUs.
- The larger variants (B5–B7) use high input resolutions, which substantially increases activation memory during training.
- The constants α, β, and γ were found by a grid search on the small B0 baseline, so they are not guaranteed to remain optimal at very large values of ϕ.
EfficientNet represents a significant step in designing effective and scalable CNN architectures. By carefully balancing depth, width, and resolution through compound scaling, it provides a powerful framework for developing models that push the boundaries of accuracy and efficiency in computer vision. Pre-trained versions are widely available in frameworks like TensorFlow (via `tf.keras.applications.EfficientNetB0` through `EfficientNetB7`) and PyTorch (often via third-party libraries like `timm`), making them readily usable for practical applications.
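For instance, the Keras variant can be instantiated in one line. Here `weights=None` builds the architecture with random initialization to avoid downloading the pretrained weights; pass `weights="imagenet"` to load the pretrained model instead.

```python
import tensorflow as tf

# Build the B0 architecture with random initialization.
# Use weights="imagenet" to download and load pretrained weights.
model = tf.keras.applications.EfficientNetB0(weights=None)

# B0 is small by modern standards: roughly 5.3M parameters.
print(model.count_params())
```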
© 2025 ApX Machine Learning