While standard regularization methods like L2 weight decay and basic Dropout are fundamental tools for combating overfitting, training the very deep and complex CNNs discussed in this course often benefits from more refined approaches. As models grow deeper and wider, they gain expressive capacity but also become more susceptible to overfitting the training data and sensitive to the specifics of the training process. This section examines advanced regularization techniques designed to improve the generalization capabilities of modern CNNs.
Standard Dropout randomly sets a fraction of neuron activations to zero during training. While effective in fully connected layers, its application in convolutional layers, where adjacent pixels or feature map activations often share correlated information, can be suboptimal. Simply dropping individual activations might not introduce sufficient noise or prevent co-adaptation effectively due to the strong spatial structure. Advanced Dropout variants address this.
Instead of dropping individual activation values, Spatial Dropout (sometimes called 2D Dropout) randomly drops entire feature maps (channels) during training. If a specific channel is selected for dropout, all activations within that feature map are set to zero.
Consider a feature map tensor with shape (batch_size, height, width, channels). Standard Dropout would operate independently on each element within the height * width * channels dimensions. Spatial Dropout, however, applies the same dropout mask across the height and width dimensions for a given channel.
Comparison showing standard dropout acting pixel-wise versus spatial dropout acting channel-wise. Gray squares indicate dropped units or channels.
This approach encourages the network to learn redundant representations across different feature maps, making it more resilient to the absence of entire channels and better suited for convolutional layers processing spatially correlated data.
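If you are working in PyTorch, the built-in nn.Dropout2d layer behaves this way, zeroing whole channels at once. Below is a minimal sketch comparing it with standard Dropout; note that PyTorch uses a channels-first (batch, channels, height, width) layout, unlike the channels-last shape described above.

```python
import torch
import torch.nn as nn

# nn.Dropout2d implements Spatial Dropout: it zeroes entire feature maps
# (channels) rather than individual activations.
spatial_drop = nn.Dropout2d(p=0.2)
standard_drop = nn.Dropout(p=0.2)

x = torch.randn(4, 16, 32, 32)  # (batch, channels, height, width)

spatial_drop.train()
standard_drop.train()

y_spatial = spatial_drop(x)
y_standard = standard_drop(x)

# With Spatial Dropout, a dropped channel is zero at every spatial position;
# with standard Dropout, zeros are scattered across all channels.
zeroed_channels = (y_spatial[0].abs().sum(dim=(1, 2)) == 0).sum().item()
scattered_zeros = (y_standard[0] == 0).sum().item()
print(f"Spatial Dropout: {zeroed_channels} of 16 channels fully zeroed")
print(f"Standard Dropout: {scattered_zeros} individual activations zeroed")
```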
DropConnect is another variation where, instead of zeroing out activations (outputs of neurons), it randomly sets weights within the network to zero during the forward pass. Each connection between layers has a probability of being dropped.
While Dropout applies a mask to the activations of a layer, $y = a(Wx + b)$, DropConnect applies binary masks directly to the weights $W$ and biases $b$:

$$y = a\big((M_W \odot W)x + (M_b \odot b)\big)$$

Here, $\odot$ denotes element-wise multiplication, and $M_W, M_b$ are binary masks sampled for each training example. DropConnect can be seen as a more general form of regularization than Dropout, potentially introducing more noise and requiring careful tuning. It is also more computationally intensive than standard Dropout, because it must sample a separate mask over the weights for every example rather than a single mask over the activations.
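Most frameworks do not ship a ready-made DropConnect layer, so the following is an illustrative PyTorch sketch of a DropConnect linear layer rather than a library API. For simplicity it samples one weight mask per mini-batch instead of a fresh mask per example as in the formulation above, and it rescales surviving weights so that expected pre-activations match inference; both are simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DropConnectLinear(nn.Module):
    """Linear layer with DropConnect: randomly zeroes individual weights
    (and biases) during training instead of zeroing activations."""

    def __init__(self, in_features, out_features, drop_prob=0.5):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        self.drop_prob = drop_prob

    def forward(self, x):
        if not self.training or self.drop_prob == 0.0:
            return self.linear(x)
        keep_prob = 1.0 - self.drop_prob
        # Sample binary masks M_W and M_b for the weights and biases
        # (one mask per mini-batch here, a common simplification).
        w_mask = torch.bernoulli(torch.full_like(self.linear.weight, keep_prob))
        b_mask = torch.bernoulli(torch.full_like(self.linear.bias, keep_prob))
        # Scale by 1/keep_prob so the expected pre-activation matches inference.
        w = self.linear.weight * w_mask / keep_prob
        b = self.linear.bias * b_mask / keep_prob
        return F.linear(x, w, b)

# Usage: drop-in replacement for an nn.Linear in a classifier head.
layer = DropConnectLinear(512, 10, drop_prob=0.3)
out = layer(torch.randn(8, 512))
```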
Classification models are typically trained using one-hot encoded labels and a cross-entropy loss function. This encourages the model to produce output probabilities that are extremely close to 1 for the correct class and 0 for all incorrect classes. For example, for a 3-class problem with the true label being class 1, the target probability vector is [1,0,0]. The model is penalized heavily if it assigns even a small probability to the incorrect classes.
While this seems intuitive, it can lead to issues: the model is pushed to produce ever-larger logits for the correct class, which encourages overconfidence, hurts calibration, and can cause it to overfit, especially when some training labels are noisy.
Label Smoothing addresses this by replacing the hard 0 and 1 targets with "softer" probabilities. Instead of demanding a probability of 1.0 for the correct class, we assign it a target probability slightly less than 1, like 1−α. The remaining probability mass α is distributed evenly among the incorrect classes.
For a classification problem with $K$ classes, if the original one-hot target for an example is $y_k = 1$ for the true class $k_{\text{true}}$ and $y_k = 0$ for $k \neq k_{\text{true}}$, the smoothed label $y'_k$ becomes:

$$y'_k = y_k(1 - \alpha) + \frac{\alpha}{K}$$

Let's illustrate with an example. Suppose we have $K = 5$ classes, the true class is index 2, and we use a smoothing factor $\alpha = 0.1$. The target for the true class becomes $1 \cdot (1 - 0.1) + 0.1/5 = 0.92$, while each of the other four classes receives $0 + 0.1/5 = 0.02$, and the targets still sum to 1.
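The same calculation can be written in a few lines; this is a minimal PyTorch sketch of the smoothing step itself (the variable names are illustrative):

```python
import torch

K = 5            # number of classes
alpha = 0.1      # smoothing factor
true_class = 2

# Hard one-hot target: [0, 0, 1, 0, 0]
y = torch.zeros(K)
y[true_class] = 1.0

# Smoothed target: y'_k = y_k * (1 - alpha) + alpha / K
y_smooth = y * (1 - alpha) + alpha / K
print(y_smooth)
# tensor([0.0200, 0.0200, 0.9200, 0.0200, 0.0200])
```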
When training with cross-entropy loss on these smoothed targets, the model is discouraged from producing extremely large logit values for the correct class relative to the others. Instead, it is encouraged to keep the gaps between logits bounded, which leads to a model that is better calibrated (its confidence scores are more indicative of the actual likelihood of being correct) and that often generalizes slightly better.
A typical value for the smoothing factor α is 0.1, but it can be tuned as a hyperparameter. Label Smoothing Regularization (LSR) is widely used when training state-of-the-art image classification models.
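In practice, you rarely need to build smoothed targets by hand. For example, recent PyTorch releases (1.10 and later) expose label smoothing directly on the cross-entropy loss; a minimal sketch:

```python
import torch
import torch.nn as nn

# Cross-entropy with label smoothing applied internally (alpha = 0.1).
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

logits = torch.randn(8, 5, requires_grad=True)  # (batch, num_classes)
targets = torch.randint(0, 5, (8,))             # hard integer class labels
loss = criterion(logits, targets)
loss.backward()
```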
It's important to note that these advanced regularization techniques interact with other components of the training process, such as Batch Normalization, data augmentation, and weight decay, so the effective amount of regularization depends on the combination rather than on any single technique.
Choosing the right combination and strength of regularization techniques often involves experimentation. Monitoring validation loss and accuracy is essential to find the configuration that prevents overfitting without excessively hindering the model's ability to learn from the training data. These advanced methods provide valuable additions to the deep learning practitioner's toolkit for building more effective and reliable CNN models.