As introduced earlier, overfitting occurs when a model learns the training data too well, capturing noise and specific patterns that don't generalize to new data. A common strategy to combat this is regularization, which involves adding a penalty to the model's loss function based on the complexity of the model itself. L1 and L2 regularization are two widely used techniques that penalize model complexity based on the magnitude of the network's weights.
The core idea is that models with excessively large weights might be too sensitive to small changes in the input features, essentially memorizing the training data. By adding a penalty term to the loss function that increases with the size of the weights, we encourage the optimization process (like gradient descent) to find solutions that not only minimize the original prediction error but also keep the weights relatively small.
L2 regularization adds a penalty proportional to the square of the magnitude of the weights. The modified loss function looks like this:
$$\text{New Loss} = \text{Original Loss} + \lambda \sum_i w_i^2$$

Here, wᵢ are the individual weights of the network, and λ (lambda) is a hyperparameter that controls how strongly large weights are penalized.
How it Works: The L2 penalty term λ∑wᵢ² discourages large individual weights. During backpropagation, the gradient for each weight wᵢ gains an additional term proportional to wᵢ itself (2λwᵢ). This means that during the weight update step of gradient descent (wᵢ ← wᵢ − learning_rate × gradient), there is an extra push pulling each weight towards zero. This effect is why L2 regularization is often called weight decay.
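To make the weight decay effect concrete, here is a small standalone sketch of a single update step with the L2 term included. The values of w, grad_original, lr, and lam are made up for illustration:

import numpy as np

# Illustrative values: current weights, gradient of the original loss,
# learning rate, and regularization strength (lambda)
w = np.array([0.8, -1.5, 0.05])
grad_original = np.array([0.10, -0.20, 0.02])
lr = 0.1
lam = 0.01

# The L2 penalty adds 2 * lambda * w to each weight's gradient
grad_total = grad_original + 2 * lam * w

# Standard gradient-descent update: each weight is also nudged towards zero
# in proportion to its own magnitude (the "weight decay" effect)
w = w - lr * grad_total
print(w)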
Effect: L2 regularization shrinks all weights towards zero, but rarely to exactly zero. The model keeps small contributions from many features, which reduces its sensitivity to individual inputs and its tendency to overfit.
The Role of λ: The hyperparameter λ controls the trade-off between minimizing the original loss (fitting the data well) and minimizing the weight magnitudes (keeping the model simple).
Choosing the right value for λ typically involves experimentation and techniques like cross-validation.
L2 regularization tends to cluster weights closer to zero compared to an unregularized model.
L1 regularization adds a penalty proportional to the absolute value of the weights:
$$\text{New Loss} = \text{Original Loss} + \lambda \sum_i |w_i|$$

Here, wᵢ and λ have the same meaning as before: the network's weights and the regularization strength, respectively.
How it Works: The L1 penalty λ∑|wᵢ| also discourages large weights. However, the gradient contribution from the L1 term is proportional to the sign of the weight (λ × sign(wᵢ)), assuming wᵢ ≠ 0. This means that during optimization, each weight is pushed towards zero by a constant amount (determined by λ and the learning rate) regardless of its current magnitude, unlike L2, where the push shrinks as the weight approaches zero.
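As a rough comparison, the following sketch (again with illustrative values) contrasts the per-weight push from the L1 term, which has constant magnitude λ, with the push from the L2 term, which fades as a weight approaches zero:

import numpy as np

w = np.array([0.8, -1.5, 0.05])
lam = 0.01

# L1 push: constant magnitude lambda, direction given by the weight's sign
l1_push = lam * np.sign(w)   # [ 0.01, -0.01,  0.01]

# L2 push: proportional to the weight itself, so it fades near zero
l2_push = 2 * lam * w        # [ 0.016, -0.03,  0.001]

print(l1_push, l2_push)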
Effect: Because the push towards zero has constant magnitude, L1 regularization tends to drive many weights to exactly zero. The result is a sparse model that effectively ignores some inputs, acting as a form of built-in feature selection.
Shapes of the L1 (diamond) and L2 (circle) constraint regions. Optimization finds a point where a contour of the loss function touches the constraint region; L1's sharp corners make this contact more likely to occur on an axis, where one weight is exactly zero.
The Role of λ: Similar to L2, λ in L1 controls the strength of the regularization. A larger λ leads to more weights being driven to zero, resulting in a sparser model.
In some cases, Elastic Net regularization, which combines both L1 and L2 penalties, is used to get benefits from both approaches.
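Written out, with separate strengths λ₁ and λ₂ for the two penalties (one common parameterization among several), the Elastic Net loss takes the form:

$$\text{New Loss} = \text{Original Loss} + \lambda_1 \sum_i |w_i| + \lambda_2 \sum_i w_i^2$$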
Most deep learning frameworks make adding L1 or L2 regularization straightforward. For L2 (weight decay), it's often a simple parameter in the optimizer:
# Example in PyTorch using the AdamW optimizer for L2 regularization (weight decay)
import torch
import torch.optim as optim

# Assume 'model' is your defined neural network
learning_rate = 1e-3
regularization_strength = 1e-4  # This is lambda for L2 (weight decay)

# AdamW incorporates weight decay directly into its update rule
optimizer = optim.AdamW(
    model.parameters(),
    lr=learning_rate,
    weight_decay=regularization_strength,
)

# --- Training loop would go here ---
# loss.backward()
# optimizer.step()
For L1, or for more fine-grained control over L2, you can instead add the penalty term directly to the computed loss before calling loss.backward(), as in the sketch below.
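A minimal sketch, assuming the same model and optimizer as above inside a standard training loop; the names criterion, inputs, targets, and the value of l1_strength are illustrative:

# Adding an L1 penalty manually before the backward pass
l1_strength = 1e-5  # lambda for L1 (illustrative value)

optimizer.zero_grad()
outputs = model(inputs)
loss = criterion(outputs, targets)

# Sum of absolute values across all of the model's parameters
l1_penalty = sum(param.abs().sum() for param in model.parameters())
loss = loss + l1_strength * l1_penalty

loss.backward()
optimizer.step()

Note that this version penalizes every parameter, including biases; in practice it is common to restrict the penalty to the weight matrices only.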
Regularization techniques like L1 and L2 are important tools for preventing overfitting. By adding a penalty based on weight magnitudes to the loss function, they encourage simpler models that often generalize better to unseen data. Remember that the regularization strength λ is a hyperparameter that usually requires tuning for optimal performance.