As we saw earlier, overfitting occurs when a network learns the training data too well, including its noise and specific idiosyncrasies, failing to generalize to new data. One effective strategy to combat this is regularization, which involves adding a penalty to the loss function based on the magnitude of the network's weights. The intuition is that complex models often have large weights, while simpler, more generalizable models tend to have smaller weights. By penalizing large weights, we encourage the network to find simpler solutions.
Two common and powerful regularization techniques directly inspired by linear models are L1 and L2 regularization.
L2 regularization adds a penalty proportional to the square of each weight $w_i$ to the original loss function $L_{original}$. The modified loss function, $L_{L2}$, becomes:

$$L_{L2} = L_{original} + \frac{\lambda}{2m}\sum_i w_i^2$$

Here, $\sum_i w_i^2$ is the sum of the squared weights, $m$ is the number of training examples, and $\lambda$ is a hyperparameter controlling the strength of the penalty.
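For a quick sense of scale: with $\lambda = 0.1$, $m = 100$ training examples, and just two weights $w_1 = 0.5$ and $w_2 = -2$, the added penalty is $\frac{0.1}{200}(0.25 + 4) \approx 0.0021$. The larger weight contributes sixteen times as much to the penalty as the smaller one, so the optimizer has a much stronger incentive to shrink it.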
How it works: During backpropagation, the gradient of this penalty term with respect to a weight $w_i$ is $\frac{\lambda}{m} w_i$. When updating the weights using gradient descent, this term gets subtracted:
$$w_i := w_i - \alpha\left(\frac{\partial L_{original}}{\partial w_i} + \frac{\lambda}{m} w_i\right)$$

Notice the term $-\alpha \frac{\lambda}{m} w_i$. This term effectively pushes the weight $w_i$ slightly towards zero in each update step, proportional to its current value. This is why L2 regularization is often called Weight Decay. It encourages the network to use smaller weights, distributing the importance across many neurons rather than relying heavily on a few large weights. This generally leads to smoother decision boundaries and better generalization.
The quadratic penalty ($w^2$) imposed by L2 regularization. The penalty grows significantly as weights move away from zero, strongly discouraging large weight values.
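The following minimal sketch (plain NumPy, with illustrative values for the weights, gradients, and hyperparameters) computes the L2 penalty and applies a single gradient descent step with the decay term included:

```python
import numpy as np

# Illustrative values: current weights, their gradients from backprop, and hyperparameters.
w = np.array([0.5, -2.0, 1.5])               # current weights
grad_original = np.array([0.1, -0.3, 0.2])   # dL_original / dw
lam, m, alpha = 0.1, 100, 0.01               # penalty strength, training examples, learning rate

# L2 penalty added to the loss: (lambda / 2m) * sum(w^2)
l2_penalty = (lam / (2 * m)) * np.sum(w ** 2)

# The penalty's gradient is (lambda / m) * w, so each weight is pulled toward zero
# in proportion to its current value -- the "weight decay" effect.
w = w - alpha * (grad_original + (lam / m) * w)
```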
L1 regularization takes a different approach. It adds a penalty proportional to the absolute value of each weight to the loss function:
$$L_{L1} = L_{original} + \frac{\lambda}{m}\sum_i |w_i|$$

Here, $\sum_i |w_i|$ is the sum of the absolute values of all the weights, and $\lambda$ again controls the penalty strength.
How it works: The gradient of the L1 penalty term is $\frac{\lambda}{m}\,\mathrm{sign}(w_i)$, where $\mathrm{sign}(w_i)$ is $+1$ if $w_i$ is positive, $-1$ if $w_i$ is negative, and $0$ if $w_i$ is zero (the gradient is technically undefined at exactly zero, but implementations handle this case). The weight update looks like:
$$w_i := w_i - \alpha\left(\frac{\partial L_{original}}{\partial w_i} + \frac{\lambda}{m}\,\mathrm{sign}(w_i)\right)$$

The key difference from L2 is that L1 subtracts a constant amount, $\alpha\frac{\lambda}{m}$, pushing the weight towards zero regardless of the weight's current magnitude (as long as it is not zero). This constant push can drive weights to exactly zero and keep them there. Consequently, L1 regularization often leads to sparse models, in which many weights are zero. This can be interpreted as a form of automatic feature selection, since neurons connected by zero weights effectively become inactive for those inputs.
The linear penalty ($|w|$) imposed by L1 regularization. The penalty increases linearly as weights move away from zero, and its constant gradient encourages sparsity by pushing small weights directly to zero.
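A matching sketch for L1, using the same illustrative values, shows the constant-size push toward zero; note that `np.sign` returns 0 at exactly zero, matching the convention above:

```python
import numpy as np

w = np.array([0.5, -2.0, 0.0])               # current weights (one already exactly zero)
grad_original = np.array([0.1, -0.3, 0.0])   # dL_original / dw
lam, m, alpha = 0.1, 100, 0.01               # penalty strength, training examples, learning rate

# L1 penalty added to the loss: (lambda / m) * sum(|w|)
l1_penalty = (lam / m) * np.sum(np.abs(w))

# The penalty's gradient is (lambda / m) * sign(w): the push toward zero has the same
# size whether a weight is 2.0 or 0.002, which is what drives weights to exactly zero.
w = w - alpha * (grad_original + (lam / m) * np.sign(w))
```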
In practice, both L1 and L2 regularization can be added directly within most deep learning frameworks when defining layers or optimizers.
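For example, in Keras a per-layer penalty can be attached through the `kernel_regularizer` argument (the layer sizes and penalty strengths below are placeholders); in PyTorch, the optimizer's `weight_decay` argument applies an L2-style penalty to all parameters.

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

# Placeholder architecture: L2 on the first hidden layer, L1 on the second.
model = keras.Sequential([
    layers.Dense(64, activation="relu",
                 kernel_regularizer=regularizers.l2(0.01)),
    layers.Dense(64, activation="relu",
                 kernel_regularizer=regularizers.l1(0.001)),
    layers.Dense(1),
])
```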
The choice of the regularization hyperparameter $\lambda$ is important: if $\lambda$ is too small, the penalty has little effect and overfitting persists; if it is too large, the weights are forced too close to zero and the model underfits.
Finding a good value for λ typically requires experimentation. It's usually tuned using the validation set, often through techniques like grid search or random search, which we will discuss later in this chapter under hyperparameter tuning.
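As a rough sketch of such a search, assuming hypothetical `build_and_train` and `evaluate` helpers that train a model at a given penalty strength and return its validation loss:

```python
# Hypothetical helpers: build_and_train(lam) trains a model with penalty strength lam;
# evaluate(model, X_val, y_val) returns its loss on the validation set.
candidate_lambdas = [0.0, 1e-4, 1e-3, 1e-2, 1e-1]

results = {}
for lam in candidate_lambdas:
    model = build_and_train(lam)
    results[lam] = evaluate(model, X_val, y_val)

best_lambda = min(results, key=results.get)  # lowest validation loss wins
```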
By adding these penalties, L1 and L2 regularization provide effective mechanisms to control model complexity, discourage overfitting, and improve the generalization performance of your neural networks on unseen data. They are fundamental tools in the deep learning practitioner's toolkit.