In the previous chapter, we saw how models can sometimes fit the training data too well, capturing noise and specific patterns that don't exist in the broader dataset. This overfitting hurts the model's ability to generalize. Weight regularization offers a direct way to combat this by controlling the complexity of the model itself. The intuition is straightforward: overly complex models often rely on very large weight values to precisely fit the training examples, including the noise. By discouraging large weights, we encourage simpler models.
L2 regularization, perhaps the most common form of weight regularization, implements this idea by adding a penalty to the model's loss function. This penalty is proportional to the sum of the squares of all the weights in the network. It's also frequently referred to as Weight Decay.
Let's consider a standard loss function, like Mean Squared Error (MSE) or Cross-Entropy, which we'll denote as L_original(W), where W represents all the weights in our network. L2 regularization modifies this objective by adding a penalty term:
L_total(W) = L_original(W) + (λ/2) ∑_i w_i²

Here:

- L_original(W) is the unregularized loss (e.g., MSE or Cross-Entropy).
- The sum runs over every weight w_i in the network (biases are usually excluded from the penalty).
- λ (lambda) is a non-negative hyperparameter that controls the strength of the regularization.
- The factor of 1/2 is a convention that cancels when we take the gradient.
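As a concrete sketch, the penalty can be computed directly from the weight arrays. The function names (`l2_penalty`, `total_loss`) and the toy values below are illustrative, not part of any particular framework:

```python
import numpy as np

def l2_penalty(weights, lam):
    """(lambda / 2) * sum of squared weights.

    `weights` is a list of arrays, one per layer;
    `lam` is the regularization strength.
    """
    return 0.5 * lam * sum(np.sum(w ** 2) for w in weights)

def total_loss(original_loss, weights, lam):
    # L_total = L_original + (lambda / 2) * sum_i w_i^2
    return original_loss + l2_penalty(weights, lam)

weights = [np.array([1.0, -2.0]), np.array([[0.5]])]
# sum of squares = 1 + 4 + 0.25 = 5.25; penalty = 0.5 * 0.1 * 5.25 = 0.2625
print(total_loss(1.0, weights, lam=0.1))  # → 1.2625
```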
How does this new loss function change the training process? During backpropagation, we calculate gradients of the loss function with respect to the weights to update them using an optimizer like SGD. The gradient of the total loss Ltotal with respect to a specific weight wj now includes an extra term derived from the regularization penalty:
∂L_total/∂w_j = ∂L_original/∂w_j + λ w_j

Look closely at the second term, λw_j. When we update the weight w_j using gradient descent with learning rate η (eta), the update rule becomes:
w_j ← w_j − η (∂L_original/∂w_j + λ w_j)

Rearranging this slightly gives:
w_j ← w_j − η ∂L_original/∂w_j − η λ w_j

w_j ← (1 − ηλ) w_j − η ∂L_original/∂w_j

This final form reveals why L2 regularization is called Weight Decay. In each update step, before subtracting the original gradient component η ∂L_original/∂w_j, the weight w_j is multiplied by (1 − ηλ). Since the learning rate η and the regularization strength λ are both positive and their product is small in practice, this factor is slightly less than 1. Each weight is therefore shrunk, or decayed, towards zero at every update, in addition to being adjusted by the original loss gradient.
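The two forms of the update rule above can be checked numerically. Below is a minimal NumPy sketch (the helper names are invented for illustration) confirming that subtracting η(∂L_original/∂w + λw) and first decaying by (1 − ηλ) produce the same result:

```python
import numpy as np

def sgd_step_l2(w, grad_original, lr, lam):
    """One SGD step on the L2-regularized loss:
    w <- w - lr * (grad_original + lam * w)
    """
    return w - lr * (grad_original + lam * w)

def sgd_step_decay(w, grad_original, lr, lam):
    """Equivalent 'decay' form: shrink w by (1 - lr*lam) first,
    then subtract the original gradient step."""
    return (1.0 - lr * lam) * w - lr * grad_original

w = np.array([2.0, -3.0])
g = np.array([0.5, 0.1])
a = sgd_step_l2(w, g, lr=0.1, lam=0.01)
b = sgd_step_decay(w, g, lr=0.1, lam=0.01)
print(np.allclose(a, b))  # → True
```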
The practical consequence of this weight decay is that the optimization process favors solutions where weight magnitudes are smaller and more evenly distributed. It discourages any single weight from growing excessively large. Models with smaller weights tend to exhibit smoother behavior. Imagine fitting a curve to data points; a model with large weights might produce a very "wiggly" curve that hits every training point perfectly but behaves erratically elsewhere. A model trained with L2 regularization would prefer a smoother curve that captures the general trend, even if it doesn't fit every training point exactly. This smoothness often translates directly to better generalization on unseen data.
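To make this smoothness intuition concrete, here is a small illustration using ridge regression, the closed-form analogue of L2 regularization for linear models: a high-degree polynomial is fit to noisy data with and without the penalty. The data and values are invented for illustration; the point is that the regularized solution has a much smaller weight norm, which corresponds to a smoother fitted curve:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 15)
y = x ** 2 + 0.1 * rng.standard_normal(x.size)  # noisy quadratic

# Degree-9 polynomial features: columns x^0 .. x^9.
X = np.vander(x, 10, increasing=True)

def ridge_fit(X, y, lam):
    """Closed-form minimizer of ||Xw - y||^2 + lam * ||w||^2."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

w_plain = ridge_fit(X, y, lam=0.0)  # unregularized least squares
w_l2 = ridge_fit(X, y, lam=1.0)    # L2-regularized

print(np.linalg.norm(w_plain), np.linalg.norm(w_l2))
# The regularized weight vector has a far smaller norm,
# so the fitted polynomial is far less "wiggly".
```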
The strength of this effect is controlled by the hyperparameter λ. Setting λ = 0 recovers the original, unregularized loss; a small λ applies gentle shrinkage; a large λ shrinks weights aggressively and can cause underfitting if the model becomes too constrained to capture the underlying signal. In practice, λ is tuned on a validation set.
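The effect of λ can be seen in isolation by applying only the decay factor, i.e., gradient descent on the penalty term alone. In this sketch (all values made up for illustration), iterating w ← (1 − ηλ)w shows how a larger λ drives the weights towards zero much faster:

```python
import numpy as np

def decay_only(w0, lr, lam, steps):
    """Repeatedly apply the decay factor (1 - lr*lam),
    i.e., gradient descent on the L2 penalty alone."""
    w = np.array(w0, dtype=float)
    for _ in range(steps):
        w = (1.0 - lr * lam) * w
    return w

w0 = [1.0, -2.0]
weak = decay_only(w0, lr=0.1, lam=0.01, steps=100)   # factor 0.999 per step
strong = decay_only(w0, lr=0.1, lam=1.0, steps=100)  # factor 0.9 per step
print(np.linalg.norm(weak), np.linalg.norm(strong))
# The larger lambda leaves a far smaller weight norm
# after the same number of steps.
```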
In summary, L2 regularization (Weight Decay) adds a penalty proportional to the sum of squared weights to the loss function. This alters the gradient update rule, causing weights to shrink towards zero during training, which leads to simpler, smoother models that are less prone to overfitting.
© 2025 ApX Machine Learning