Models can sometimes fit the training data too well, capturing noise and idiosyncratic patterns that do not hold in the broader data distribution. This phenomenon, known as overfitting, significantly impairs a model's ability to generalize to new, unseen data. Weight regularization offers a direct way to combat this by controlling the complexity of the model itself. The intuition is straightforward: overly complex models often rely on very large weight values to precisely fit the training examples, including the noise. By discouraging large weights, we encourage simpler models.
L2 regularization, perhaps the most common form of weight regularization, implements this idea by adding a penalty to the model's loss function. This penalty is proportional to the sum of the squares of all the weights in the network. While often used interchangeably with the term Weight Decay, the two are technically distinct concepts that happen to be equivalent under specific optimization conditions.
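Let's examine a standard loss function, like Mean Squared Error (MSE) or Cross-Entropy, which we'll denote as $L_{original}(w)$, where $w$ represents all the weights in our network. L2 regularization modifies this objective by adding a penalty term:

$$L_{total}(w) = L_{original}(w) + \frac{\lambda}{2} \sum_i w_i^2$$

Here:
- $L_{total}(w)$ is the new loss that is minimized during training.
- $L_{original}(w)$ is the original, unregularized loss.
- $\lambda$ is the regularization strength, a non-negative hyperparameter.
- $\sum_i w_i^2$ is the sum of the squares of all the weights, i.e. the squared L2 norm of $w$.
- The factor $\frac{1}{2}$ is a common convention that simplifies the gradient of the penalty.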
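As a minimal sketch, assuming the penalty takes the common form $\frac{\lambda}{2}\sum_i w_i^2$, the regularized loss can be computed in plain Python as follows. The function names and the numeric values are hypothetical and chosen only for illustration; they do not come from any particular library.

```python
def l2_penalty(weights, lam):
    """Return (lam / 2) * sum of squared weights."""
    return 0.5 * lam * sum(w * w for w in weights)


def l2_regularized_loss(data_loss, weights, lam):
    """Add the L2 penalty to an already-computed data loss (e.g. MSE)."""
    return data_loss + l2_penalty(weights, lam)


# Hypothetical values, for illustration only.
weights = [0.5, -2.0, 3.0]
data_loss = 0.8          # stand-in for the unregularized loss value
lam = 0.01               # regularization strength

total = l2_regularized_loss(data_loss, weights, lam)
print(f"penalty = {l2_penalty(weights, lam):.5f}, total loss = {total:.5f}")
```

Note that the largest weight (3.0) contributes most of the penalty, which is why the squared term pushes hardest against large weights.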
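How does this new loss function change the training process? During backpropagation, we calculate gradients of the loss function with respect to the weights to update them using an optimizer. The gradient of the total loss with respect to a specific weight $w_i$ now includes an extra term derived from the regularization penalty. Differentiating the penalty $\frac{\lambda}{2}\sum_i w_i^2$ with respect to $w_i$ gives $\lambda w_i$, so:

$$\frac{\partial L_{total}}{\partial w_i} = \frac{\partial L_{original}}{\partial w_i} + \lambda w_i$$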
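One way to convince yourself that the penalty contributes exactly $\lambda w_i$ to each gradient component is a finite-difference check. The sketch below assumes a penalty of $\frac{\lambda}{2}\sum_i w_i^2$ and uses a hypothetical stand-in for the original loss (squared error of a two-weight linear model on a single made-up example).

```python
# Hypothetical single training example, for illustration only.
X_EXAMPLE = (1.0, 2.0)
Y_EXAMPLE = 3.0


def data_loss(w):
    """Stand-in for the unregularized loss: 0.5 * (prediction - target)^2."""
    pred = w[0] * X_EXAMPLE[0] + w[1] * X_EXAMPLE[1]
    return 0.5 * (pred - Y_EXAMPLE) ** 2


def data_grad(w):
    """Analytic gradient of data_loss with respect to each weight."""
    err = w[0] * X_EXAMPLE[0] + w[1] * X_EXAMPLE[1] - Y_EXAMPLE
    return [err * X_EXAMPLE[0], err * X_EXAMPLE[1]]


def total_loss(w, lam):
    """Data loss plus the L2 penalty (lam / 2) * sum(w_i^2)."""
    return data_loss(w) + 0.5 * lam * sum(wi * wi for wi in w)


def total_grad(w, lam):
    """Data gradient plus the extra regularization term lam * w_i."""
    return [g + lam * wi for g, wi in zip(data_grad(w), w)]


def numerical_grad(f, w, eps=1e-6):
    """Central finite-difference approximation of the gradient of f at w."""
    grads = []
    for i in range(len(w)):
        up, down = list(w), list(w)
        up[i] += eps
        down[i] -= eps
        grads.append((f(up) - f(down)) / (2 * eps))
    return grads


w = [0.5, -1.0]
lam = 0.1
analytic = total_grad(w, lam)
numeric = numerical_grad(lambda v: total_loss(v, lam), w)
print("analytic:", analytic)
print("numeric: ", numeric)
```

The two gradients should agree up to numerical precision, confirming the extra $\lambda w_i$ term.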
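When we update the weight $w_i$ using standard Stochastic Gradient Descent (SGD) with learning rate $\eta$ (eta), the update rule becomes:

$$w_i \leftarrow w_i - \eta \left( \frac{\partial L_{original}}{\partial w_i} + \lambda w_i \right)$$

Rearranging this slightly gives:

$$w_i \leftarrow (1 - \eta\lambda)\, w_i - \eta \frac{\partial L_{original}}{\partial w_i}$$

In this specific case, L2 regularization results in Weight Decay. In each update step, before subtracting the original gradient component ($\eta \frac{\partial L_{original}}{\partial w_i}$), the weight $w_i$ is multiplied by $(1 - \eta\lambda)$. Since both the learning rate $\eta$ and the regularization strength $\lambda$ are positive, and their product is typically much smaller than 1, this factor is slightly less than 1. Each weight is effectively shrunk, or decayed, towards zero during every update.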
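The equivalence for plain SGD can be checked numerically. In this sketch, one function takes a gradient step on the L2-regularized loss (assuming a penalty of $\frac{\lambda}{2}\sum_i w_i^2$), and the other first shrinks each weight by $(1 - \eta\lambda)$ and then applies the unregularized gradient step. The function names and values are hypothetical.

```python
def sgd_step_l2(w, grad, lr, lam):
    """SGD step on the regularized loss: the gradient includes lam * w_i."""
    return [wi - lr * (gi + lam * wi) for wi, gi in zip(w, grad)]


def sgd_step_weight_decay(w, grad, lr, lam):
    """Shrink each weight by (1 - lr * lam), then apply the plain gradient step."""
    decay = 1.0 - lr * lam
    return [decay * wi - lr * gi for wi, gi in zip(w, grad)]


# Hypothetical weights and gradient of the unregularized loss.
w = [0.5, -2.0, 3.0]
grad = [0.1, -0.3, 0.2]
lr, lam = 0.1, 0.01

via_l2 = sgd_step_l2(w, grad, lr, lam)
via_decay = sgd_step_weight_decay(w, grad, lr, lam)
print("L2 penalty step:  ", via_l2)
print("weight decay step:", via_decay)

# With a zero data gradient, only the shrinkage remains.
shrunk = sgd_step_weight_decay(w, [0.0, 0.0, 0.0], lr, lam)
print("decay only:       ", shrunk)
```

Both update rules produce the same weights up to floating-point rounding, and with a zero data gradient every weight is simply scaled by $1 - \eta\lambda = 0.999$.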
It is important to note that while L2 regularization (adding a penalty to the loss) and weight decay (shrinking the weights directly in the update step) are identical for plain SGD, they differ when using adaptive optimizers like Adam. In those cases, the gradient of the L2 penalty is rescaled by the per-parameter adaptive learning rates along with the rest of the gradient, whereas true (decoupled) weight decay is applied independently of the gradient-based update.
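To make the difference concrete, here is a simplified scalar version of the Adam update written in plain Python, not a library implementation. With `decoupled=False` the term $\lambda w$ is folded into the gradient before Adam's rescaling; with `decoupled=True` the weight is shrunk directly, outside the adaptive step. The data gradient is deliberately set to zero, an artificial assumption that isolates the effect of regularization.

```python
import math


def adam_run(w, grad_fn, steps, lr=0.1, lam=0.1, decoupled=False,
             beta1=0.9, beta2=0.999, eps=1e-8):
    """Run a simplified scalar Adam loop with either L2 or decoupled decay."""
    m = 0.0  # first-moment (mean) estimate
    v = 0.0  # second-moment (uncentered variance) estimate
    for t in range(1, steps + 1):
        g = grad_fn(w)
        if not decoupled:
            # L2 penalty: lam * w joins the gradient and gets rescaled by Adam.
            g += lam * w
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias correction
        v_hat = v / (1 - beta2 ** t)
        if decoupled:
            # Decoupled decay: shrink the weight directly, independent of m and v.
            w -= lr * lam * w
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


def zero_grad(w):
    """Hypothetical data gradient of zero, to isolate the regularization effect."""
    return 0.0


w_l2 = adam_run(1.0, zero_grad, steps=1, decoupled=False)
w_decoupled = adam_run(1.0, zero_grad, steps=1, decoupled=True)
print("Adam + L2 penalty:       ", w_l2)
print("Adam + decoupled decay:  ", w_decoupled)
```

After a single step from $w = 1$, the L2 variant moves the weight by roughly the full learning rate (to about 0.9) because Adam normalizes the gradient magnitude, while decoupled decay multiplies it by $1 - \eta\lambda = 0.99$. The shrinkage under the L2 variant therefore no longer scales with $\lambda$ in the way the SGD analysis suggests.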
The practical consequence of this mechanism is that the optimization process favors solutions where weight magnitudes are smaller and more evenly distributed. It discourages any single weight from growing excessively large. Models with smaller weights tend to exhibit smoother behavior. Imagine fitting a curve to data points; a model with large weights might produce a very "wiggly" curve that hits every training point perfectly but behaves erratically elsewhere. A model trained with L2 regularization would prefer a smoother curve that captures the general trend, even if it doesn't fit every training point exactly. This smoothness often translates directly to better generalization on unseen data.
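The strength of this effect is controlled by the hyperparameter $\lambda$. Setting $\lambda = 0$ recovers the original, unregularized loss. A small $\lambda$ applies only a mild penalty, while a large $\lambda$ shrinks the weights aggressively and can lead to underfitting if the model becomes too constrained to capture the underlying pattern.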
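The effect of the regularization strength is easiest to see in a case with a closed-form solution. The sketch below assumes a one-weight linear model $\hat{y} = w x$ trained with the loss $\frac{1}{2}\sum_j (w x_j - y_j)^2 + \frac{\lambda}{2} w^2$, whose minimizer is $w^* = \frac{\sum_j x_j y_j}{\sum_j x_j^2 + \lambda}$. The data points are made up for illustration.

```python
def ridge_weight_1d(xs, ys, lam):
    """Closed-form minimizer of 0.5 * sum((w*x - y)^2) + 0.5 * lam * w^2."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


# Hypothetical data lying roughly on the line y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]

lams = [0.0, 1.0, 10.0, 100.0]
fitted = [ridge_weight_1d(xs, ys, lam) for lam in lams]
for lam, w in zip(lams, fitted):
    print(f"lambda = {lam:6.1f} -> w = {w:.4f}")
```

As the regularization strength grows, the fitted weight moves steadily from the least-squares solution (about 2.04) towards zero, eventually far below the slope the data suggests, which is the underfitting regime.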
In summary, L2 regularization modifies the loss function to penalize the sum of squared weights. For standard SGD, this is mathematically equivalent to weight decay: weights shrink towards zero at every update, which leads to simpler, smoother models that are less prone to overfitting.
© 2026 ApX Machine Learning