Now that we understand the intuition behind penalizing large weights, let's formalize L2 regularization mathematically. The core idea is to modify the model's objective function, the function we aim to minimize during training, by adding a term that represents the penalty for large weights.
Typically, when training a neural network, we minimize a loss function, often denoted as $J(\theta)$, which measures the discrepancy between the model's predictions and the true target values. Common examples include Mean Squared Error (MSE) for regression or Cross-Entropy Loss for classification. Here, $\theta$ represents all the learnable parameters (weights and biases) in the network.
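For instance, here is a minimal NumPy sketch of the MSE loss; the array values are purely illustrative:

```python
import numpy as np

def mse_loss(y_pred, y_true):
    """Mean Squared Error: average squared difference between predictions and targets."""
    return np.mean((y_pred - y_true) ** 2)

# Illustrative predictions and targets
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.9, 3.2])
print(mse_loss(y_pred, y_true))  # 0.02
```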
L2 regularization adds a penalty term to this original loss function. This penalty is proportional to the squared magnitude of the weights. For a network with weight matrices $W^{[1]}, W^{[2]}, \ldots, W^{[L]}$ for layers 1 to $L$, the L2 regularization term is calculated as the sum of the squared Frobenius norms of these weight matrices:
$$\text{L2 Penalty} = \frac{\lambda}{2m} \sum_{l=1}^{L} \left\| W^{[l]} \right\|_F^2$$

Let's break this down:

- $\lambda$ (lambda) is the regularization strength, a hyperparameter that controls how heavily large weights are penalized.
- $m$ is the number of training examples in the batch.
- $\left\| W^{[l]} \right\|_F^2$ is the squared Frobenius norm of the layer-$l$ weight matrix: the sum of the squares of all its entries, $\sum_p \sum_q \left( w_{pq}^{[l]} \right)^2$.
- The factor of $\frac{1}{2}$ is included for convenience; it cancels the 2 that appears when differentiating the squared terms.
The new, regularized loss function, let's call it $J_{reg}(\theta)$, becomes:
$$J_{reg}(\theta) = J(\theta) + \frac{\lambda}{2m} \sum_{l=1}^{L} \left\| W^{[l]} \right\|_F^2$$

Note: Typically, bias terms (like $b^{[l]}$) are not included in the regularization penalty. While regularizing biases is possible, it often has a negligible effect on model complexity compared to regularizing weights and is commonly omitted in practice.
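Below is a minimal NumPy sketch of how this penalty and the regularized loss could be computed; the weight shapes, `lam`, `m`, and the `data_loss` value are illustrative placeholders:

```python
import numpy as np

def l2_penalty(weights, lam, m):
    """L2 penalty: (lam / 2m) * sum of squared Frobenius norms of the weight matrices.
    Bias vectors are intentionally excluded, as noted above."""
    return (lam / (2 * m)) * sum(np.sum(W ** 2) for W in weights)

def regularized_loss(data_loss, weights, lam, m):
    """J_reg(theta) = J(theta) + L2 penalty."""
    return data_loss + l2_penalty(weights, lam, m)

# Illustrative weights for a small two-layer network
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(1, 4))]
print(regularized_loss(data_loss=0.35, weights=weights, lam=0.1, m=64))
```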
How does this additional term affect the training process via gradient descent? During backpropagation, we calculate the gradient of the loss function with respect to each parameter. With the L2 penalty added, the gradient of the regularized loss $J_{reg}$ with respect to a specific weight $w_{ij}^{[l]}$ (the weight connecting neuron $j$ in layer $l-1$ to neuron $i$ in layer $l$) is:
$$\frac{\partial J_{reg}}{\partial w_{ij}^{[l]}} = \frac{\partial J(\theta)}{\partial w_{ij}^{[l]}} + \frac{\partial}{\partial w_{ij}^{[l]}} \left( \frac{\lambda}{2m} \sum_{k=1}^{L} \left\| W^{[k]} \right\|_F^2 \right)$$

The derivative of the original loss term, $\frac{\partial J(\theta)}{\partial w_{ij}^{[l]}}$, is calculated as usual via backpropagation. The derivative of the L2 penalty term with respect to $w_{ij}^{[l]}$ is:
$$\frac{\partial}{\partial w_{ij}^{[l]}} \left( \frac{\lambda}{2m} \sum_{k=1}^{L} \sum_p \sum_q \left( w_{pq}^{[k]} \right)^2 \right) = \frac{\lambda}{2m} \cdot \frac{\partial}{\partial w_{ij}^{[l]}} \left( w_{ij}^{[l]} \right)^2 = \frac{\lambda}{2m} \cdot 2 w_{ij}^{[l]} = \frac{\lambda}{m} w_{ij}^{[l]}$$

(Only the single term containing $w_{ij}^{[l]}$ has a nonzero derivative, and note how the factor of 2 cancelled out conveniently.)
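As a quick sanity check, a central-difference estimate of the penalty's derivative with respect to a single weight should match $\frac{\lambda}{m} w_{ij}^{[l]}$. The sketch below uses one weight matrix and illustrative values for $\lambda$ and $m$:

```python
import numpy as np

lam, m, eps = 0.1, 64, 1e-6
rng = np.random.default_rng(1)
W = rng.normal(size=(4, 3))

def penalty(W):
    """L2 penalty for a single weight matrix: (lam / 2m) * sum of squared entries."""
    return (lam / (2 * m)) * np.sum(W ** 2)

i, j = 2, 1
W_plus, W_minus = W.copy(), W.copy()
W_plus[i, j] += eps
W_minus[i, j] -= eps

numerical = (penalty(W_plus) - penalty(W_minus)) / (2 * eps)  # central difference
analytical = (lam / m) * W[i, j]                              # the derived lambda/m * w_ij
print(np.isclose(numerical, analytical))                      # True (up to floating-point error)
```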
So, the full gradient for a weight $w_{ij}^{[l]}$ is:
$$\frac{\partial J_{reg}}{\partial w_{ij}^{[l]}} = \frac{\partial J(\theta)}{\partial w_{ij}^{[l]}} + \frac{\lambda}{m} w_{ij}^{[l]}$$

Now, let's look at the gradient descent update rule for this weight, using a learning rate $\alpha$:
$$w_{ij}^{[l]} := w_{ij}^{[l]} - \alpha \frac{\partial J_{reg}}{\partial w_{ij}^{[l]}}$$

Substituting the gradient we found:
$$w_{ij}^{[l]} := w_{ij}^{[l]} - \alpha \left( \frac{\partial J(\theta)}{\partial w_{ij}^{[l]}} + \frac{\lambda}{m} w_{ij}^{[l]} \right)$$

We can rearrange this slightly:
$$w_{ij}^{[l]} := w_{ij}^{[l]} - \frac{\alpha \lambda}{m} w_{ij}^{[l]} - \alpha \frac{\partial J(\theta)}{\partial w_{ij}^{[l]}}$$

$$w_{ij}^{[l]} := w_{ij}^{[l]} \left( 1 - \frac{\alpha \lambda}{m} \right) - \alpha \frac{\partial J(\theta)}{\partial w_{ij}^{[l]}}$$

Look closely at the factor $\left( 1 - \frac{\alpha \lambda}{m} \right)$. Since the learning rate $\alpha$ and the regularization strength $\lambda$ are positive, and $m$ is the batch size (also positive), this factor is slightly less than 1 (assuming $\alpha \lambda / m$ is small enough, which it usually is for stable training).
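The two forms of the update are algebraically identical. Here is a small sketch, with illustrative values for $\alpha$, $\lambda$, and $m$, confirming they produce the same result:

```python
import numpy as np

alpha, lam, m = 0.01, 0.1, 64
rng = np.random.default_rng(2)
W = rng.normal(size=(4, 3))
dW_data = rng.normal(size=(4, 3))  # stands in for dJ/dW from backpropagation

# Form 1: subtract the full regularized gradient
W_form1 = W - alpha * (dW_data + (lam / m) * W)

# Form 2: "weight decay" view, shrink W first, then apply the data gradient
W_form2 = W * (1 - alpha * lam / m) - alpha * dW_data

print(np.allclose(W_form1, W_form2))  # True
```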
This equation reveals why L2 regularization is often called weight decay. In each update step, before applying the gradient from the original loss, the weight $w_{ij}^{[l]}$ is multiplied by a factor slightly less than 1. This effectively shrinks or "decays" the weight towards zero at each step. The larger the weight, the larger the decay effect (due to the $\frac{\lambda}{m} w_{ij}^{[l]}$ term in the gradient). This mechanism discourages weights from growing too large, fulfilling the intuition we discussed earlier: keeping weights small helps to simplify the model and improve generalization.
The hyperparameter $\lambda$ directly controls the rate of this decay. A higher $\lambda$ leads to faster decay and smaller resulting weights. Finding the right value for $\lambda$ is a process of hyperparameter tuning, often done using a validation set.
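As a sketch of what that tuning might look like, the example below runs a simple grid search over candidate $\lambda$ values on a small synthetic regression problem; a linear model trained by gradient descent stands in for a full neural network, and all values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)

# Tiny synthetic regression problem with a train/validation split
X = rng.normal(size=(200, 10))
true_w = rng.normal(size=10)
y = X @ true_w + 0.5 * rng.normal(size=200)
X_tr, y_tr, X_val, y_val = X[:150], y[:150], X[150:], y[150:]

def train_l2(X, y, lam, alpha=0.1, steps=500):
    """Gradient descent on MSE plus an L2 penalty of strength lam (linear model for brevity)."""
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(steps):
        grad = (2 / m) * X.T @ (X @ w - y) + (lam / m) * w  # data gradient + (lam/m) * w
        w -= alpha * grad
    return w

best_lam, best_val = None, float("inf")
for lam in [0.0, 0.01, 0.1, 1.0, 10.0]:
    w = train_l2(X_tr, y_tr, lam)
    val = np.mean((X_val @ w - y_val) ** 2)  # validation MSE
    if val < best_val:
        best_lam, best_val = lam, val

print(f"Selected lambda: {best_lam} (validation MSE {best_val:.4f})")
```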