Okay, let's formalize the intuition behind L1 regularization. We modify the standard loss function by adding a penalty term proportional to the sum of the absolute values of all the weights in the network.
Recall that our goal during training is typically to minimize a data loss function, let's call it $L_{\text{data}}(W)$, which measures how well the model's predictions match the actual target values given the current weights $W$. With L1 regularization, we add a new term to this objective:
$$L_{\text{total}}(W) = L_{\text{data}}(W) + \lambda \sum_i |w_i|$$

Let's break down this new term: the sum $\sum_i |w_i|$ runs over every weight in the network, adding up their absolute values, and $\lambda$ (lambda) is a hyperparameter controlling the strength of the regularization, with larger values penalizing large weights more heavily.
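To make this concrete, here is a minimal sketch of how the combined objective might be computed in PyTorch. The model, the data loss, and the value of `lambda_l1` are illustrative stand-ins, not prescribed choices:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)   # stand-in for any network
criterion = nn.MSELoss()   # stand-in for the data loss
lambda_l1 = 1e-4           # illustrative regularization strength

def total_loss(inputs, targets):
    data_loss = criterion(model(inputs), targets)
    # Sum of absolute values of every parameter in the network
    # (this version also penalizes biases; in practice you might
    # restrict the sum to weight matrices only)
    l1_penalty = sum(p.abs().sum() for p in model.parameters())
    return data_loss + lambda_l1 * l1_penalty

# Usage with random data, just to show the shapes involved
loss = total_loss(torch.randn(8, 10), torch.randn(8, 1))
loss.backward()
```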
The core difference compared to L2 regularization lies in the use of the absolute value $|w_i|$ instead of the squared value $w_i^2$. This seemingly small change has a significant impact on the optimization process.
During backpropagation, we need to compute the gradient of the total loss $L_{\text{total}}$ with respect to each weight $w_j$ to update it. The gradient of the L1 penalty term with respect to a specific weight $w_j$ is:
$$\frac{\partial}{\partial w_j}\left(\lambda \sum_i |w_i|\right) = \lambda \cdot \frac{\partial |w_j|}{\partial w_j}$$

The derivative of the absolute value function $|x|$ is $\text{sign}(x)$, which is:

$$\text{sign}(x) = \begin{cases} +1 & \text{if } x > 0 \\ -1 & \text{if } x < 0 \\ \text{undefined} & \text{if } x = 0 \end{cases}$$
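As a quick numerical check, NumPy's `np.sign` implements this function. Note that it returns 0 at exactly zero, which corresponds to one valid choice from the subgradient interval discussed later in this section:

```python
import numpy as np

w = np.array([-2.0, -0.5, 0.0, 0.5, 3.0])
# np.sign returns -1, 0, or +1 elementwise; the 0 at w = 0 is one
# valid choice from the subgradient interval [-1, 1]
print(np.sign(w))  # [-1. -1.  0.  1.  1.]
```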
So, the gradient of the total loss becomes:
$$\frac{\partial L_{\text{total}}}{\partial w_j} = \frac{\partial L_{\text{data}}}{\partial w_j} + \lambda \cdot \text{sign}(w_j) \quad \text{for } w_j \neq 0$$

This means the L1 penalty adds a constant value ($\lambda$ or $-\lambda$) to the gradient of the data loss, pushing the weight toward zero regardless of the weight's current magnitude (as long as it's not zero). This constant push is what drives weights to become exactly zero. Compare this to L2 regularization, where the penalty's contribution to the gradient ($\lambda \cdot 2w_j$) decreases as the weight gets smaller, making it less likely to reach precisely zero.
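A small numerical sketch (with illustrative values) makes the contrast visible: the L1 "push" stays constant while the L2 push shrinks along with the weight:

```python
import numpy as np

lam = 0.01  # illustrative regularization strength
weights = np.array([1.0, 0.1, 0.01, 0.001])

l1_push = lam * np.sign(weights)  # constant magnitude: always ±lam
l2_push = lam * 2 * weights       # shrinks as the weight shrinks

for w, g1, g2 in zip(weights, l1_push, l2_push):
    print(f"w={w:<6}  L1 push={g1:.5f}  L2 push={g2:.7f}")
```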
What happens when $w_j = 0$? The absolute value function isn't differentiable at this point. In practice, optimization algorithms handle this using techniques related to subgradient descent or proximal gradient methods. The subgradient of $|w_j|$ at $w_j = 0$ is the interval $[-1, 1]$. A common practical approach is to simply set the gradient contribution from the L1 term to zero if the weight is already zero, or to apply a "soft thresholding" operation during the update step, which effectively checks whether the update would cross zero and sets the weight to exactly zero if it would. Most deep learning framework optimizers handle this detail internally.
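Here is a minimal NumPy sketch of that soft-thresholding update. The function name and parameter values are illustrative, and a real optimizer would combine this with the gradient step on the data loss:

```python
import numpy as np

def soft_threshold(w, lr, lam):
    # Shrink each weight toward zero by lr * lam, snapping to
    # exactly zero when the shrink would overshoot (cross zero)
    return np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)

w = np.array([0.3, -0.003, 0.05])
# With lr * lam = 0.005, the middle weight lands exactly on zero,
# while the others are merely shrunk toward it
print(soft_threshold(w, lr=0.1, lam=0.05))
```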
The essential takeaway is that the L1 penalty provides a constant "push" towards zero for non-zero weights, making it highly effective at producing sparse models where many weights become exactly zero.
The L1 penalty forms a 'V' shape, applying a constant gradient magnitude (slope) regardless of the weight's size when it's non-zero. The L2 penalty forms a parabola, with a gradient that decreases as the weight approaches zero.
This mathematical structure directly leads to the sparsity-inducing property discussed earlier. In the next sections, we'll compare L1 and L2 more directly and look at how to implement them in code.