While L2 regularization penalizes the squared magnitude of weights, L1 regularization takes a different approach by penalizing the absolute magnitude of the weights. This seemingly small change has a significant consequence: L1 regularization promotes sparsity in the weight vectors, meaning it encourages many weights to become exactly zero.
Similar to L2, L1 regularization adds a penalty term to the original loss function $L_{\text{original}}(W)$. The total loss function $L_{\text{total}}$ becomes:

$$L_{\text{total}}(W) = L_{\text{original}}(W) + \lambda \sum_i |w_i|$$

Here, $W$ represents all the weights in the network, $w_i$ is an individual weight, and $\lambda$ is the regularization hyperparameter controlling the strength of the penalty. The term $\sum_i |w_i|$ is the L1 norm of the weight vector.
During backpropagation, the gradient calculation for each weight $w_i$ includes an additional term derived from the L1 penalty. For non-zero weights, this term is simply $\lambda \cdot \text{sign}(w_i)$, where $\text{sign}(w_i)$ is +1 if $w_i$ is positive and -1 if $w_i$ is negative. With learning rate $\eta$, each gradient step therefore subtracts a constant amount $\eta\lambda$ from positive weights and adds a constant amount $\eta\lambda$ to negative weights, pushing them toward zero at a rate independent of their magnitude.
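The penalty's contribution to the gradient can be sketched in a few lines of NumPy. The value of `lam` here is illustrative, and the convention of using 0 as the subgradient at exactly zero is one common choice:

```python
import numpy as np

# Illustrative regularization strength (lambda).
lam = 0.01
w = np.array([0.5, -1.2, 0.0, 2.3])

# For nonzero weights, the L1 term contributes lam * sign(w) to the gradient.
# At exactly zero, any value in [-lam, lam] is a valid subgradient;
# np.sign conveniently returns 0 there.
l1_grad = lam * np.sign(w)
```

Note that the magnitude of each nonzero entry of `l1_grad` is the same constant $\lambda$, unlike the L2 case where the gradient term $2\lambda w_i$ shrinks along with the weight itself.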
The key difference between L1 and L2 regularization lies in how they influence the optimization process. Imagine trying to minimize the original loss function subject to a constraint on the size of the weights, where the "size" is defined by either the L1 or L2 norm.
Consider the optimization process visually. The optimal weights occur where the level curves of the original loss function first touch the constraint region defined by the regularization penalty.
Comparison of L1 (diamond, pink) and L2 (circle, blue) constraint boundaries for a fixed penalty budget. Optimization seeks a point where the loss contours (gray ellipses, representing lower loss towards the center) first touch the constraint boundary. Because the L1 boundary has sharp corners aligned with the axes, the contact point is often on an axis, meaning one of the weights is exactly zero. The L2 boundary's smoothness makes such exact zero outcomes less probable.
The "corners" of the L1 diamond lie on the axes (where one weight is zero and the others determine the point). Because the loss function contours are often elliptical, they are more likely to first touch the L1 diamond at one of these corners compared to touching the smooth L2 circle at a point where no weight is exactly zero. The constant subtractive/additive nature of the L1 gradient also strongly pushes small weights towards zero and keeps them there.
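This difference in update dynamics can be seen on a single weight with the data-loss gradient held at zero, so that only the penalty acts. The learning rate and penalty strength below are illustrative, and clipping the L1 step when it would overshoot zero is one common way exact zeros are realized in practice:

```python
# Compare how L1 and L2 penalties shrink a single weight over repeated updates.
lr, lam = 0.1, 0.5
w_l1, w_l2 = 1.0, 1.0

for _ in range(100):
    # L1: constant step of size lr*lam toward zero; clip to zero when the
    # step would cross it, yielding an exact zero in finitely many updates.
    step = lr * lam
    w_l1 = 0.0 if abs(w_l1) <= step else w_l1 - step * (1 if w_l1 > 0 else -1)
    # L2: multiplicative shrinkage; the weight decays but never reaches zero.
    w_l2 = w_l2 - lr * lam * w_l2
```

After these updates `w_l1` is exactly `0.0`, while `w_l2` is small but still strictly positive, matching the geometric picture above.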
When a weight $w_i$ connected to an input feature $x_i$ becomes exactly zero, that feature no longer contributes to the neuron's output ($w_i x_i = 0$). In effect, L1 regularization performs an automatic form of feature selection, identifying and nullifying the influence of less important features.
This can be particularly useful in scenarios with very high-dimensional input data where many features might be irrelevant or redundant. By forcing the weights of irrelevant features to zero, L1 regularization can simplify the model, make it easier to interpret which inputs actually matter, and reduce the risk of overfitting to noise in uninformative features.
However, if all input features are expected to contribute at least somewhat, the aggressive feature selection of L1 might be detrimental, and L2 regularization, which shrinks weights without necessarily zeroing them out, might be preferred. L2 is generally more commonly used as a default regularization technique in deep learning, while L1 is considered when sparsity is a specific goal.
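The feature-selection effect can be demonstrated on a small synthetic regression problem. The sketch below uses proximal gradient descent (ISTA), in which the L1 penalty is applied via a soft-thresholding step that sets sufficiently small weights exactly to zero; the data, `lam`, and `lr` are all assumed values chosen for illustration:

```python
import numpy as np

# Synthetic data: only the first two of four features influence the target.
rng = np.random.default_rng(0)
n, d = 200, 4
X = rng.standard_normal((n, d))
w_true = np.array([3.0, -2.0, 0.0, 0.0])  # features 2 and 3 are irrelevant
y = X @ w_true

lam, lr = 0.1, 0.1
w = np.zeros(d)
for _ in range(500):
    grad = X.T @ (X @ w - y) / n          # gradient of the squared-error loss
    w = w - lr * grad
    # Soft-thresholding: the proximal step for the L1 penalty, which
    # zeroes any weight whose magnitude falls below lr * lam.
    w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)
```

At convergence the weights for the two irrelevant features are exactly zero, while the informative weights land close to their true values (shrunk slightly by the penalty).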
Implementing L1 regularization is typically straightforward in deep learning frameworks, often involving setting a parameter in the layer definition or the optimizer, as we will see in the practical sections.
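As one concrete illustration, in PyTorch a common pattern is to compute the L1 penalty manually and add it to the data loss before calling `backward()`; the model, data, and `lam` below are placeholders, and excluding biases from the penalty is a frequent (but not universal) convention:

```python
import torch
import torch.nn as nn

# Toy model and batch purely for illustration.
torch.manual_seed(0)
model = nn.Linear(10, 1)
lam = 1e-3  # assumed regularization strength

x = torch.randn(32, 10)
y = torch.randn(32, 1)

data_loss = nn.MSELoss()(model(x), y)

# L1 penalty on the weight matrix only (biases excluded here).
l1_penalty = lam * model.weight.abs().sum()
total_loss = data_loss + l1_penalty
total_loss.backward()  # gradients now include the lam * sign(w) term
```

Other frameworks expose equivalent knobs, such as per-layer regularizer arguments, so the exact mechanism depends on the library in use.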
© 2025 ApX Machine Learning