Now that you understand the motivation behind penalizing large weights, let's compare the two most common weight regularization techniques: L1 and L2 regularization. Both modify the loss function by adding a penalty term, but the nature of this penalty leads to significantly different outcomes.
Recall the standard loss function for training a neural network, often denoted $L_{data}(\theta)$, where $\theta$ represents the model's parameters (weights and biases).
L2 Regularization (Weight Decay): Adds a penalty proportional to the square of the magnitude of the weights: $L_{total}(\theta) = L_{data}(\theta) + \frac{\lambda}{2}\sum_i w_i^2$. Here, $w_i$ represents each weight in the network (biases are often excluded), and $\lambda$ is the regularization strength hyperparameter. The $\frac{1}{2}$ factor is included for mathematical convenience when taking the derivative. The term $\sum_i w_i^2$ is the squared L2 norm of the weight vector.
L1 Regularization: Adds a penalty proportional to the absolute values of the weights: $L_{total}(\theta) = L_{data}(\theta) + \lambda\sum_i |w_i|$. Again, $\lambda$ controls the strength of the regularization. The term $\sum_i |w_i|$ is the L1 norm of the weight vector.
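To make these two penalty terms concrete, here is a minimal sketch that computes both for a small weight tensor. The tensor values and the variable names (weights, lam) are illustrative only, not part of any library API.

import torch

# Illustrative weight tensor and regularization strength.
weights = torch.tensor([0.5, -1.2, 0.0, 3.0])
lam = 0.01

# (lambda / 2) * sum(w_i^2): the L2 penalty term
l2_penalty = 0.5 * lam * torch.sum(weights ** 2)

# lambda * sum(|w_i|): the L1 penalty term
l1_penalty = lam * torch.sum(torch.abs(weights))

print(f"L2 penalty: {l2_penalty.item():.4f}")  # about 0.053
print(f"L1 penalty: {l1_penalty.item():.4f}")  # about 0.047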
The core difference lies in how these penalties affect the weights during gradient descent.
L2 Regularization: The gradient of the L2 penalty term with respect to a weight $w_i$ is $\lambda w_i$. This means the update rule effectively includes a term that shrinks each weight toward zero in proportion to its magnitude: $w_i \leftarrow w_i - \eta\left(\frac{\partial L_{data}}{\partial w_i} + \lambda w_i\right)$, where $\eta$ is the learning rate. This continual shrinking effect is why L2 is often called "weight decay". It encourages weights to be small and distributed more evenly, rarely forcing them to be exactly zero unless the data gradient perfectly cancels the decay term. It prefers a diffuse distribution of small weights.
L1 Regularization: The gradient of the L1 penalty term with respect to $w_i$ is $\lambda \cdot \text{sign}(w_i)$, where $\text{sign}(w_i)$ is $+1$ if $w_i > 0$, $-1$ if $w_i < 0$, and (technically) undefined at $w_i = 0$. In practice, implementations often use a subgradient (such as 0) at $w_i = 0$. The key point is that the penalty subtracts a constant amount ($\eta\lambda$) from a positive weight's update (or adds it, if the weight is negative), regardless of the weight's current magnitude, as long as it is non-zero. This constant push toward zero is much more effective at making weights exactly zero than L2's proportional push, leading to sparse weight vectors in which many weights become zero.
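The difference is easy to see with a single manual update step. The sketch below applies each penalty's gradient to the same weights; the values of w, data_grad, lam, and lr are made up for this example, and torch.sign conveniently returns 0 at exactly zero, which serves as the subgradient.

import torch

# Illustrative weights and a stand-in for the data-loss gradient.
w = torch.tensor([0.8, -0.5, 0.01])
data_grad = torch.tensor([0.1, -0.2, 0.0])
lam, lr = 0.1, 0.5

# L2 step: the penalty gradient lam * w shrinks each weight
# in proportion to its own magnitude.
w_l2 = w - lr * (data_grad + lam * w)

# L1 step: the penalty (sub)gradient lam * sign(w) pushes every
# non-zero weight toward zero by the same fixed amount.
w_l1 = w - lr * (data_grad + lam * torch.sign(w))

print(w_l2)  # [0.7100, -0.3750, 0.0095]: the small weight barely changes
print(w_l1)  # [0.7000, -0.3500, -0.0400]: the small weight is pushed past zero

In practice, optimizers that target exact sparsity use soft-thresholding (a proximal step) so that small weights land exactly on zero rather than oscillating around it.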
Imagine a simple scenario with only two weights, $w_1$ and $w_2$. The regularization term constrains the possible values of these weights.
The optimization process tries to find weights within this region that also minimize the original data loss $L_{data}$. The optimal solution often occurs where the contour lines of the data loss first touch the constraint region.
Geometric interpretation of L1 (diamond) and L2 (circle) constraint regions in a two-dimensional weight space. The corners of the L1 diamond lie on the axes, making solutions where one weight is zero more likely when loss contours intersect the boundary.
As the diagram suggests, the corners of the L1 diamond lie on the axes ($w_1 = 0$ or $w_2 = 0$). Unless the loss contours happen to be perfectly aligned with the constraint boundary, the intersection point is more likely to fall at one of these corners, producing a sparse solution. The L2 circle, being smooth, has no corners, so intersections tend to occur where both weights are non-zero.
The most significant practical difference is sparsity. L1 regularization performs implicit feature selection by driving the weights of less important features to exactly zero. If a weight connecting an input feature to the network becomes zero, that feature effectively has no influence on the output for that connection path. This can be advantageous in high-dimensional spaces where many features might be irrelevant or redundant.
L2 regularization makes weights small, but usually not exactly zero. It reduces the influence of all features generally but doesn't eliminate them.
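If you want to verify this effect on your own models, a quick check is to count how many weights are effectively zero after training. The helper below is a small sketch; model_l1 and model_l2 are hypothetical stand-ins for layers you have already trained with L1 and L2 penalties, and the 1e-3 threshold for "effectively zero" is an arbitrary choice.

import torch
import torch.nn as nn

def weight_sparsity(layer: nn.Linear, tol: float = 1e-3) -> float:
    # Fraction of weights whose magnitude falls below tol.
    w = layer.weight.detach()
    return (w.abs() < tol).float().mean().item()

# Hypothetical stand-ins for layers trained with L1 and L2 penalties.
model_l1 = nn.Linear(50, 20)
model_l2 = nn.Linear(50, 20)

print(f"Sparsity (L1-trained): {weight_sparsity(model_l1):.1%}")  # typically large after L1 training
print(f"Sparsity (L2-trained): {weight_sparsity(model_l2):.1%}")  # typically near 0% after L2 training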
L2 regularization is generally computationally simpler. Its penalty term has a smooth derivative, making it straightforward to incorporate into standard gradient descent algorithms. Many optimizers (such as Adam and SGD) include a built-in weight_decay parameter that directly implements L2 regularization.
L1 regularization's penalty term is non-differentiable at zero. While this isn't an insurmountable problem (techniques such as subgradients or iterative soft-thresholding exist), it means L1 is not always available as a built-in optimizer option the way L2's weight_decay is. Often, you need to explicitly add the L1 penalty term to your loss calculation before backpropagation, as shown in the code example below.
import torch
import torch.nn as nn
import torch.optim as optim

# Example layer and synthetic data
linear_layer = nn.Linear(50, 20)
inputs = torch.randn(64, 50)   # Batch of data
targets = torch.randn(64, 20)

# --- Applying L2 Regularization (Weight Decay) ---
# Simply use the weight_decay argument in the optimizer
optimizer_l2 = optim.Adam(linear_layer.parameters(), lr=0.001, weight_decay=1e-4)

# --- Applying L1 Regularization ---
# L1 is typically added explicitly to the loss
optimizer_l1 = optim.Adam(linear_layer.parameters(), lr=0.001)  # No weight_decay
l1_lambda = 1e-5
criterion = nn.MSELoss()  # Example base loss function

# In your training loop:
optimizer_l1.zero_grad()
outputs = linear_layer(inputs)
loss = criterion(outputs, targets)

# Calculate the L1 penalty for parameters requiring gradients
l1_penalty = 0
for param in linear_layer.parameters():
    if param.requires_grad:
        l1_penalty = l1_penalty + torch.sum(torch.abs(param))

# Add the L1 penalty to the base loss
total_loss_l1 = loss + l1_lambda * l1_penalty
total_loss_l1.backward()
optimizer_l1.step()

print(f"Base Loss: {loss.item():.4f}")
print(f"L1 Penalty: {l1_penalty.item():.4f}")
print(f"Total Loss with L1: {total_loss_l1.item():.4f}")
Example code showing how L2 regularization is often handled via the optimizer's weight_decay parameter, while L1 regularization typically involves manually calculating the L1 norm of the weights and adding it to the primary loss function.
L2 Regularization: A sensible default when you want general weight shrinkage to combat overfitting and expect most features to carry at least some useful signal.
L1 Regularization: Worth considering when you suspect many input features are irrelevant or redundant and a sparse model with implicit feature selection would be beneficial.
Elastic Net: As discussed in the next section, Elastic Net combines L1 and L2. It can be useful when you want some of the feature selection properties of L1 but also the general weight shrinkage of L2, especially when dealing with correlated features.
Both L1 and L2 introduce a hyperparameter, λ, which controls the strength of the regularization. This value needs to be tuned, often using techniques like grid search or random search on a validation set, as its optimal value is problem-dependent. Choosing too large a λ can lead to underfitting (oversimplifying the model), while too small a λ may not provide sufficient regularization.
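As a concrete illustration of this tuning process, the sketch below runs a small grid search over candidate weight_decay (L2 strength) values on a synthetic regression problem and picks the value with the lowest validation loss. The data, model size, learning rate, and candidate grid are all illustrative choices, not recommendations.

import torch
import torch.nn as nn
import torch.optim as optim

# Synthetic regression data, split into training and validation sets.
torch.manual_seed(0)
X = torch.randn(200, 50)
y = X[:, :5].sum(dim=1, keepdim=True) + 0.1 * torch.randn(200, 1)
X_train, y_train = X[:150], y[:150]
X_val, y_val = X[150:], y[150:]

criterion = nn.MSELoss()

def train_and_validate(weight_decay, epochs=200):
    # Train a small linear model with the given L2 strength and
    # return its loss on the held-out validation set.
    model = nn.Linear(50, 1)
    optimizer = optim.Adam(model.parameters(), lr=0.01, weight_decay=weight_decay)
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = criterion(model(X_train), y_train)
        loss.backward()
        optimizer.step()
    with torch.no_grad():
        return criterion(model(X_val), y_val).item()

# Grid search over candidate regularization strengths.
candidates = [0.0, 1e-4, 1e-3, 1e-2, 1e-1]
results = {lam: train_and_validate(lam) for lam in candidates}
best_lambda = min(results, key=results.get)

print(results)
print(f"Best weight_decay on the validation set: {best_lambda}")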
In summary, while both L1 and L2 regularization aim to simplify models by penalizing large weights, they do so differently, leading to distinct model characteristics. L2 encourages small, diffuse weights (weight decay), while L1 encourages sparse weights (feature selection). Your choice between them depends on the characteristics of your data and whether feature selection is a desired outcome.