Now that you understand the motivation behind penalizing large weights, let's compare the two most common weight regularization techniques: L1 and L2 regularization. Both modify the loss function by adding a penalty term, but the nature of this penalty leads to significantly different outcomes.
Recall the standard loss function for training a neural network, often denoted $L_{data}(\theta)$, where $\theta$ represents the model's parameters (weights and biases).
L2 Regularization (Weight Decay): Adds a penalty proportional to the square of the magnitude of the weights: $L_{total}(\theta) = L_{data}(\theta) + \frac{\lambda}{2}\sum_i w_i^2$. Here, $w_i$ represents each weight in the network (biases are often excluded), and $\lambda$ is the regularization strength hyperparameter. The $\frac{1}{2}$ factor is included for mathematical convenience when taking the derivative. The term $\sum_i w_i^2$ is the squared L2 norm of the weight vector.
L1 Regularization: Adds a penalty proportional to the absolute values of the weights: $L_{total}(\theta) = L_{data}(\theta) + \lambda\sum_i |w_i|$. Again, $\lambda$ controls the strength of the regularization. The term $\sum_i |w_i|$ is the L1 norm of the weight vector.
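To make these two penalty terms concrete, here is a minimal sketch that computes both for a small weight tensor. The tensor values and the variable names (weights, lam) are illustrative only, not part of any library API.

import torch

# Illustrative weight tensor and regularization strength.
weights = torch.tensor([0.5, -1.2, 0.0, 3.0])
lam = 0.01

# (lambda / 2) * sum(w_i^2): the L2 penalty term
l2_penalty = 0.5 * lam * torch.sum(weights ** 2)

# lambda * sum(|w_i|): the L1 penalty term
l1_penalty = lam * torch.sum(torch.abs(weights))

print(f"L2 penalty: {l2_penalty.item():.4f}")  # about 0.053
print(f"L1 penalty: {l1_penalty.item():.4f}")  # about 0.047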
The core difference lies in how these penalties affect the weights during gradient descent.
L2 Regularization: The gradient of the L2 penalty term with respect to a weight $w_i$ is $\lambda w_i$. This means the update rule effectively includes a term that shrinks each weight toward zero in proportion to its magnitude: $w_i \leftarrow w_i - \eta\left(\frac{\partial L_{data}}{\partial w_i} + \lambda w_i\right)$, where $\eta$ is the learning rate. This continual shrinking effect is why L2 is often called "weight decay". It encourages weights to be small and distributed more evenly, rarely forcing them to be exactly zero unless the data gradient perfectly cancels the decay term. It prefers a diffuse distribution of small weights.
L1 Regularization: The gradient of the L1 penalty term with respect to $w_i$ is $\lambda \cdot \text{sign}(w_i)$, where $\text{sign}(w_i)$ is $+1$ if $w_i > 0$, $-1$ if $w_i < 0$, and (technically) undefined at $w_i = 0$. In practice, implementations often use a subgradient (such as 0) at $w_i = 0$. The key point is that the penalty subtracts a constant amount ($\eta\lambda$) from a positive weight's update (or adds it, if the weight is negative), regardless of the weight's current magnitude, as long as it is non-zero. This constant push toward zero is much more effective at making weights exactly zero than L2's proportional push, leading to sparse weight vectors in which many weights become zero.
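The difference is easy to see with a single manual update step. The sketch below applies each penalty's gradient to the same weights; the values of w, data_grad, lam, and lr are made up for this example, and torch.sign conveniently returns 0 at exactly zero, which serves as the subgradient.

import torch

# Illustrative weights and a stand-in for the data-loss gradient.
w = torch.tensor([0.8, -0.5, 0.01])
data_grad = torch.tensor([0.1, -0.2, 0.0])
lam, lr = 0.1, 0.5

# L2 step: the penalty gradient lam * w shrinks each weight
# in proportion to its own magnitude.
w_l2 = w - lr * (data_grad + lam * w)

# L1 step: the penalty (sub)gradient lam * sign(w) pushes every
# non-zero weight toward zero by the same fixed amount.
w_l1 = w - lr * (data_grad + lam * torch.sign(w))

print(w_l2)  # [0.7100, -0.3750, 0.0095]: the small weight barely changes
print(w_l1)  # [0.7000, -0.3500, -0.0400]: the small weight is pushed past zero

In practice, optimizers that target exact sparsity use soft-thresholding (a proximal step) so that small weights land exactly on zero rather than oscillating around it.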
Imagine a simple scenario with only two weights, $w_1$ and $w_2$. The regularization term constrains the possible values of these weights.
The optimization process tries to find weights within this region that also minimize the original data loss $L_{data}$. The optimal solution often occurs where the contour lines of the data loss first touch the constraint region.
Geometric interpretation of L1 (diamond) and L2 (circle) constraint regions in a two-dimensional weight space. The corners of the L1 diamond lie on the axes, making solutions where one weight is zero more likely when loss contours intersect the boundary.
As the diagram suggests, the corners of the L1 diamond lie on the axes ($w_1 = 0$ or $w_2 = 0$). Unless the loss contours happen to be perfectly aligned with the constraint boundary, the intersection point is more likely to fall at one of these corners, producing a sparse solution. The L2 circle, being smooth, has no corners, so intersections tend to occur where both weights are non-zero.
The most significant practical difference is sparsity. L1 regularization performs implicit feature selection by driving the weights of less important features to exactly zero. If a weight connecting an input feature to the network becomes zero, that feature effectively has no influence on the output for that connection path. This can be advantageous in high-dimensional spaces where many features might be irrelevant or redundant.
L2 regularization makes weights small, but usually not exactly zero. It reduces the influence of all features generally but doesn't eliminate them.
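If you want to verify this effect on your own models, a quick check is to count how many weights are effectively zero after training. The helper below is a small sketch; model_l1 and model_l2 are hypothetical stand-ins for layers you have already trained with L1 and L2 penalties, and the 1e-3 threshold for "effectively zero" is an arbitrary choice.

import torch
import torch.nn as nn

def weight_sparsity(layer: nn.Linear, tol: float = 1e-3) -> float:
    # Fraction of weights whose magnitude falls below tol.
    w = layer.weight.detach()
    return (w.abs() < tol).float().mean().item()

# Hypothetical stand-ins for layers trained with L1 and L2 penalties.
model_l1 = nn.Linear(50, 20)
model_l2 = nn.Linear(50, 20)

print(f"Sparsity (L1-trained): {weight_sparsity(model_l1):.1%}")  # typically large after L1 training
print(f"Sparsity (L2-trained): {weight_sparsity(model_l2):.1%}")  # typically near 0% after L2 training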
L2 regularization is generally computationally simpler. Its penalty term has a smooth derivative, making it straightforward to incorporate into standard gradient descent algorithms. Many optimizers (such as Adam and SGD) include a built-in weight_decay parameter that directly implements L2 regularization.
L1 regularization's penalty term is non-differentiable at zero. While this isn't an insurmountable problem (techniques such as subgradients or iterative soft-thresholding exist), it means L1 is not always available as a built-in optimizer option the way L2's weight_decay is. Often, you need to explicitly add the L1 penalty term to your loss calculation before backpropagation, as shown in the code example below.
import torch
import torch.nn as nn
import torch.optim as optim

# Example layer and synthetic data
linear_layer = nn.Linear(50, 20)
inputs = torch.randn(64, 50)   # Batch of data
targets = torch.randn(64, 20)

# --- Applying L2 Regularization (Weight Decay) ---
# Simply use the weight_decay argument in the optimizer
optimizer_l2 = optim.Adam(linear_layer.parameters(), lr=0.001, weight_decay=1e-4)

# --- Applying L1 Regularization ---
# L1 is typically added explicitly to the loss
optimizer_l1 = optim.Adam(linear_layer.parameters(), lr=0.001)  # No weight_decay
l1_lambda = 1e-5
criterion = nn.MSELoss()  # Example base loss function

# In your training loop:
optimizer_l1.zero_grad()
outputs = linear_layer(inputs)
loss = criterion(outputs, targets)

# Calculate the L1 penalty for parameters requiring gradients
l1_penalty = 0
for param in linear_layer.parameters():
    if param.requires_grad:
        l1_penalty = l1_penalty + torch.sum(torch.abs(param))

# Add the L1 penalty to the base loss
total_loss_l1 = loss + l1_lambda * l1_penalty
total_loss_l1.backward()
optimizer_l1.step()

print(f"Base Loss: {loss.item():.4f}")
print(f"L1 Penalty: {l1_penalty.item():.4f}")
print(f"Total Loss with L1: {total_loss_l1.item():.4f}")
Example code showing how L2 regularization is often handled via the optimizer's weight_decay parameter, while L1 regularization typically involves manually calculating the L1 norm of the weights and adding it to the primary loss function.
L2 Regularization: A sensible default when you want general weight shrinkage to combat overfitting and expect most features to carry at least some useful signal.
L1 Regularization: Worth considering when you suspect many input features are irrelevant or redundant and a sparse model with implicit feature selection would be beneficial.
Elastic Net: As discussed in the next section, Elastic Net combines L1 and L2. It can be useful when you want some of the feature selection properties of L1 but also the general weight shrinkage of L2, especially when dealing with correlated features.
Both L1 and L2 introduce a hyperparameter, λ, which controls the strength of the regularization. This value needs to be tuned, often using techniques like grid search or random search on a validation set, as its optimal value is problem-dependent. Choosing too large a λ can lead to underfitting (oversimplifying the model), while too small a λ may not provide sufficient regularization.
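As a concrete illustration of this tuning process, the sketch below runs a small grid search over candidate weight_decay (L2 strength) values on a synthetic regression problem and picks the value with the lowest validation loss. The data, model size, learning rate, and candidate grid are all illustrative choices, not recommendations.

import torch
import torch.nn as nn
import torch.optim as optim

# Synthetic regression data, split into training and validation sets.
torch.manual_seed(0)
X = torch.randn(200, 50)
y = X[:, :5].sum(dim=1, keepdim=True) + 0.1 * torch.randn(200, 1)
X_train, y_train = X[:150], y[:150]
X_val, y_val = X[150:], y[150:]

criterion = nn.MSELoss()

def train_and_validate(weight_decay, epochs=200):
    # Train a small linear model with the given L2 strength and
    # return its loss on the held-out validation set.
    model = nn.Linear(50, 1)
    optimizer = optim.Adam(model.parameters(), lr=0.01, weight_decay=weight_decay)
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = criterion(model(X_train), y_train)
        loss.backward()
        optimizer.step()
    with torch.no_grad():
        return criterion(model(X_val), y_val).item()

# Grid search over candidate regularization strengths.
candidates = [0.0, 1e-4, 1e-3, 1e-2, 1e-1]
results = {lam: train_and_validate(lam) for lam in candidates}
best_lambda = min(results, key=results.get)

print(results)
print(f"Best weight_decay on the validation set: {best_lambda}")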
In summary, while both L1 and L2 regularization aim to simplify models by penalizing large weights, they do so differently, leading to distinct model characteristics. L2 encourages small, diffuse weights (weight decay), while L1 encourages sparse weights (feature selection). Your choice between them depends on the characteristics of your data and whether feature selection is a desired outcome.