We've seen how L1 regularization encourages sparsity by driving some weights exactly to zero, effectively performing feature selection. We've also explored L2 regularization, often called weight decay, which prefers smaller weights overall and tends to shrink groups of correlated features together without necessarily making them zero.
What if you want a combination of these effects? Perhaps you suspect many features are irrelevant (suggesting L1), but you also believe there might be groups of correlated features that are important, and you don't want L1 to arbitrarily pick just one from the group (a scenario where L2 excels). This is where Elastic Net regularization comes in.
Elastic Net linearly combines the L1 and L2 penalties. Instead of just adding one type of penalty to the original loss function Lorig(w), we add both:
$$L_{\text{elastic}}(w) = L_{\text{orig}}(w) + \lambda_1 \lVert w \rVert_1 + \lambda_2 \lVert w \rVert_2^2$$
Here, $\lVert w \rVert_1$ is the L1 norm (the sum of the absolute values of the weights) and $\lVert w \rVert_2^2$ is the squared L2 norm (the sum of the squared weights). The hyperparameters $\lambda_1 \ge 0$ and $\lambda_2 \ge 0$ control the strength of the L1 and L2 penalties, respectively.
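To make the penalty term concrete, here is a minimal NumPy sketch; the weight values and $\lambda$ values are arbitrary, chosen only for illustration:

import numpy as np

w = np.array([0.5, -1.2, 0.0, 3.0])  # example weight vector (hypothetical values)
lambda1, lambda2 = 0.01, 0.001       # penalty strengths (arbitrary for illustration)

l1_term = lambda1 * np.abs(w).sum()     # lambda1 * ||w||_1
l2_term = lambda2 * np.square(w).sum()  # lambda2 * ||w||_2^2
elastic_penalty = l1_term + l2_term     # added to the original loss L_orig(w)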
Often, you'll see this formulation parameterized differently in libraries, making it easier to tune the overall strength versus the mix:
$$L_{\text{elastic}}(w) = L_{\text{orig}}(w) + \alpha \left( \rho \lVert w \rVert_1 + \frac{1 - \rho}{2} \lVert w \rVert_2^2 \right)$$
In this common parameterization:
$\alpha \ge 0$ sets the overall strength of the regularization, while $\rho$ (exposed as l1_ratio in libraries) is a hyperparameter between 0 and 1 that controls the mix between L1 and L2: $\rho = 1$ recovers pure L1 regularization and $\rho = 0$ recovers pure L2.
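Matching terms between the two forms gives $\lambda_1 = \alpha\rho$ and $\lambda_2 = \alpha(1-\rho)/2$. The short NumPy sketch below checks this equivalence numerically; the weight vector and hyperparameter values are again arbitrary:

import numpy as np

w = np.array([0.5, -1.2, 0.0, 3.0])  # hypothetical weight vector
alpha, l1_ratio = 0.01, 0.5          # overall strength and mix (arbitrary values)

# Library-style form: alpha * (rho * ||w||_1 + (1 - rho)/2 * ||w||_2^2)
penalty_mixed = alpha * (l1_ratio * np.abs(w).sum()
                         + (1 - l1_ratio) / 2 * np.square(w).sum())

# Separate-lambda form: lambda1 * ||w||_1 + lambda2 * ||w||_2^2
lambda1 = alpha * l1_ratio
lambda2 = alpha * (1 - l1_ratio) / 2
penalty_split = lambda1 * np.abs(w).sum() + lambda2 * np.square(w).sum()

assert np.isclose(penalty_mixed, penalty_split)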
Elastic Net aims to inherit the best of both worlds: the sparsity and automatic feature selection of L1, together with the stability of L2, which shrinks correlated features as a group instead of arbitrarily keeping one and discarding the rest.
We can think of regularization as constraining the possible weight values. L1 regularization corresponds to a diamond-shaped constraint region (in 2D), while L2 corresponds to a circular region. Elastic Net creates a constraint region that blends these two shapes, having rounded corners but still favoring solutions along the axes compared to pure L2.
Comparison of the constraint regions for L1, L2, and Elastic Net regularization in two dimensions. Elastic Net combines the sharp corners of L1 (encouraging sparsity) with the roundness of L2.
Using Elastic Net introduces two hyperparameters to tune: the overall strength $\alpha$ and the mixing ratio $\rho$ (l1_ratio). Finding the optimal combination usually requires techniques like grid search or random search over possible values for both parameters, using a validation set to evaluate performance. A common approach is to try a few discrete values for $\rho$ (e.g., 0.1, 0.5, 0.9, 1.0) and search for the best $\alpha$ for each $\rho$.
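For linear models, scikit-learn's ElasticNetCV automates exactly this kind of search, cross-validating a grid of $\alpha$ values for each supplied l1_ratio. A minimal sketch, assuming scikit-learn is available and using synthetic data in place of your own X and y:

from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV

# Synthetic regression data stands in for your real X, y
X, y = make_regression(n_samples=200, n_features=50, noise=10.0, random_state=0)

# Try a few l1_ratio (rho) values; alpha is searched by cross-validation for each one
model = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0], n_alphas=100, cv=5)
model.fit(X, y)

print("best alpha:", model.alpha_)
print("best l1_ratio:", model.l1_ratio_)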
While L2 regularization (weight decay) is often built directly into optimizers like Adam or SGD, implementing L1 or Elastic Net penalties in deep learning might require a bit more care depending on the framework.
In Keras, for instance, layers accept arguments like kernel_regularizer, bias_regularizer, etc., directly when defining a layer, and you can often pass L1, L2, or combined L1/L2 regularizers there. In PyTorch, you typically add the penalty terms to the loss yourself. Here's a conceptual PyTorch example of adding the penalty manually:
import torch
import torch.nn as nn

# Assume 'model' is your neural network
# Assume 'loss' is the computed loss from your criterion (e.g., CrossEntropyLoss)

alpha = 0.001    # Overall regularization strength
l1_ratio = 0.5   # Mixing parameter (rho)

l1_lambda = alpha * l1_ratio
l2_lambda = alpha * (1 - l1_ratio)

l1_penalty = 0.0
l2_penalty = 0.0

# Iterate over model parameters (weights)
for param in model.parameters():
    if param.requires_grad and param.dim() > 1:  # Typically apply only to weight matrices
        l1_penalty += param.abs().sum()          # ||w||_1: sum of absolute values
        l2_penalty += param.pow(2).sum()         # ||w||_2^2: sum of squared values

# Combine original loss with penalties
# (The 1/2 factor for L2 matches the second formulation above)
total_loss = loss + l1_lambda * l1_penalty + (l2_lambda / 2) * l2_penalty

# Now proceed with backpropagation
# total_loss.backward()
# optimizer.step()
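Since most PyTorch optimizers already apply an L2 penalty through their weight_decay argument, a common variant is to let the optimizer handle the L2 part and add only the L1 term to the loss. The sketch below assumes the same model, loss, alpha, and l1_ratio as above; note that weight_decay is applied to every parameter, biases included, so it is not exactly equivalent to the filtered loop shown earlier.

# Alternative: let the optimizer apply the L2 part via weight_decay
# (SGD's weight_decay adds l2_lambda * w to the gradient, matching the
#  gradient of the (l2_lambda / 2) * ||w||^2 term above)
optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.01,
                            weight_decay=alpha * (1 - l1_ratio))

# Only the L1 term still needs to be added to the loss manually
l1_penalty = sum(p.abs().sum()
                 for p in model.parameters()
                 if p.requires_grad and p.dim() > 1)
total_loss = loss + alpha * l1_ratio * l1_penalty

# total_loss.backward()
# optimizer.step()

With adaptive optimizers such as Adam, weight_decay is folded into the adaptive update (unlike AdamW's decoupled decay), so the effective regularization can behave differently than with plain SGD.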
Note: This is a conceptual illustration. Efficient implementations often hook into the framework's gradient calculation or optimizer steps. Check your framework's documentation for recommended ways to apply combined L1/L2 penalties.
Elastic Net provides a flexible regularization option when you believe both sparsity and handling of correlated predictors are desirable. It requires tuning an additional hyperparameter but can lead to models that generalize better than those trained with only L1 or L2 regularization in certain situations.