Now that we understand the concepts behind L1, L2, and Elastic Net regularization, let's look at how to incorporate these techniques into the training process using a popular deep learning framework like PyTorch. The goal is to modify the standard training loop to account for the added penalty terms in the loss function.
Most deep learning frameworks offer convenient ways to add weight regularization. The implementation strategy often differs slightly between L2 (weight decay) and L1 regularization due to the nature of their gradients.
L2 regularization is so common that most optimizers implement it directly through a parameter typically called weight_decay. Recall the L2 regularized loss function:

$$L_{\text{total}} = L_{\text{original}} + \frac{\lambda}{2} \sum_{w} w^2$$

When computing the gradient with respect to a weight $w_i$, the derivative of the L2 penalty term is simply $\lambda w_i$. During the weight update step (e.g., in SGD), the update rule becomes:
$$w_i \leftarrow w_i - \eta \left( \frac{\partial L_{\text{original}}}{\partial w_i} + \lambda w_i \right)$$

This can be rearranged as:

$$w_i \leftarrow (1 - \eta \lambda)\, w_i - \eta \frac{\partial L_{\text{original}}}{\partial w_i}$$

This update rule shows that at each step, the weight is first shrunk by a factor of $(1 - \eta\lambda)$ before the standard gradient update is applied. This is why L2 regularization is often referred to as "weight decay".
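As a quick sanity check, here is a minimal sketch with made-up values for the learning rate, weight, and gradient, confirming that the penalized-gradient form and the "shrink then update" form produce the same result:

import torch

# Made-up values: learning rate, L2 strength, a weight, and the
# gradient of the original (unpenalized) loss with respect to it
lr, lam = 0.1, 0.01
w = torch.tensor(2.0)
grad_original = torch.tensor(0.5)

# Form 1: SGD step on the gradient of the penalized loss
w1 = w - lr * (grad_original + lam * w)

# Form 2: decay the weight, then apply the original gradient
w2 = (1 - lr * lam) * w - lr * grad_original

print(w1.item(), w2.item())  # both 1.948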
In PyTorch, you can easily add weight decay by specifying the weight_decay argument when initializing the optimizer. This argument corresponds to the $\lambda$ hyperparameter in the L2 penalty formula.
import torch
import torch.optim as optim
from torch import nn

# Define an example network (replace with your own model)
model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 1))

# Define original loss function
criterion = nn.MSELoss()

# Define optimizer (e.g., Adam) with weight decay (L2 regularization)
# lambda_l2 is the regularization strength
lambda_l2 = 0.01
optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=lambda_l2)

# --- Inside the training loop ---
# Dummy batch so the snippet runs end to end
inputs = torch.randn(32, 10)
targets = torch.randn(32, 1)

outputs = model(inputs)
loss = criterion(outputs, targets)

# The weight decay is applied automatically by the optimizer during step()
optimizer.zero_grad()
loss.backward()
optimizer.step()
# --- End training loop snippet ---

print(f"Optimizer initialized with weight decay (L2 lambda): {lambda_l2}")
Using the weight_decay parameter is the standard and most efficient way to implement L2 regularization in PyTorch. One caveat: with adaptive optimizers such as Adam, this argument applies the classic L2 penalty coupled with the adaptive learning rates, whereas optim.AdamW implements decoupled weight decay, which often generalizes better in practice.
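If you want the decoupled variant, the change is a single line (a sketch reusing the model and lambda_l2 from the snippet above):

# Decoupled weight decay: AdamW applies the decay to the weights
# directly instead of adding lambda * w to the gradient
optimizer = optim.AdamW(model.parameters(), lr=0.001, weight_decay=lambda_l2)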
L1 regularization adds a penalty proportional to the sum of the absolute values of the weights:

$$L_{\text{total}} = L_{\text{original}} + \lambda_1 \sum_{w} |w|$$

The gradient of the L1 penalty term is $\lambda_1 \, \text{sign}(w)$, which is problematic: the sign function is undefined at $w = 0$, and its derivative (needed by some more advanced optimizers) is zero everywhere except at $w = 0$, where it is unbounded. Because of this, L1 regularization is typically not implemented directly within optimizers the way weight decay is.
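You can see how autograd handles this kink in practice; rather than failing at the non-differentiable point, PyTorch falls back to the subgradient value 0 there, as this small illustration shows:

import torch

# abs() is not differentiable at zero; autograd uses the
# subgradient sign(0) = 0 instead of raising an error
w = torch.tensor(0.0, requires_grad=True)
w.abs().backward()
print(w.grad)  # tensor(0.)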
Instead, we usually add the L1 penalty term manually to the loss before performing the backward pass.
import torch
import torch.optim as optim
from torch import nn

# Define an example network (replace with your own model)
model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 1))

# Define original loss function
criterion = nn.MSELoss()

# Define optimizer (e.g., SGD) without built-in L1
optimizer = optim.SGD(model.parameters(), lr=0.001)

# L1 regularization strength
lambda_l1 = 0.005

# --- Inside the training loop ---
# Dummy batch so the snippet runs end to end
inputs = torch.randn(32, 10)
targets = torch.randn(32, 1)

outputs = model(inputs)
original_loss = criterion(outputs, targets)

# Calculate the L1 penalty
l1_penalty = torch.tensor(0.0)
for param in model.parameters():
    # Penalize only weights, not biases (optional but common)
    if param.dim() > 1:
        l1_penalty = l1_penalty + torch.sum(torch.abs(param))

# Add the scaled L1 penalty to the original loss
total_loss = original_loss + lambda_l1 * l1_penalty

# Backpropagate the total loss and update the weights
optimizer.zero_grad()
total_loss.backward()
optimizer.step()
# --- End training loop snippet ---

print(f"Manually adding L1 penalty with lambda: {lambda_l1}")
print(f"Original Loss: {original_loss.item():.4f}")
print(f"L1 Penalty: {l1_penalty.item():.4f}")
print(f"Total Loss: {total_loss.item():.4f}")
In the snippet above, we iterate through the model's parameters, calculate the sum of their absolute values (the L1 norm), scale it by the hyperparameter $\lambda_1$, and add it to the original loss computed by the criterion. The backward pass then computes gradients based on this combined total_loss. Note the common practice of penalizing only weight matrices/tensors (where param.dim() > 1) and not bias vectors, though this is a design choice.
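An equivalent filter matches on parameter names instead, a minimal sketch assuming the standard PyTorch convention that bias parameters contain "bias" in their names:

# Sum |w| over every parameter not named as a bias
l1_penalty = sum(param.abs().sum()
                 for name, param in model.named_parameters()
                 if "bias" not in name)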
Elastic Net combines the L1 and L2 penalties:

$$L_{\text{total}} = L_{\text{original}} + \lambda_1 \sum_{w} |w| + \frac{\lambda_2}{2} \sum_{w} w^2$$

To implement Elastic Net, we simply combine the two techniques described above: use the optimizer's weight_decay parameter for the L2 part ($\lambda_2$) and manually add the L1 penalty term ($\lambda_1$) to the loss.
import torch
import torch.optim as optim
from torch import nn

# Define an example network (replace with your own model)
model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 1))

# Define original loss function
criterion = nn.MSELoss()

# Regularization strengths
lambda_l1 = 0.005
lambda_l2 = 0.01

# Define optimizer with weight decay for the L2 part
optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=lambda_l2)

# --- Inside the training loop ---
# Dummy batch so the snippet runs end to end
inputs = torch.randn(32, 10)
targets = torch.randn(32, 1)

outputs = model(inputs)
original_loss = criterion(outputs, targets)

# Calculate the L1 penalty over the weight tensors
l1_penalty = torch.tensor(0.0)
for param in model.parameters():
    if param.dim() > 1:
        l1_penalty = l1_penalty + torch.sum(torch.abs(param))

# Add the L1 penalty to the original loss
# The L2 penalty is handled by the optimizer's weight_decay
total_loss = original_loss + lambda_l1 * l1_penalty

# Backpropagate the total loss (original loss + L1 part)
optimizer.zero_grad()
total_loss.backward()
optimizer.step()  # Optimizer applies L2 decay during the update
# --- End training loop snippet ---

print(f"Implementing Elastic Net with L1 lambda: {lambda_l1}, L2 lambda (weight_decay): {lambda_l2}")
The regularization strengths $\lambda$, $\lambda_1$, and $\lambda_2$ are hyperparameters that control the impact of the penalty terms. Finding appropriate values for them is part of the model tuning process. Useful values typically span several orders of magnitude, for example from $10^{-6}$ to $10^{-1}$, so they are usually searched on a logarithmic scale. Techniques like grid search or random search over a validation set are commonly used to find good values. We will discuss hyperparameter tuning in more detail in Chapter 7.
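As a minimal sketch of such a search over the L2 strength (train_and_validate is a hypothetical placeholder standing in for your actual training and validation routine):

import numpy as np

# Candidate strengths spaced logarithmically from 1e-6 to 1e-1
candidate_lambdas = np.logspace(-6, -1, num=6)

def train_and_validate(weight_decay):
    # Placeholder: train a fresh model with this weight_decay and
    # return its validation loss; a random value stands in here
    return float(np.random.rand())

best_lambda, best_val_loss = None, float("inf")
for lam in candidate_lambdas:
    val_loss = train_and_validate(weight_decay=lam)
    if val_loss < best_val_loss:
        best_lambda, best_val_loss = lam, val_loss

print(f"Best lambda on validation set: {best_lambda:.0e}")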
With these implementation methods, you can now effectively apply L1, L2, or Elastic Net regularization to your neural network models, providing you with powerful tools to combat overfitting and improve generalization. The next section provides a hands-on practical exercise to solidify these concepts.