Now that we understand the concepts behind L1, L2, and Elastic Net regularization, let's look at how to incorporate these techniques into the training process using a popular deep learning framework like PyTorch. The goal is to modify the standard training loop to account for the added penalty terms in the loss function.
Most deep learning frameworks offer convenient ways to add weight regularization. The implementation strategy often differs slightly between L2 (weight decay) and L1 regularization due to the nature of their gradients.
L2 regularization is so common that most optimizers implement it directly through a parameter typically called weight_decay. Recall the L2 regularized loss function:

$$L_{\text{total}} = L_{\text{original}} + \frac{\lambda}{2} \sum_{w} w^2$$

When computing the gradient with respect to a weight $w_i$, the derivative of the L2 penalty term is simply $\lambda w_i$. During the weight update step (e.g., in SGD), the update rule becomes:
$$w_i \leftarrow w_i - \eta \left( \frac{\partial L_{\text{original}}}{\partial w_i} + \lambda w_i \right)$$

This can be rearranged as:

$$w_i \leftarrow (1 - \eta \lambda)\, w_i - \eta \frac{\partial L_{\text{original}}}{\partial w_i}$$

This update rule shows that at each step, the weight is first shrunk by a factor of $(1 - \eta\lambda)$ before the standard gradient update is applied. This is why L2 regularization is often referred to as "weight decay".
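As a quick sanity check, here is a minimal sketch with made-up values for the learning rate, weight, and gradient, confirming that the penalized-gradient form and the "shrink then update" form produce the same result:

import torch

# Made-up values: learning rate, L2 strength, a weight, and the
# gradient of the original (unpenalized) loss with respect to it
lr, lam = 0.1, 0.01
w = torch.tensor(2.0)
grad_original = torch.tensor(0.5)

# Form 1: SGD step on the gradient of the penalized loss
w1 = w - lr * (grad_original + lam * w)

# Form 2: decay the weight, then apply the original gradient
w2 = (1 - lr * lam) * w - lr * grad_original

print(w1.item(), w2.item())  # both 1.948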
In PyTorch, you can easily add weight decay by specifying the weight_decay argument when initializing the optimizer. This argument corresponds to the $\lambda$ hyperparameter in the L2 penalty formula.
import torch
import torch.optim as optim
from torch import nn

# Define an example network (replace with your own model)
model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 1))

# Define original loss function
criterion = nn.MSELoss()

# Define optimizer (e.g., Adam) with weight decay (L2 regularization)
# lambda_l2 is the regularization strength
lambda_l2 = 0.01
optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=lambda_l2)

# --- Inside the training loop ---
# Dummy batch so the snippet runs end to end
inputs = torch.randn(32, 10)
targets = torch.randn(32, 1)

outputs = model(inputs)
loss = criterion(outputs, targets)

# The weight decay is applied automatically by the optimizer during step()
optimizer.zero_grad()
loss.backward()
optimizer.step()
# --- End training loop snippet ---

print(f"Optimizer initialized with weight decay (L2 lambda): {lambda_l2}")
Using the weight_decay parameter is the standard and most efficient way to implement L2 regularization in PyTorch. One caveat: with adaptive optimizers such as Adam, this argument applies the classic L2 penalty coupled with the adaptive learning rates, whereas optim.AdamW implements decoupled weight decay, which often generalizes better in practice.
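If you want the decoupled variant, the change is a single line (a sketch reusing the model and lambda_l2 from the snippet above):

# Decoupled weight decay: AdamW applies the decay to the weights
# directly instead of adding lambda * w to the gradient
optimizer = optim.AdamW(model.parameters(), lr=0.001, weight_decay=lambda_l2)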
L1 regularization adds a penalty proportional to the sum of the absolute values of the weights:

$$L_{\text{total}} = L_{\text{original}} + \lambda_1 \sum_{w} |w|$$

The gradient of the L1 penalty term is $\lambda_1 \, \text{sign}(w)$, which is problematic: the sign function is undefined at $w = 0$, and its derivative (needed by some more advanced optimizers) is zero everywhere except at $w = 0$, where it is unbounded. Because of this, L1 regularization is typically not implemented directly within optimizers the way weight decay is.
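You can see how autograd handles this kink in practice; rather than failing at the non-differentiable point, PyTorch falls back to the subgradient value 0 there, as this small illustration shows:

import torch

# abs() is not differentiable at zero; autograd uses the
# subgradient sign(0) = 0 instead of raising an error
w = torch.tensor(0.0, requires_grad=True)
w.abs().backward()
print(w.grad)  # tensor(0.)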
Instead, we usually add the L1 penalty term manually to the loss before performing the backward pass.
import torch
import torch.optim as optim
from torch import nn

# Define an example network (replace with your own model)
model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 1))

# Define original loss function
criterion = nn.MSELoss()

# Define optimizer (e.g., SGD) without built-in L1
optimizer = optim.SGD(model.parameters(), lr=0.001)

# L1 regularization strength
lambda_l1 = 0.005

# --- Inside the training loop ---
# Dummy batch so the snippet runs end to end
inputs = torch.randn(32, 10)
targets = torch.randn(32, 1)

outputs = model(inputs)
original_loss = criterion(outputs, targets)

# Calculate the L1 penalty
l1_penalty = torch.tensor(0.0)
for param in model.parameters():
    # Penalize only weights, not biases (optional but common)
    if param.dim() > 1:
        l1_penalty = l1_penalty + torch.sum(torch.abs(param))

# Add the scaled L1 penalty to the original loss
total_loss = original_loss + lambda_l1 * l1_penalty

# Backpropagate the total loss and update the weights
optimizer.zero_grad()
total_loss.backward()
optimizer.step()
# --- End training loop snippet ---

print(f"Manually adding L1 penalty with lambda: {lambda_l1}")
print(f"Original Loss: {original_loss.item():.4f}")
print(f"L1 Penalty: {l1_penalty.item():.4f}")
print(f"Total Loss: {total_loss.item():.4f}")
In the snippet above, we iterate through the model's parameters, calculate the sum of their absolute values (the L1 norm), scale it by the hyperparameter $\lambda_1$, and add it to the original loss computed by the criterion. The backward pass then computes gradients based on this combined total_loss. Note the common practice of penalizing only weight matrices/tensors (where param.dim() > 1) and not bias vectors, though this is a design choice.
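An equivalent filter matches on parameter names instead, a minimal sketch assuming the standard PyTorch convention that bias parameters contain "bias" in their names:

# Sum |w| over every parameter not named as a bias
l1_penalty = sum(param.abs().sum()
                 for name, param in model.named_parameters()
                 if "bias" not in name)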
Elastic Net combines the L1 and L2 penalties:

$$L_{\text{total}} = L_{\text{original}} + \lambda_1 \sum_{w} |w| + \frac{\lambda_2}{2} \sum_{w} w^2$$

To implement Elastic Net, we simply combine the two techniques described above: use the optimizer's weight_decay parameter for the L2 part ($\lambda_2$) and manually add the L1 penalty term ($\lambda_1$) to the loss.
import torch
import torch.optim as optim
from torch import nn

# Define an example network (replace with your own model)
model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 1))

# Define original loss function
criterion = nn.MSELoss()

# Regularization strengths
lambda_l1 = 0.005
lambda_l2 = 0.01

# Define optimizer with weight decay for the L2 part
optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=lambda_l2)

# --- Inside the training loop ---
# Dummy batch so the snippet runs end to end
inputs = torch.randn(32, 10)
targets = torch.randn(32, 1)

outputs = model(inputs)
original_loss = criterion(outputs, targets)

# Calculate the L1 penalty over the weight tensors
l1_penalty = torch.tensor(0.0)
for param in model.parameters():
    if param.dim() > 1:
        l1_penalty = l1_penalty + torch.sum(torch.abs(param))

# Add the L1 penalty to the original loss
# The L2 penalty is handled by the optimizer's weight_decay
total_loss = original_loss + lambda_l1 * l1_penalty

# Backpropagate the total loss (original loss + L1 part)
optimizer.zero_grad()
total_loss.backward()
optimizer.step()  # Optimizer applies L2 decay during the update
# --- End training loop snippet ---

print(f"Implementing Elastic Net with L1 lambda: {lambda_l1}, L2 lambda (weight_decay): {lambda_l2}")
The regularization strengths $\lambda$, $\lambda_1$, and $\lambda_2$ are hyperparameters that control the impact of the penalty terms. Finding appropriate values for them is part of the model tuning process. Useful values typically span several orders of magnitude, for example from $10^{-6}$ to $10^{-1}$, so they are usually searched on a logarithmic scale. Techniques like grid search or random search over a validation set are commonly used to find good values. We will discuss hyperparameter tuning in more detail in Chapter 7.
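As a minimal sketch of such a search over the L2 strength (train_and_validate is a hypothetical placeholder standing in for your actual training and validation routine):

import numpy as np

# Candidate strengths spaced logarithmically from 1e-6 to 1e-1
candidate_lambdas = np.logspace(-6, -1, num=6)

def train_and_validate(weight_decay):
    # Placeholder: train a fresh model with this weight_decay and
    # return its validation loss; a random value stands in here
    return float(np.random.rand())

best_lambda, best_val_loss = None, float("inf")
for lam in candidate_lambdas:
    val_loss = train_and_validate(weight_decay=lam)
    if val_loss < best_val_loss:
        best_lambda, best_val_loss = lam, val_loss

print(f"Best lambda on validation set: {best_lambda:.0e}")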
With these implementation methods, you can now effectively apply L1, L2, or Elastic Net regularization to your neural network models, providing you with powerful tools to combat overfitting and improve generalization. The next section provides a hands-on practical exercise to solidify these concepts.