Now that we understand the theory behind L1 and L2 regularization, let's put them into practice. In this section, we'll build a simple neural network, train it on data designed to encourage overfitting, and then apply L1 and L2 regularization to see their impact firsthand. We'll use PyTorch for implementation, but the concepts apply equally to other frameworks.
Imagine we have a binary classification problem. We'll generate some synthetic data where the decision boundary isn't perfectly linear, making it easy for a flexible model to overfit the training noise.
First, let's import the necessary libraries and generate some data:
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import numpy as np
# Generate synthetic data
X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
# Split data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)
# Scale data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
# Convert to PyTorch tensors
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.float32).unsqueeze(1)
X_val_tensor = torch.tensor(X_val, dtype=torch.float32)
y_val_tensor = torch.tensor(y_val, dtype=torch.float32).unsqueeze(1)
We'll use a simple feed-forward network with two hidden layers. This architecture is complex enough to potentially overfit our synthetic data.
class SimpleNet(nn.Module):
    def __init__(self):
        super(SimpleNet, self).__init__()
        self.layer1 = nn.Linear(2, 128)
        self.relu1 = nn.ReLU()
        self.layer2 = nn.Linear(128, 64)
        self.relu2 = nn.ReLU()
        self.output_layer = nn.Linear(64, 1)  # Output layer for binary classification

    def forward(self, x):
        x = self.relu1(self.layer1(x))
        x = self.relu2(self.layer2(x))
        x = self.output_layer(x)  # No sigmoid here, using BCEWithLogitsLoss
        return x
Let's define a function to handle the training process. This will make it easier to reuse the training logic for different regularization settings.
def train_model(model, optimizer, criterion, X_train, y_train, X_val, y_val, epochs=500, l1_lambda=0.0):
    train_losses = []
    val_losses = []
    val_accuracies = []

    for epoch in range(epochs):
        model.train()  # Set model to training mode

        # Forward pass
        outputs = model(X_train)
        loss = criterion(outputs, y_train)

        # --- L1 Regularization (if applicable) ---
        if l1_lambda > 0:
            l1_penalty = 0
            for param in model.parameters():
                # Check if the parameter requires gradients (i.e., it's learnable)
                if param.requires_grad:
                    l1_penalty += torch.norm(param, 1)  # Calculate L1 norm
            loss = loss + l1_lambda * l1_penalty
        # --- End L1 Regularization ---

        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # --- Validation ---
        model.eval()  # Set model to evaluation mode
        with torch.no_grad():
            val_outputs = model(X_val)
            val_loss = criterion(val_outputs, y_val)
            # Calculate accuracy
            predicted = torch.sigmoid(val_outputs) >= 0.5
            correct = (predicted == y_val.bool()).sum().item()  # Compare as booleans
            val_accuracy = correct / y_val.size(0)

        train_losses.append(loss.item())
        val_losses.append(val_loss.item())
        val_accuracies.append(val_accuracy)

        if (epoch + 1) % 100 == 0:
            print(f'Epoch [{epoch+1}/{epochs}], Train Loss: {loss.item():.4f}, Val Loss: {val_loss.item():.4f}, Val Acc: {val_accuracy:.4f}')

    return train_losses, val_losses, val_accuracies
Note the section specifically for adding the L1 penalty. We manually iterate through the model's parameters, calculate the L1 norm (torch.norm(param, 1)), sum them up, multiply by the L1 strength (l1_lambda), and add it to the original loss before backpropagation.
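As a side note, the same penalty can be written more compactly. The following sketch assumes the same model, loss, and l1_lambda variables that appear inside train_model and produces the identical result to the loop above:
# Equivalent, more compact form of the L1 penalty used inside train_model
l1_penalty = sum(param.abs().sum() for param in model.parameters() if param.requires_grad)
loss = loss + l1_lambda * l1_penalty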
Let's train the model without any regularization first to establish a baseline. We'll use the Adam optimizer and binary cross-entropy loss (with logits, as our model doesn't have a final sigmoid).
# Instantiate model, criterion, and optimizer (no regularization)
model_base = SimpleNet()
criterion = nn.BCEWithLogitsLoss()
optimizer_base = optim.Adam(model_base.parameters(), lr=0.001)
print("Training Baseline Model (No Regularization)...")
base_train_loss, base_val_loss, base_val_acc = train_model(
    model_base, optimizer_base, criterion,
    X_train_tensor, y_train_tensor, X_val_tensor, y_val_tensor,
    epochs=500
)
Typically, you'd observe the training loss decreasing steadily while the validation loss decreases initially but then starts to increase, indicating overfitting. The validation accuracy might plateau or even decrease.
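If you want a quick numerical check rather than a plot, you can compare the best and final validation losses in the recorded histories (using the lists returned by train_model above):
# Quick check: does validation loss bottom out early and then rise again?
best_val = min(base_val_loss)
best_epoch = base_val_loss.index(best_val) + 1
print(f"Best val loss {best_val:.4f} at epoch {best_epoch}, final val loss {base_val_loss[-1]:.4f}")
print(f"Final train/val gap: {base_val_loss[-1] - base_train_loss[-1]:.4f}")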
Adding L2 regularization is often straightforward as many optimizers include a built-in parameter for it, commonly called weight_decay. This parameter corresponds to the λ in the L2 penalty term (λ/2)‖w‖₂².
# Instantiate model, criterion, and optimizer with L2
model_l2 = SimpleNet()
# Note: Re-instantiate criterion if it has state, although BCEWithLogitsLoss is stateless
criterion_l2 = nn.BCEWithLogitsLoss()
l2_lambda = 0.01 # Regularization strength
optimizer_l2 = optim.Adam(model_l2.parameters(), lr=0.001, weight_decay=l2_lambda)
print("\nTraining Model with L2 Regularization...")
l2_train_loss, l2_val_loss, l2_val_acc = train_model(
    model_l2, optimizer_l2, criterion_l2,
    X_train_tensor, y_train_tensor, X_val_tensor, y_val_tensor,
    epochs=500
)
When training with L2 regularization, you should notice that the gap between the training loss and validation loss is smaller compared to the baseline. The validation loss might reach a lower minimum value, and the validation accuracy might improve or be more stable. L2 generally prevents weights from growing too large, leading to a smoother decision boundary.
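To see this shrinkage directly, you can compare the overall weight magnitudes of the two trained models. This is a small sketch using the model_base and model_l2 objects from above; the L2-regularized model's norm is typically noticeably smaller:
def total_weight_norm(model):
    # Square root of the summed squared L2 norms of all parameter tensors
    return sum(param.norm(2).item() ** 2 for param in model.parameters()) ** 0.5

print(f"Baseline total weight norm: {total_weight_norm(model_base):.2f}")
print(f"L2-regularized total weight norm: {total_weight_norm(model_l2):.2f}")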
As seen in our train_model function, implementing L1 requires manually adding the penalty to the loss. Let's train a model using this approach.
# Instantiate model, criterion, and optimizer for L1
model_l1 = SimpleNet()
criterion_l1 = nn.BCEWithLogitsLoss()
optimizer_l1 = optim.Adam(model_l1.parameters(), lr=0.001) # No weight_decay here
l1_lambda = 0.001 # L1 regularization strength (often smaller than L2)
print("\nTraining Model with L1 Regularization...")
l1_train_loss, l1_val_loss, l1_val_acc = train_model(
    model_l1, optimizer_l1, criterion_l1,
    X_train_tensor, y_train_tensor, X_val_tensor, y_val_tensor,
    epochs=500, l1_lambda=l1_lambda  # Pass the L1 lambda to the training function
)
With L1 regularization, we also expect to see reduced overfitting, similar to L2. However, L1 has the characteristic effect of potentially driving some weights to exactly zero. This doesn't always manifest dramatically in dense networks but can contribute to simpler models. The optimal l1_lambda might differ significantly from the optimal l2_lambda.
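You can check for this sparsity effect by counting near-zero weights in the trained models. The sketch below reuses model_base and model_l1 from above; the 1e-3 threshold is an arbitrary choice for illustration:
def near_zero_fraction(model, threshold=1e-3):
    # Fraction of parameters whose magnitude falls below the threshold
    total, near_zero = 0, 0
    for param in model.parameters():
        total += param.numel()
        near_zero += (param.abs() < threshold).sum().item()
    return near_zero / total

print(f"Baseline near-zero weights: {near_zero_fraction(model_base):.2%}")
print(f"L1-regularized near-zero weights: {near_zero_fraction(model_l1):.2%}")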
Visualizing the validation loss curves for all three models is the best way to see the impact of regularization.
Validation loss curves for baseline, L1-regularized, and L2-regularized models over 500 epochs. Note how the baseline loss starts increasing (overfitting), while L1 and L2 maintain lower validation loss. (Illustrative data)
You can similarly plot the validation accuracy. The regularized models will likely show higher or more stable validation accuracy compared to the baseline model, which might degrade after initially peaking.
Validation accuracy curves corresponding to the loss plot above. Regularized models achieve better and more sustained accuracy on unseen data. (Illustrative data)
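For reference, plots along these lines can be generated from the recorded histories with matplotlib (using the variable names from the training runs above):
epochs_range = range(1, len(base_val_loss) + 1)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
# Validation loss for each model
ax1.plot(epochs_range, base_val_loss, label='Baseline')
ax1.plot(epochs_range, l1_val_loss, label='L1')
ax1.plot(epochs_range, l2_val_loss, label='L2')
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Validation Loss')
ax1.legend()
# Validation accuracy for each model
ax2.plot(epochs_range, base_val_acc, label='Baseline')
ax2.plot(epochs_range, l1_val_acc, label='L1')
ax2.plot(epochs_range, l2_val_acc, label='L2')
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Validation Accuracy')
ax2.legend()
plt.tight_layout()
plt.show()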
The choice of λ (the weight_decay or l1_lambda value) matters: too small a value has little effect on overfitting, while too large a value can dominate the data loss and cause underfitting.
Finding the right λ typically involves hyperparameter tuning, often using techniques like grid search or random search on the validation set, which we will discuss later in the course.
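As a preview, a simple grid search over the L2 strength could look like the sketch below, which reuses the SimpleNet class and train_model function defined earlier; the candidate values are arbitrary choices for illustration:
candidate_lambdas = [0.0, 0.001, 0.01, 0.1]  # arbitrary grid of L2 strengths
results = {}
for lam in candidate_lambdas:
    model = SimpleNet()
    optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=lam)
    _, val_loss_history, _ = train_model(
        model, optimizer, nn.BCEWithLogitsLoss(),
        X_train_tensor, y_train_tensor, X_val_tensor, y_val_tensor,
        epochs=500
    )
    results[lam] = min(val_loss_history)  # best validation loss reached with this strength
best_lambda = min(results, key=results.get)
print(f"Best weight_decay: {best_lambda} (val loss {results[best_lambda]:.4f})")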
This practical demonstrated how to implement L1 and L2 weight regularization in a typical PyTorch training workflow. We observed how adding these penalties to the loss function (directly for L1, or via the optimizer's weight_decay for L2) helps combat overfitting, leading to better generalization performance as evidenced by improved validation loss and accuracy. Remember that the effectiveness and the optimal strength (λ) depend on the specific model, dataset, and task. Experimentation is often necessary to find the best configuration.