Let's put the theory into practice. In the previous chapters, we examined various techniques for regularization (L1/L2, Dropout, Early Stopping, Data Augmentation) and optimization (SGD variants, Adam, RMSprop, Learning Rate Schedules) individually. Now, we'll integrate several of these methods into a typical deep learning workflow to build and tune a model, observing how they work together to improve generalization.
Our goal is to train a Convolutional Neural Network (CNN) for image classification on the Fashion-MNIST dataset. We'll start with a basic model and iteratively add components, monitoring the effects on training dynamics and validation performance.
First, ensure you have PyTorch and torchvision installed (for example, via pip install torch torchvision). We'll use Fashion-MNIST, a dataset of 28x28 grayscale images of clothing items split into 10 categories. It's a standard benchmark, slightly more complex than MNIST digits.
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader
# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Hyperparameters (Initial)
num_epochs = 15
batch_size = 128
learning_rate = 0.001
# Data loading and transformation
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))  # Normalize for grayscale images
])

train_dataset = torchvision.datasets.FashionMNIST(root='./data', train=True,
                                                  download=True, transform=transform)
test_dataset = torchvision.datasets.FashionMNIST(root='./data', train=False,
                                                 download=True, transform=transform)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)
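The sections below track validation metrics; here the test split doubles as the validation set. If you prefer a dedicated held-out split (an optional variation, not part of the original setup), you could carve one out of the training data with random_split:

from torch.utils.data import random_split

# Optional: hold out part of the 60,000 training images as a validation set
# (the 54,000/6,000 split below is an arbitrary illustrative choice)
train_subset, val_subset = random_split(
    train_dataset, [54000, 6000],
    generator=torch.Generator().manual_seed(42))

train_loader = DataLoader(train_subset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_subset, batch_size=batch_size, shuffle=False)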
Let's define a simple CNN architecture without explicit regularization beyond standard optimization.
# Simple CNN Architecture
class SimpleCNN(nn.Module):
    def __init__(self):
        super(SimpleCNN, self).__init__()
        self.conv1 = nn.Conv2d(1, 16, kernel_size=3, stride=1, padding=1)
        self.relu1 = nn.ReLU()
        self.pool1 = nn.MaxPool2d(kernel_size=2, stride=2)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, stride=1, padding=1)
        self.relu2 = nn.ReLU()
        self.pool2 = nn.MaxPool2d(kernel_size=2, stride=2)
        # Flatten the output for the fully connected layer
        # Input image 28x28 -> pool1 -> 14x14 -> pool2 -> 7x7
        # Output features: 32 channels * 7 * 7
        self.fc1 = nn.Linear(32 * 7 * 7, 128)
        self.relu3 = nn.ReLU()
        self.fc2 = nn.Linear(128, 10)  # 10 classes

    def forward(self, x):
        out = self.pool1(self.relu1(self.conv1(x)))
        out = self.pool2(self.relu2(self.conv2(out)))
        out = out.view(out.size(0), -1)  # Flatten
        out = self.relu3(self.fc1(out))
        out = self.fc2(out)
        return out


# Instantiate baseline model, loss, and optimizer
model_base = SimpleCNN().to(device)
criterion = nn.CrossEntropyLoss()
optimizer_base = optim.Adam(model_base.parameters(), lr=learning_rate)

# --- Placeholder for Baseline Training Loop ---
# You would typically train this model here, recording train/validation loss and accuracy per epoch.
# We will simulate the results for brevity.
print("Baseline model defined. (Training simulation follows)")
After training the baseline model, we might observe learning curves like the simulated ones below. Often, the training loss decreases steadily while the validation loss starts to increase after some epochs, indicating overfitting.
Simulated learning curves for the baseline model. Note the increasing validation loss and stagnating validation accuracy, while training loss/accuracy continues to improve, a classic sign of overfitting.
Now, let's enhance our model by adding Batch Normalization, Dropout, and L2 regularization (Weight Decay). We'll also stick with the Adam optimizer.
We need to add nn.BatchNorm2d after the convolutional layers (usually before the activation) and nn.Dropout, typically after the activation in the fully connected layers.
# Enhanced CNN Architecture
class EnhancedCNN(nn.Module):
    def __init__(self, dropout_rate=0.5):
        super(EnhancedCNN, self).__init__()
        self.conv1 = nn.Conv2d(1, 16, kernel_size=3, stride=1, padding=1)
        self.bn1 = nn.BatchNorm2d(16)  # Added BN
        self.relu1 = nn.ReLU()
        self.pool1 = nn.MaxPool2d(kernel_size=2, stride=2)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, stride=1, padding=1)
        self.bn2 = nn.BatchNorm2d(32)  # Added BN
        self.relu2 = nn.ReLU()
        self.pool2 = nn.MaxPool2d(kernel_size=2, stride=2)
        self.fc1 = nn.Linear(32 * 7 * 7, 128)
        self.relu3 = nn.ReLU()
        self.dropout = nn.Dropout(dropout_rate)  # Added Dropout
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        # Conv block 1
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu1(out)
        out = self.pool1(out)
        # Conv block 2
        out = self.conv2(out)
        out = self.bn2(out)
        out = self.relu2(out)
        out = self.pool2(out)
        # Flatten and FC layers
        out = out.view(out.size(0), -1)
        out = self.fc1(out)
        out = self.relu3(out)
        out = self.dropout(out)  # Apply dropout before the final layer
        out = self.fc2(out)
        return out


# Instantiate enhanced model and criterion
model_enhanced = EnhancedCNN(dropout_rate=0.5).to(device)
criterion = nn.CrossEntropyLoss()  # Same loss function

# --- Note on Optimizer Setup ---
# L2 regularization (weight decay) is added directly in the optimizer
l2_lambda = 0.0001  # Example L2 strength
optimizer_enhanced = optim.Adam(model_enhanced.parameters(),
                                lr=learning_rate,
                                weight_decay=l2_lambda)

# --- Placeholder for Enhanced Training Loop ---
# Similar training loop as before, but using model_enhanced and optimizer_enhanced.
# Remember to set model.train() and model.eval() appropriately due to BN and Dropout.
print(f"Enhanced model defined with Dropout, BatchNorm, and L2 Weight Decay (lambda={l2_lambda}).")
The key changes are:

- Batch Normalization (nn.BatchNorm2d): added after each convolutional layer, before the ReLU activation. This helps stabilize training, allows potentially higher learning rates, and provides a slight regularization effect.
- Dropout (nn.Dropout): added after the activation of the first fully connected layer. It randomly sets a fraction of its inputs to zero during training, preventing over-reliance on specific neurons and encouraging feature redundancy.
- L2 regularization (weight decay): added to the optim.Adam optimizer via the weight_decay parameter. This penalizes large weights, encouraging simpler models.

When training a model with Dropout and Batch Normalization, it's important to manage the model's state:

- Call model.train() before the training loop for each epoch. This enables Dropout and ensures BN uses batch statistics.
- Call model.eval() before the validation/testing loop. This disables Dropout and ensures BN uses the running estimates of mean and variance accumulated during training. The short check below illustrates the difference.
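To make the difference concrete, here is a small illustrative check (not from the original text) showing how a Dropout layer behaves in the two modes; Batch Normalization switches between batch statistics and running statistics in the same way:

# Illustrative check (assumed example): Dropout is active in train mode, a no-op in eval mode
layer = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

layer.train()
print(layer(x))  # roughly half the entries zeroed, survivors scaled by 1/(1 - p) = 2

layer.eval()
print(layer(x))  # identical to x: dropout is disabled at evaluation time

After training the enhanced model, we compare its learning curves to the baseline.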
Simulated comparison of learning curves. The enhanced model shows slower initial training convergence (due to regularization) but achieves lower validation loss and higher validation accuracy, with a smaller gap between training and validation metrics, indicating better generalization.
Observations:

- The enhanced model converges more slowly at first; the regularization deliberately restricts how quickly it can fit the training data.
- It reaches a lower validation loss and higher validation accuracy than the baseline.
- The gap between training and validation metrics is smaller, the signature of better generalization.
This practical session demonstrates the integration of common techniques. However, finding the optimal combination often requires experimentation:

- Hyperparameter tuning: adjust dropout_rate, weight_decay (L2 lambda), and learning_rate. Use techniques like random search or more advanced Bayesian optimization.
- Learning rate schedules: add a scheduler (e.g., torch.optim.lr_scheduler.StepLR or CosineAnnealingLR) to potentially improve convergence further.
- Data augmentation: add random transforms (for example, transforms.RandomHorizontalFlip) to the transforms.Compose pipeline for the training set. This acts as another powerful form of regularization.

Example incorporating an LR scheduler and suggesting early stopping logic (a patience-based sketch of the early stopping check follows the code):
# ... (Enhanced model and dataset setup as before) ...
optimizer_enhanced = optim.Adam(model_enhanced.parameters(), lr=learning_rate, weight_decay=l2_lambda)

# Add a learning rate scheduler
scheduler = optim.lr_scheduler.StepLR(optimizer_enhanced, step_size=5, gamma=0.1)  # Multiply LR by 0.1 every 5 epochs

# --- Placeholder for Training Loop with Scheduler and Early Stopping Logic ---
# Inside your epoch loop:
#     model_enhanced.train()
#     ... (forward pass, loss calculation, backward pass, optimizer_enhanced.step()) ...
#
#     model_enhanced.eval()
#     ... (validation loop) ...
#     Check validation loss for the early stopping criterion
#
#     scheduler.step()  # Call once per epoch to update the learning rate
# ---
print("Training setup includes Adam, L2, Dropout, BN, LR Scheduler.")
This hands-on exercise demonstrates how combining regularization techniques like Dropout, Batch Normalization, and Weight Decay with appropriate optimization strategies like Adam leads to models that generalize better than simpler baseline models. By systematically adding these components and monitoring their effects using validation metrics and learning curves, you can effectively combat overfitting and build more reliable deep learning systems. Remember that the specific combination and tuning of these techniques depend heavily on the dataset, model architecture, and the specific task at hand. Experimentation is a standard part of the process.