Now that we've explored the mechanisms behind adaptive optimization algorithms like AdaGrad, RMSprop, and Adam, let's see how they perform in practice compared to foundational methods like SGD and SGD with Momentum. This hands-on section will guide you through setting up and running an experiment to compare these optimizers on a common task, allowing you to observe their effects on training speed and model performance firsthand.
The goal is not just to see which optimizer "wins" on a specific problem, but to understand how their different update mechanisms lead to observable differences in the training process, such as convergence rate and stability.
We'll use a simple task: classifying handwritten digits from the MNIST dataset. This dataset is complex enough to highlight differences between optimizers but simple enough to train quickly. We'll use a basic Multi-Layer Perceptron (MLP) for this task.
First, let's define our neural network architecture using PyTorch:
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMLP(nn.Module):
    def __init__(self, input_size=784, hidden_size=128, num_classes=10):
        super(SimpleMLP, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        # Flatten the image into a (batch_size, 784) tensor
        x = x.view(x.size(0), -1)
        out = self.fc1(x)
        out = self.relu(out)
        out = self.fc2(out)
        # No softmax here, as CrossEntropyLoss expects raw logits
        return out

# Define model dimensions (MNIST images are 28x28 = 784 pixels)
input_size = 784
hidden_size = 128
num_classes = 10
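As a quick optional sanity check (our own addition, not part of the experiment itself), we can instantiate the model once and count its trainable parameters to confirm the architecture matches what we expect:

# Optional sanity check: instantiate the model and count trainable parameters
_check_model = SimpleMLP(input_size, hidden_size, num_classes)
num_params = sum(p.numel() for p in _check_model.parameters() if p.requires_grad)
print(_check_model)
print(f"Trainable parameters: {num_params}")  # 784*128 + 128 + 128*10 + 10 = 101,770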
Next, we need to load the MNIST dataset. We'll use torchvision for this. We'll also create data loaders for training and validation.
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader
# Transformations
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))  # MNIST mean and std
])
# Load MNIST dataset
train_dataset = torchvision.datasets.MNIST(root='./data', train=True, download=True, transform=transform)
val_dataset = torchvision.datasets.MNIST(root='./data', train=False, download=True, transform=transform)
# Create DataLoaders
batch_size = 64
train_loader = DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(dataset=val_dataset, batch_size=batch_size, shuffle=False)
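Before training, it can be helpful to pull a single batch from the loader and confirm its shape. This short check is our own sketch rather than part of the original setup:

# Optional check: inspect one batch to confirm tensor shapes
images, labels = next(iter(train_loader))
print(images.shape)  # expected: torch.Size([64, 1, 28, 28])
print(labels.shape)  # expected: torch.Size([64])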
We'll use the standard Cross-Entropy Loss function, suitable for multi-class classification.
criterion = nn.CrossEntropyLoss()
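Recall that the model returns raw logits rather than probabilities. As a small illustrative sketch (the dummy tensors below are our own and not part of the experiment), nn.CrossEntropyLoss takes logits of shape (batch, num_classes) and integer class labels, applying log-softmax internally:

# Illustration only: CrossEntropyLoss on dummy logits and integer labels
dummy_logits = torch.randn(4, num_classes)           # raw scores for a batch of 4 examples
dummy_labels = torch.tensor([3, 0, 7, 1])            # target class indices in [0, num_classes)
print(criterion(dummy_logits, dummy_labels).item())  # a single scalar loss value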
Our experiment will involve training identical instances of our SimpleMLP model on the MNIST training data using four different optimizers: plain SGD, SGD with Momentum, RMSprop, and Adam.

For each optimizer, we will:
- Initialize a fresh SimpleMLP model to ensure a fair start.
- Use the same learning rate (lr=0.001) as a starting point. Note that optimal learning rates often differ between optimizers, but we'll use a common one for this initial comparison.

Here's a sketch of the training function. We'll pass the model, data loaders, criterion, and the specific optimizer instance to this function.
import torch.optim as optim
from collections import defaultdict

def train_model(model, train_loader, val_loader, criterion, optimizer, num_epochs=10):
    """Trains the model and returns history of losses and accuracies."""
    history = defaultdict(list)
    print(f"Training with optimizer: {optimizer.__class__.__name__}")

    for epoch in range(num_epochs):
        model.train()  # Set model to training mode
        running_loss = 0.0
        for i, (images, labels) in enumerate(train_loader):
            # Zero the parameter gradients
            optimizer.zero_grad()

            # Forward pass
            outputs = model(images)
            loss = criterion(outputs, labels)

            # Backward pass and optimize
            loss.backward()
            optimizer.step()

            running_loss += loss.item()

        # Calculate average training loss for the epoch
        epoch_loss = running_loss / len(train_loader)
        history['train_loss'].append(epoch_loss)

        # Validation phase
        model.eval()  # Set model to evaluation mode
        correct = 0
        total = 0
        val_loss = 0.0
        with torch.no_grad():
            for images, labels in val_loader:
                outputs = model(images)
                loss = criterion(outputs, labels)
                val_loss += loss.item()
                _, predicted = torch.max(outputs.data, 1)
                total += labels.size(0)
                correct += (predicted == labels).sum().item()

        epoch_acc = 100 * correct / total
        avg_val_loss = val_loss / len(val_loader)
        history['val_loss'].append(avg_val_loss)
        history['val_accuracy'].append(epoch_acc)

        print(f'Epoch [{epoch+1}/{num_epochs}], Train Loss: {epoch_loss:.4f}, Val Loss: {avg_val_loss:.4f}, Val Accuracy: {epoch_acc:.2f}%')

    print("-" * 30)
    return history
# --- Experiment Execution ---
num_epochs = 10
learning_rate = 0.001
momentum = 0.9  # For SGD with Momentum

optimizers_to_test = {
    "SGD": lambda params: optim.SGD(params, lr=learning_rate),
    "Momentum": lambda params: optim.SGD(params, lr=learning_rate, momentum=momentum),
    "RMSprop": lambda params: optim.RMSprop(params, lr=learning_rate),
    "Adam": lambda params: optim.Adam(params, lr=learning_rate)
}

results = {}
for name, optimizer_lambda in optimizers_to_test.items():
    # Initialize a fresh model for each optimizer
    model = SimpleMLP(input_size, hidden_size, num_classes)
    optimizer_instance = optimizer_lambda(model.parameters())
    history = train_model(model, train_loader, val_loader, criterion, optimizer_instance, num_epochs=num_epochs)
    results[name] = history

# The results dictionary now holds the training/validation history for each optimizer
After running the training loops, the results dictionary contains the training loss and validation accuracy per epoch for each optimizer. Let's visualize these results to compare their performance.
We'll plot the training loss curves and validation accuracy curves.
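A minimal matplotlib sketch along the following lines can produce the two plots. This plotting code is our own addition and assumes the results dictionary built above:

import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

for name, history in results.items():
    epochs = range(1, len(history['train_loss']) + 1)
    ax1.plot(epochs, history['train_loss'], label=name)
    ax2.plot(epochs, history['val_accuracy'], label=name)

ax1.set_xlabel('Epoch')
ax1.set_ylabel('Training Loss')
ax1.set_title('Training Loss per Epoch')
ax1.legend()

ax2.set_xlabel('Epoch')
ax2.set_ylabel('Validation Accuracy (%)')
ax2.set_title('Validation Accuracy per Epoch')
ax2.legend()

plt.tight_layout()
plt.show()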
Figure: Comparison of training loss curves for different optimizers over 10 epochs.
Figure: Comparison of validation accuracy curves for different optimizers over 10 epochs.
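Alongside the plots, it can help to summarize the numbers directly. This short snippet (again our own addition) prints the last-epoch and best validation accuracy recorded for each optimizer:

# Print the final and best validation accuracy recorded for each optimizer
for name, history in results.items():
    final_acc = history['val_accuracy'][-1]
    best_acc = max(history['val_accuracy'])
    print(f"{name:10s} final val acc: {final_acc:.2f}%  best val acc: {best_acc:.2f}%")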
From the plots (using typical example results), we can observe several patterns:
- Adam and RMSprop usually drive the training loss down fastest in the first few epochs and reach high validation accuracy early.
- Plain SGD converges noticeably more slowly at this shared learning rate, while SGD with Momentum closes much of that gap.

Important Considerations: these curves come from a single run with one shared learning rate. Optimal learning rates differ between optimizers, and results also vary with initialization, architecture, and dataset, so treat this comparison as an illustration rather than a definitive ranking.
This practical exercise demonstrates the tangible differences between common optimization algorithms. Adaptive methods like Adam and RMSprop often provide faster convergence, making them efficient choices for many deep learning tasks. SGD with Momentum remains a strong contender, particularly when carefully tuned, and is sometimes favored for its potential generalization benefits in certain scenarios.
Understanding these behaviors through experimentation helps you make informed decisions when selecting and tuning optimizers for your own deep learning projects. Remember that while defaults like Adam work well in many cases, comparing alternatives can sometimes yield better results for your specific problem.