Having explored the theoretical underpinnings of AdaGrad, RMSprop, and Adam, let's put them into practice. Theoretical advantages are important, but seeing how these optimizers perform on a concrete machine learning task provides valuable intuition for their application. In this practical exercise, we will compare the convergence behavior and performance of these three prominent adaptive learning rate algorithms.
Our goal is to observe how quickly each optimizer minimizes the loss function and how the validation performance evolves during training. We'll use a standard setup to ensure a fair comparison.
We need a model, a dataset, and a loss function. For simplicity and reproducibility, let's use a small multi-layer perceptron (MLP), a standard image classification dataset with 28x28 grayscale inputs and 10 classes (such as MNIST), and the negative log-likelihood loss.
The core training loop remains similar regardless of the optimizer; only the instantiation step changes. Here's an outline using PyTorch:
Load Data: Prepare DataLoader instances for the training and validation sets.
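A minimal sketch of this step, assuming MNIST loaded through torchvision (any 28x28, 10-class dataset works the same way; the batch sizes are arbitrary choices):
# Data loading sketch (assumes torchvision is installed)
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

transform = transforms.ToTensor()
train_set = datasets.MNIST(root="data", train=True, download=True, transform=transform)
val_set = datasets.MNIST(root="data", train=False, download=True, transform=transform)
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
val_loader = DataLoader(val_set, batch_size=256, shuffle=False)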
Define Model: Instantiate your MLP model.
# Example model definition
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, 10)

    def forward(self, x):
        x = x.view(-1, 784)                    # Flatten the 28x28 image
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = F.log_softmax(self.fc3(x), dim=1)  # Output log probabilities
        return x

model = SimpleMLP()

# Move model to GPU if available
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# model.to(device)
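An optional sanity check (not part of the exercise itself): a quick forward pass on random data confirms the model produces one log-probability vector of length 10 per input.
# Dummy batch of four 28x28 "images" should yield a (4, 10) output
dummy = torch.randn(4, 1, 28, 28)
print(model(dummy).shape)  # torch.Size([4, 10])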
Define Loss: Use the negative log-likelihood loss, which pairs with the model's log_softmax output.
criterion = nn.NLLLoss()
Instantiate Optimizers: Create a separate instance for each optimizer you want to compare. For simplicity we use one common learning rate for all three here (note that PyTorch's own defaults differ: 0.01 for Adagrad and RMSprop, 0.001 for Adam) and leave the remaining hyperparameters at their defaults; all of them can be tuned.
import torch.optim as optim
# Common initial learning rate (example)
lr = 0.001
# Instantiate optimizers
optimizer_adagrad = optim.Adagrad(model.parameters(), lr=lr)
optimizer_rmsprop = optim.RMSprop(model.parameters(), lr=lr)
optimizer_adam = optim.Adam(model.parameters(), lr=lr)
# Store them for iteration (or run separate training scripts)
optimizers = {
    "AdaGrad": optimizer_adagrad,
    "RMSprop": optimizer_rmsprop,
    "Adam": optimizer_adam,
}
Note: You'll need to re-initialize the model weights before training with each optimizer to ensure a fair comparison starting from the same point.
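One simple way to do that is sketched below as an illustrative helper (the make_run name and structure are just one possible arrangement): re-seed before constructing a fresh model so every run starts from identical weights, and rebuild the optimizer around the fresh parameters.
# Illustrative helper: rebuild the model and its optimizer for each run so
# every comparison starts from the same initial weights.
def make_run(name, lr=0.001):
    torch.manual_seed(0)                # same seed -> same initialization
    fresh_model = SimpleMLP()
    opt_classes = {
        "AdaGrad": optim.Adagrad,
        "RMSprop": optim.RMSprop,
        "Adam": optim.Adam,
    }
    opt = opt_classes[name](fresh_model.parameters(), lr=lr)
    return fresh_model, opt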
Training Loop: For each optimizer, run the training process for a fixed number of epochs.
# Conceptual training loop (for one optimizer); assumes train_loader,
# val_loader, model, criterion, and optimizer are already defined.
num_epochs = 10
train_losses = []
val_accuracies = []

# Re-initialize model weights here before training with THIS optimizer.

for epoch in range(num_epochs):
    model.train()
    running_loss = 0.0
    for images, labels in train_loader:
        # images, labels = images.to(device), labels.to(device)  # Move data to device
        optimizer.zero_grad()              # Zero gradients for this batch
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()                    # Compute gradients
        optimizer.step()                   # Update weights
        running_loss += loss.item()

    epoch_loss = running_loss / len(train_loader)
    train_losses.append(epoch_loss)

    # Validation phase
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for images, labels in val_loader:
            # images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            predicted = outputs.argmax(dim=1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    epoch_acc = 100 * correct / total
    val_accuracies.append(epoch_acc)
    print(f'Epoch {epoch+1}, Loss: {epoch_loss:.4f}, Val Acc: {epoch_acc:.2f}%')

# Store the results for this optimizer before training the next one.
Record Results: Store the training loss and validation accuracy (or loss) for each epoch for every optimizer.
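For instance, if the loop above were wrapped in a hypothetical train_one_optimizer(model, optimizer) function that returns (train_losses, val_accuracies), the comparison could be driven like this, using the make_run helper sketched earlier:
# Hypothetical driver: train_one_optimizer wraps the loop shown above and
# returns (train_losses, val_accuracies) for one run.
results = {}
for name in ["AdaGrad", "RMSprop", "Adam"]:
    model, optimizer = make_run(name)   # fresh weights and a matching optimizer
    results[name] = train_one_optimizer(model, optimizer)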
Plotting the recorded metrics is the best way to compare performance.
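A minimal matplotlib sketch, assuming the results dictionary described above holds (train_losses, val_accuracies) for each optimizer:
# Plotting sketch: compare loss and accuracy curves across optimizers.
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
for name, (losses, accs) in results.items():
    ax1.plot(losses, label=name)
    ax2.plot(accs, label=name)
ax1.set_yscale("log")                   # log scale highlights early progress
ax1.set_xlabel("Epoch")
ax1.set_ylabel("Training loss")
ax2.set_xlabel("Epoch")
ax2.set_ylabel("Validation accuracy (%)")
ax1.legend()
ax2.legend()
plt.tight_layout()
plt.show()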
In the training-loss plot (log scale versus epochs), lower is better; RMSprop and Adam generally show a faster initial descent than AdaGrad. In the validation-accuracy plot, higher is better; Adam and RMSprop often reach high accuracy sooner than AdaGrad in this type of setup.
Typical Observations:
AdaGrad often makes solid early progress but slows down later, because its ever-growing sum of squared gradients keeps shrinking the effective learning rate.
RMSprop avoids this slowdown by replacing the sum with an exponentially decaying average, so its training loss usually continues to fall steadily.
Adam, which adds momentum on top of RMSprop-style scaling, typically converges the fastest and most smoothly; its final accuracy is usually similar to RMSprop's and ahead of AdaGrad's.
This practical comparison highlights the benefits of adaptive learning rate methods. By dynamically adjusting the step size for each parameter based on the history of gradients, AdaGrad, RMSprop, and Adam often lead to faster convergence than fixed-rate methods, especially on complex loss surfaces typical of neural networks.
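As a toy illustration of that shared mechanism (a simplified RMSprop-style update, not any library's exact implementation), the running average of squared gradients scales steps down wherever past gradients have been large:
# Toy per-parameter adaptation: two parameters, one with large gradients and
# one with small gradients, end up taking similar-sized steps because each
# step is divided by the square root of that parameter's gradient history.
import torch

lr, beta, eps = 0.01, 0.9, 1e-8
g = torch.tensor([10.0, 0.1])   # constant gradients for the two parameters
s = torch.zeros(2)              # running average of squared gradients
for _ in range(5):
    s = beta * s + (1 - beta) * g ** 2
    step = lr * g / (s.sqrt() + eps)
print(step)  # roughly tensor([0.0156, 0.0156]) despite a 100x gradient gap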
While Adam is frequently a strong default choice due to its robustness and combination of techniques, understanding the behavior of AdaGrad and RMSprop provides valuable context. Remember that the best optimizer can be problem-dependent. Experimentation, guided by the principles discussed in this chapter, remains essential for achieving optimal results in your machine learning tasks. This hands-on experience equips you to make informed decisions when selecting and using these powerful optimization tools.