AdaGrad, RMSprop, and Adam are adaptive learning rate algorithms with important theoretical advantages. Observing how these optimizers perform on a concrete machine learning task provides valuable intuition for their application. This section compares the convergence behavior and performance of these three prominent adaptive learning rate algorithms.

Our goal is to observe how quickly each optimizer minimizes the loss function and how the validation performance evolves during training. We'll use a standard setup to ensure a fair comparison.

## Setting Up the Experiment

We need a model, a dataset, and a loss function. For simplicity and reproducibility, let's use:

- **Dataset:** The MNIST handwritten digits dataset. It's a classic benchmark, readily available in most ML frameworks. We'll use the standard training and validation splits.
- **Model:** A simple Multi-Layer Perceptron (MLP). A network with one or two hidden layers is sufficient to observe optimizer differences. For instance, an architecture like: Input (784 features) -> Linear(128 units) -> ReLU -> Linear(64 units) -> ReLU -> Linear(10 units) -> LogSoftmax.
- **Loss Function:** Negative Log-Likelihood Loss (NLLLoss) or Cross-Entropy Loss, suitable for multi-class classification.
- **Framework:** We'll assume the use of PyTorch or TensorFlow/Keras, as implementing these optimizers from scratch is beyond the scope of this section. These frameworks provide readily available implementations.

## Implementation Steps

The core training loop remains similar regardless of the optimizer, but the instantiation step changes. Here's an outline using PyTorch-like syntax:

1. **Load Data:** Prepare `DataLoader` instances for the training and validation sets.

2. **Define Model:** Instantiate your MLP model.

```python
# Example model definition
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, 10)

    def forward(self, x):
        x = x.view(-1, 784)  # Flatten image
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = F.log_softmax(self.fc3(x), dim=1)  # Output log probabilities
        return x

model = SimpleMLP()
# Move model to GPU if available
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# model.to(device)
```

3. **Define Loss:**

```python
criterion = nn.NLLLoss()
```

4. **Instantiate Optimizers:** Create separate instances for each optimizer you want to compare. We'll use standard default hyperparameters initially, but remember these can be tuned.

```python
import torch.optim as optim

# Common initial learning rate (example)
lr = 0.001

# Instantiate optimizers
optimizer_adagrad = optim.Adagrad(model.parameters(), lr=lr)
optimizer_rmsprop = optim.RMSprop(model.parameters(), lr=lr)
optimizer_adam = optim.Adam(model.parameters(), lr=lr)

# Store them for iteration (or run separate training scripts)
optimizers = {
    "AdaGrad": optimizer_adagrad,
    "RMSprop": optimizer_rmsprop,
    "Adam": optimizer_adam
}
```

*Note:* You'll need to re-initialize the model weights before training with *each* optimizer to ensure a fair comparison starting from the same point.
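One simple way to guarantee this is to rebuild both the model and its optimizer for every run, re-seeding PyTorch first so each run starts from identical weights. The helper below is a minimal sketch, assuming the imports and `SimpleMLP` definition above; the function name `make_model_and_optimizer` and the fixed seed are illustrative choices, not part of the setup so far.

```python
def make_model_and_optimizer(name, lr=0.001, seed=0):
    """Return a freshly initialized model and a matching optimizer.

    Re-seeding before construction gives every optimizer run the same
    starting weights, so the comparison is fair.
    """
    torch.manual_seed(seed)
    model = SimpleMLP()
    optimizer_classes = {
        "AdaGrad": optim.Adagrad,
        "RMSprop": optim.RMSprop,
        "Adam": optim.Adam,
    }
    optimizer = optimizer_classes[name](model.parameters(), lr=lr)
    return model, optimizer

# Usage: model, optimizer = make_model_and_optimizer("Adam")
```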
5. **Training Loop:** For each optimizer, run the training process for a fixed number of epochs.

```python
# Training loop (for one optimizer)
num_epochs = 10
train_losses = []
val_accuracies = []

# Re-initialize model weights here before starting training for THIS optimizer

for epoch in range(num_epochs):
    model.train()
    running_loss = 0.0
    for images, labels in train_loader:
        # images, labels = images.to(device), labels.to(device)  # Move data to device
        optimizer.zero_grad()             # Zero gradients for this batch
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()                   # Compute gradients
        optimizer.step()                  # Update weights
        running_loss += loss.item()
    epoch_loss = running_loss / len(train_loader)
    train_losses.append(epoch_loss)

    # Validation phase
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for images, labels in val_loader:
            # images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    epoch_acc = 100 * correct / total
    val_accuracies.append(epoch_acc)

    print(f'Epoch {epoch+1}, Loss: {epoch_loss:.4f}, Val Acc: {epoch_acc:.2f}%')

# Store results for this optimizer before training the next one
```

6. **Record Results:** Store the training loss and validation accuracy (or loss) for each epoch for every optimizer.

## Visualizing and Interpreting Results

Plotting the recorded metrics is the best way to compare performance.

[Figure: "Training Loss Comparison" — line chart of training loss (NLL, log scale) versus epoch for AdaGrad, RMSprop, and Adam.]

Training loss (log scale) versus epochs for AdaGrad, RMSprop, and Adam. Lower is better. Note the generally faster initial descent of RMSprop and Adam compared to AdaGrad.

[Figure: "Validation Accuracy Comparison" — line chart of validation accuracy (%) versus epoch for AdaGrad, RMSprop, and Adam.]

Validation accuracy versus epochs. Higher is better. Adam and RMSprop often achieve higher accuracy faster than AdaGrad in this type of setup.
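Comparison plots like the two above are straightforward to generate from the recorded metrics. Below is a minimal matplotlib sketch; it assumes the per-optimizer histories were collected into a dict of the form `results = {"Adam": {"train_losses": [...], "val_accuracies": [...]}, ...}`, which is a naming convention of this sketch rather than something defined earlier.

```python
import matplotlib.pyplot as plt

def plot_comparison(results):
    """Plot training loss (log scale) and validation accuracy for each optimizer."""
    fig, (ax_loss, ax_acc) = plt.subplots(1, 2, figsize=(12, 4))
    for name, history in results.items():
        epochs = range(1, len(history["train_losses"]) + 1)
        ax_loss.plot(epochs, history["train_losses"], label=name)
        ax_acc.plot(epochs, history["val_accuracies"], label=name)
    ax_loss.set(title="Training Loss Comparison", xlabel="Epoch",
                ylabel="Training Loss (NLL)", yscale="log")
    ax_acc.set(title="Validation Accuracy Comparison", xlabel="Epoch",
               ylabel="Validation Accuracy (%)")
    ax_loss.legend(title="Optimizer")
    ax_acc.legend(title="Optimizer")
    plt.tight_layout()
    plt.show()
```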
**Typical Observations:**

- **Initial Convergence:** Adam and RMSprop usually exhibit faster initial progress in reducing the loss compared to AdaGrad. This is often attributed to RMSprop's fix for AdaGrad's rapidly diminishing learning rates and Adam's addition of momentum.
- **AdaGrad's Learning Rate Decay:** You might observe that AdaGrad's progress slows down significantly in later epochs. Because it accumulates squared gradients over all past time steps, the learning rate for frequently updated parameters can become very small, potentially hindering convergence to the optimal solution.
- **Adam as a Default:** Adam often performs well with default hyperparameters ($\beta_1=0.9, \beta_2=0.999$) across a range of problems, making it a popular starting point. It combines the benefits of RMSprop (per-parameter scaling based on recent gradient magnitudes) and momentum.
- **Final Performance:** While Adam and RMSprop often converge faster, the final validation performance might be very close between them, or sometimes even slightly better for RMSprop or a well-tuned SGD with momentum in specific scenarios. AdaGrad might lag behind if its learning rates decay too aggressively.
- **Hyperparameter Sensitivity:** Although adaptive methods reduce the need to tune the base learning rate $\eta$ compared to standard SGD, their performance can still be sensitive to other hyperparameters (like $\beta_1, \beta_2$ in Adam, or $\alpha$ in RMSprop) and the initial learning rate itself. The default values are good starting points, but tuning might yield better results; the sketch at the end of this section shows how these knobs map onto the optimizer constructors.

## Conclusion

This practical comparison highlights the benefits of adaptive learning rate methods. By dynamically adjusting the step size for each parameter based on the history of gradients, AdaGrad, RMSprop, and Adam often lead to faster convergence than fixed-rate methods, especially on complex loss surfaces typical of neural networks.

While Adam is frequently a strong default choice due to its robustness and combination of techniques, understanding the behavior of AdaGrad and RMSprop provides valuable context. Remember that the best optimizer can be problem-dependent. Experimentation, guided by the principles discussed in this chapter, remains essential for achieving optimal results in your machine learning tasks. This hands-on experience equips you to make informed decisions when selecting and using these powerful optimization tools.
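For reference, the hyperparameters called out in the observations above map directly onto the PyTorch optimizer constructors. The sketch below simply makes them explicit, reusing the `model` and `lr = 0.001` from the setup earlier; the remaining values are PyTorch's documented defaults and should be treated as starting points for tuning, not recommendations.

```python
import torch.optim as optim

lr = 0.001  # Same base learning rate used in the comparison above

# PyTorch defaults written out so the tunable knobs are visible.
optimizer_adam = optim.Adam(model.parameters(), lr=lr,
                            betas=(0.9, 0.999),  # decay rates for the 1st/2nd moment estimates
                            eps=1e-8)
optimizer_rmsprop = optim.RMSprop(model.parameters(), lr=lr,
                                  alpha=0.99,    # smoothing constant for the squared-gradient average
                                  eps=1e-8)
optimizer_adagrad = optim.Adagrad(model.parameters(), lr=lr,
                                  eps=1e-10)
```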