Stochastic Gradient Descent (SGD), Mini-batch Gradient Descent, and the Momentum update each behave differently in practice, and watching them on a concrete task helps build intuition about their strengths and weaknesses. We'll set up a simple regression problem and train a basic linear model with each of these optimization strategies. Our goal is to observe differences in convergence speed and in the smoothness of the learning process.

### Setting Up the Experiment

First, let's define a simple synthetic dataset. We'll generate data following a linear relationship with some added noise, which is a common scenario.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader
import numpy as np

# Generate synthetic data: y = 2x + 1 + noise
np.random.seed(42)
X_numpy = np.random.rand(100, 1) * 10
y_numpy = 2 * X_numpy + 1 + np.random.randn(100, 1) * 2  # Add some noise

# Convert to PyTorch tensors
X = torch.tensor(X_numpy, dtype=torch.float32)
y = torch.tensor(y_numpy, dtype=torch.float32)

# Define a simple linear model
model = nn.Linear(1, 1)

# Define the loss function (Mean Squared Error)
criterion = nn.MSELoss()

# Create the dataset
dataset = TensorDataset(X, y)
```

We have 100 data points where $y$ is approximately $2x + 1$. Our `nn.Linear(1, 1)` model tries to learn the weight (slope, target = 2) and the bias (intercept, target = 1). We'll use Mean Squared Error (MSE) as our loss function.

### Defining the Optimizers

We will compare three optimization approaches:

1. **Mini-batch Gradient Descent (batch size 16):** A common practical approach. We'll use PyTorch's `SGD` optimizer with a standard learning rate.
2. **Stochastic Gradient Descent (batch size 1):** The extreme case of mini-batch training, updating after every single sample.
3. **SGD with Momentum (batch size 16):** The Momentum variant, which can accelerate convergence.

We need separate model instances and optimizers for each experiment to ensure a fair comparison.

```python
# Hyperparameters
learning_rate = 0.001
momentum_factor = 0.9
num_epochs = 50

# --- Optimizers ---

# 1. Mini-batch GD (batch size 16)
model_minibatch = nn.Linear(1, 1)
optimizer_minibatch = optim.SGD(model_minibatch.parameters(), lr=learning_rate)
dataloader_minibatch = DataLoader(dataset, batch_size=16, shuffle=True)

# 2. SGD (batch size 1)
model_sgd = nn.Linear(1, 1)
optimizer_sgd = optim.SGD(model_sgd.parameters(), lr=learning_rate)
dataloader_sgd = DataLoader(dataset, batch_size=1, shuffle=True)

# 3. SGD with Momentum (batch size 16)
model_momentum = nn.Linear(1, 1)
optimizer_momentum = optim.SGD(model_momentum.parameters(), lr=learning_rate,
                               momentum=momentum_factor)
# Same batch size as Mini-batch GD for a fair comparison
dataloader_momentum = DataLoader(dataset, batch_size=16, shuffle=True)
```

Notice that for true SGD we set `batch_size=1`, while Mini-batch and Momentum use `batch_size=16`. The Momentum optimizer is identical to the Mini-batch one except for the added `momentum=0.9` parameter.
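To see what that single extra argument changes, here is a simplified, conceptual sketch of the update `optim.SGD` applies when `momentum` is set (assuming the defaults: no dampening, no Nesterov, no weight decay). The helper `momentum_step` is ours for illustration, not part of PyTorch; the real optimizer performs the equivalent update per parameter tensor inside `optimizer.step()`.

```python
import torch

def momentum_step(param, grad, velocity, lr=0.001, momentum=0.9):
    """Conceptual SGD-with-momentum update for a single parameter tensor."""
    velocity = momentum * velocity + grad   # accumulate a running "velocity" of past gradients
    param = param - lr * velocity           # step along the velocity, not the raw gradient
    return param, velocity

# With a few consistent gradients, the effective step size grows:
w = torch.tensor(0.0)
v = torch.zeros_like(w)
for g in [1.0, 1.0, 1.0]:
    w, v = momentum_step(w, torch.tensor(g), v)
print(v)  # tensor(2.7100): 1 + 0.9 + 0.81, larger than any single gradient
```

This accumulation is why Momentum speeds up progress along directions where successive gradients agree, and why it dampens oscillations where they do not.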
### Training Loop

The training loop structure is similar for all optimizers. We iterate through epochs, and within each epoch, we iterate through the data batches provided by the `DataLoader`.

```python
def train_model(model, optimizer, dataloader, criterion, epochs):
    """Helper function to train a model and record the average loss per epoch."""
    epoch_losses = []
    for epoch in range(epochs):
        epoch_loss = 0.0
        num_batches = 0
        for inputs, targets in dataloader:
            # Zero the parameter gradients
            optimizer.zero_grad()

            # Forward pass
            outputs = model(inputs)
            loss = criterion(outputs, targets)

            # Backward pass and optimize
            loss.backward()
            optimizer.step()

            epoch_loss += loss.item()
            num_batches += 1

        avg_epoch_loss = epoch_loss / num_batches
        epoch_losses.append(avg_epoch_loss)

        # Optional: print progress
        # if (epoch + 1) % 10 == 0:
        #     print(f'Epoch [{epoch+1}/{epochs}], Loss: {avg_epoch_loss:.4f}')

    return epoch_losses

# Train each model
losses_minibatch = train_model(model_minibatch, optimizer_minibatch,
                               dataloader_minibatch, criterion, num_epochs)
losses_sgd = train_model(model_sgd, optimizer_sgd,
                         dataloader_sgd, criterion, num_epochs)
losses_momentum = train_model(model_momentum, optimizer_momentum,
                              dataloader_momentum, criterion, num_epochs)

print("Training finished.")
```

### Comparing Performance

The most straightforward way to compare these optimizers is by plotting their training loss over the epochs. Lower loss generally indicates a better model fit, and a faster decrease indicates quicker convergence.
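The figure below was produced with an interactive charting library. If you want to reproduce a similar plot locally, a minimal sketch with matplotlib (not imported in the listings above) could look like this:

```python
import matplotlib.pyplot as plt

epochs = range(1, num_epochs + 1)
plt.plot(epochs, losses_minibatch, label="Mini-batch (BS=16)")
plt.plot(epochs, losses_sgd, label="SGD (BS=1)")
plt.plot(epochs, losses_momentum, label="Momentum (BS=16)")
plt.yscale("log")  # a log scale keeps the late, nearly flat part of the curves readable
plt.xlabel("Epoch")
plt.ylabel("Average MSE Loss")
plt.title("Optimizer Comparison: Training Loss")
plt.legend()
plt.show()
```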
*Figure (Optimizer Comparison: Training Loss): Average training loss per epoch for Mini-batch GD (batch size 16), SGD (batch size 1), and SGD with Momentum (batch size 16) on the synthetic linear regression task. Note the logarithmic scale on the y-axis.*

### Analysis of Results

Let's analyze the plot (keeping in mind that results might vary slightly due to random initialization and data shuffling):

- **Mini-batch GD (blue line):** Shows a relatively smooth decrease in loss. Using batches of 16 samples averages out some of the gradient noise present in pure SGD, leading to stable convergence. It reaches a low loss value reasonably quickly.
- **SGD (orange line):** The loss curve is noticeably noisier, especially in the initial epochs. Because updates are based on single samples, the gradients can fluctuate significantly, causing the loss to jump around more. While it eventually converges, it takes more epochs than Mini-batch GD or Momentum to reach a similar loss level in this example. The noise can sometimes help escape poor local minima, but it often slows down convergence.
- **Momentum (green line):** Typically shows the fastest initial convergence. The momentum term helps accelerate progress along consistent descent directions and dampens oscillations that might occur with plain SGD or Mini-batch GD. It quickly settles near the minimum loss value. In this simple case, it converges significantly faster than the other two methods.

### Takeaways

This practical comparison highlights the characteristics we discussed earlier:

- **Batch size matters:** Mini-batch GD (batch size > 1) offers a balance between the computational efficiency of SGD and the stable convergence of Batch GD (which we didn't run because of its inefficiency on large datasets). Its convergence is smoother than pure SGD's.
- **SGD is noisy:** Using a batch size of 1 introduces significant variance into the gradient estimates, leading to a noisy optimization path. This can sometimes be beneficial but often slows down convergence.
- **Momentum accelerates:** By incorporating velocity, Momentum often converges faster than plain SGD or Mini-batch GD, especially on surfaces where the gradient direction is somewhat consistent or when navigating shallow regions.

This exercise demonstrates that even foundational algorithms have distinct behaviors. While Mini-batch SGD is a workhorse, Momentum often provides a noticeable speedup, paving the way for the adaptive methods we'll explore in the next chapter. Remember that the best choice often depends on the specific problem, dataset, and model architecture; experimentation is frequently necessary.
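As a final, optional sanity check (not part of the original listings), you can inspect the parameters each model actually learned and compare them with the generating values of weight 2 and bias 1. With only 50 epochs and a small learning rate, the bias in particular may not have fully settled, so treat the printed numbers as indicative rather than exact.

```python
# The data were generated with weight = 2 and bias = 1, so values in that
# neighborhood indicate a reasonable fit.
for name, m in [("Mini-batch", model_minibatch),
                ("SGD", model_sgd),
                ("Momentum", model_momentum)]:
    print(f"{name}: weight = {m.weight.item():.3f}, bias = {m.bias.item():.3f}")
```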