Now that we've covered the theory behind Stochastic Gradient Descent (SGD), Mini-batch Gradient Descent, and the Momentum update, let's see how they perform in practice. Understanding their behavior on a concrete task helps build intuition about their strengths and weaknesses.
We'll set up a simple regression problem and train a basic linear model using these different optimization strategies. Our goal is to observe differences in convergence speed and the smoothness of the learning process.
First, let's define a simple synthetic dataset. We'll generate data following a linear relationship with some added noise, which is a common scenario.
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader
import numpy as np
# Generate synthetic data: y = 2x + 1 + noise
np.random.seed(42)
X_numpy = np.random.rand(100, 1) * 10
y_numpy = 2 * X_numpy + 1 + np.random.randn(100, 1) * 2 # Add some noise
# Convert to PyTorch tensors
X = torch.tensor(X_numpy, dtype=torch.float32)
y = torch.tensor(y_numpy, dtype=torch.float32)
# Define a simple linear model
model = nn.Linear(1, 1)
# Define the loss function (Mean Squared Error)
criterion = nn.MSELoss()
# Create datasets
dataset = TensorDataset(X, y)
We have 100 data points where y is approximately 2x + 1. Our nn.Linear(1, 1) model tries to learn the weight (slope, target ≈ 2) and bias (intercept, target ≈ 1). We'll use Mean Squared Error (MSE) as our loss function.
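If you want to see exactly what MSE measures, a quick optional check like the one below computes the loss by hand on the untrained model's predictions and compares it against nn.MSELoss; the two values should match.
# Optional check: compute MSE by hand on the untrained model's predictions
# and compare it with nn.MSELoss -- the two values should be identical.
with torch.no_grad():
    preds = model(X)
    manual_mse = ((preds - y) ** 2).mean()
    print(f"Manual MSE:   {manual_mse.item():.4f}")
    print(f"nn.MSELoss(): {criterion(preds, y).item():.4f}")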
We will compare three optimization approaches: Mini-batch Gradient Descent (batch size 16), true SGD (batch size 1), and SGD with Momentum (batch size 16). All three use PyTorch's optim.SGD optimizer with a standard learning rate; the only differences are the batch size and the momentum setting. We need separate model instances and optimizers for each experiment to ensure a fair comparison.
# Hyperparameters
learning_rate = 0.001
momentum_factor = 0.9
num_epochs = 50
# --- Optimizers ---
# 1. Mini-batch GD (Batch Size 16)
model_minibatch = nn.Linear(1, 1)
optimizer_minibatch = optim.SGD(model_minibatch.parameters(), lr=learning_rate)
dataloader_minibatch = DataLoader(dataset, batch_size=16, shuffle=True)
# 2. SGD (Batch Size 1)
model_sgd = nn.Linear(1, 1)
optimizer_sgd = optim.SGD(model_sgd.parameters(), lr=learning_rate)
dataloader_sgd = DataLoader(dataset, batch_size=1, shuffle=True)
# 3. SGD with Momentum (Batch Size 16)
model_momentum = nn.Linear(1, 1)
optimizer_momentum = optim.SGD(model_momentum.parameters(), lr=learning_rate, momentum=momentum_factor)
# We use the same dataloader as Mini-batch GD for fair comparison
dataloader_momentum = DataLoader(dataset, batch_size=16, shuffle=True)
Notice that for true SGD, we set batch_size=1. For Mini-batch and Momentum, we use batch_size=16. The Momentum optimizer is identical to the Mini-batch one, except for the added momentum=0.9 parameter.
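If you're curious what that momentum=0.9 parameter actually does, here is a simplified sketch of the update rule applied when momentum is enabled (ignoring extras such as dampening, weight decay, and Nesterov momentum); the function name and velocity buffers are purely illustrative, not part of the experiment above.
# A simplified sketch of the momentum update (illustrative only; the real
# optim.SGD also handles dampening, weight decay, and Nesterov momentum).
# Each parameter p keeps a "velocity" buffer v, initialized to zeros:
#   v <- momentum * v + grad     (accumulate a running direction)
#   p <- p - lr * v              (step along the velocity)
def sgd_momentum_step(params, velocities, lr=0.001, momentum=0.9):
    with torch.no_grad():
        for p, v in zip(params, velocities):
            if p.grad is None:
                continue
            v.mul_(momentum).add_(p.grad)   # v = momentum * v + grad
            p.add_(v, alpha=-lr)            # p = p - lr * v
The velocity term is what lets consecutive gradients that point in the same direction build up speed, while gradients that flip sign partially cancel out.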
The training loop structure is similar for all optimizers. We iterate through epochs, and within each epoch, we iterate through the data batches provided by the DataLoader.
def train_model(model, optimizer, dataloader, criterion, epochs):
    """Helper function to train a model and record loss per epoch."""
    epoch_losses = []
    for epoch in range(epochs):
        epoch_loss = 0.0
        num_batches = 0
        for inputs, targets in dataloader:
            # Zero the parameter gradients
            optimizer.zero_grad()
            # Forward pass
            outputs = model(inputs)
            loss = criterion(outputs, targets)
            # Backward pass and optimize
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
            num_batches += 1
        avg_epoch_loss = epoch_loss / num_batches
        epoch_losses.append(avg_epoch_loss)
        # Optional: print progress every 10 epochs
        # if (epoch + 1) % 10 == 0:
        #     print(f'Epoch [{epoch+1}/{epochs}], Loss: {avg_epoch_loss:.4f}')
    return epoch_losses
# Train each model
losses_minibatch = train_model(model_minibatch, optimizer_minibatch, dataloader_minibatch, criterion, num_epochs)
losses_sgd = train_model(model_sgd, optimizer_sgd, dataloader_sgd, criterion, num_epochs)
losses_momentum = train_model(model_momentum, optimizer_momentum, dataloader_momentum, criterion, num_epochs)
print("Training finished.")
The most straightforward way to compare these optimizers is by plotting their training loss over the epochs. Lower loss generally indicates better model fit, and faster decrease indicates quicker convergence.
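The plotting code itself isn't central to the comparison, but a minimal sketch along these lines, assuming matplotlib is available in your environment, produces the figure below.
import matplotlib.pyplot as plt  # assumed to be installed

plt.figure(figsize=(8, 5))
plt.plot(losses_minibatch, label='Mini-batch GD (batch size 16)')
plt.plot(losses_sgd, label='SGD (batch size 1)')
plt.plot(losses_momentum, label='SGD with Momentum (batch size 16)')
plt.yscale('log')  # log scale makes small differences near zero visible
plt.xlabel('Epoch')
plt.ylabel('Average MSE Loss')
plt.title('Optimizer Comparison on Synthetic Linear Regression')
plt.legend()
plt.show()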
Figure: Average training loss per epoch for Mini-batch GD (batch size 16), SGD (batch size 1), and SGD with Momentum (batch size 16) on the synthetic linear regression task. Note the logarithmic scale on the y-axis.
Let's analyze the plot, keeping in mind that results might vary slightly due to random initialization and data shuffling. The curve for true SGD (batch size 1) is typically the noisiest: each update is based on a single example, so the gradient estimate fluctuates heavily from step to step. Mini-batch GD traces a much smoother path, because averaging the gradient over 16 examples reduces that variance. SGD with Momentum usually drives the loss down fastest, since the accumulated velocity dampens oscillations and keeps pushing the parameters along directions where gradients consistently agree.
This practical comparison highlights the characteristics we discussed earlier: smaller batches trade smoothness for cheaper, more frequent updates, while momentum accelerates convergence at almost no additional cost.
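As a final check, not shown in the plot, you can inspect the learned parameters directly. If training succeeded, each model's weight and bias should land reasonably close to the true values of 2 and 1, though the exact numbers will vary from run to run.
# Inspect the learned weight (slope) and bias (intercept) of each model.
# The data was generated with y = 2x + 1 + noise, so values near 2 and 1
# indicate a good fit; exact results depend on the random seed and shuffling.
for name, m in [("Mini-batch GD", model_minibatch),
                ("SGD", model_sgd),
                ("Momentum", model_momentum)]:
    print(f"{name:15s} weight={m.weight.item():.3f}  bias={m.bias.item():.3f}")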
This exercise demonstrates that even foundational algorithms have distinct behaviors. While Mini-batch SGD is a workhorse, Momentum often provides a noticeable speedup, paving the way for the adaptive methods we'll explore in the next chapter. Remember that the best choice often depends on the specific problem, dataset, and model architecture. Experimentation is frequently necessary.