Now that we've explored the mechanics of Stochastic Gradient Descent (SGD), Momentum, and Nesterov Accelerated Gradient (NAG), let's see how to put them into practice using a common deep learning framework like PyTorch. Fortunately, these frameworks provide convenient implementations, allowing us to focus on selecting and configuring the optimizer rather than writing the update rules from scratch.
PyTorch's torch.optim module contains implementations of various optimization algorithms. The foundational optimizers we've discussed, including standard SGD, Momentum, and NAG, are all accessible through the torch.optim.SGD class.
To use standard SGD, you instantiate the SGD optimizer, passing it the model's parameters (obtained via model.parameters()) and the desired learning rate (lr).
import torch
import torch.optim as optim
import torch.nn as nn

# Define your neural network; a small linear layer serves as a placeholder here
model = nn.Linear(10, 2)

# Instantiate the SGD optimizer with the model's parameters and a learning rate
learning_rate = 0.01
optimizer = optim.SGD(model.parameters(), lr=learning_rate)

# --- Inside a typical training loop ---
# Assume 'dataloader' yields (data, targets) batches and 'criterion' is your
# loss function (e.g., nn.CrossEntropyLoss())
# for data, targets in dataloader:
#     # 1. Zero the gradients accumulated from the previous step
#     optimizer.zero_grad()
#
#     # 2. Perform the forward pass through the model
#     outputs = model(data)
#
#     # 3. Calculate the loss
#     loss = criterion(outputs, targets)
#
#     # 4. Perform the backward pass to compute gradients
#     loss.backward()
#
#     # 5. Update the model parameters using the optimizer
#     optimizer.step()
# --- End of training loop snippet ---

print(f"Optimizer instantiated: {optimizer}")
The core training loop structure remains consistent regardless of the specific optimizer used within this family: zero the gradients, run the forward pass, compute the loss, run the backward pass, and then call optimizer.step(). The optimizer.step() call is where the actual update happens; it applies the specific update rule defined by the optimizer instance. For standard SGD, this corresponds to the update w←w−η∇L(w), where w represents the parameters, η is the learning rate, and ∇L(w) is the gradient of the loss.
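To make the connection between optimizer.step() and this rule concrete, here is a minimal sketch of an equivalent manual update, assuming a small placeholder model and dummy data. It illustrates the arithmetic only; it is not PyTorch's actual internal implementation, and in practice you would simply call optimizer.step().

# Manual equivalent of one plain SGD step: w <- w - lr * grad (illustration only)
import torch
import torch.nn as nn

model = nn.Linear(10, 2)          # placeholder model
learning_rate = 0.01

# Dummy forward/backward pass to populate the .grad attributes
inputs = torch.randn(4, 10)
targets = torch.randn(4, 2)
loss = nn.MSELoss()(model(inputs), targets)
loss.backward()

with torch.no_grad():
    for param in model.parameters():
        param -= learning_rate * param.grad   # w <- w - lr * grad

model.zero_grad()   # clear gradients, as optimizer.zero_grad() would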
To incorporate Momentum, you simply specify the momentum hyperparameter when creating the SGD optimizer. This parameter typically takes values between 0.5 and 0.99, with 0.9 being a common starting point.
import torch
import torch.optim as optim
import torch.nn as nn

# Define your neural network; a small linear layer serves as a placeholder here
model = nn.Linear(10, 2)

# Instantiate the SGD optimizer with Momentum
learning_rate = 0.01
momentum_coefficient = 0.9
optimizer = optim.SGD(model.parameters(), lr=learning_rate, momentum=momentum_coefficient)

# The training loop structure remains the same as standard SGD;
# optimizer.step() now applies the Momentum update rule.
print(f"Optimizer instantiated: {optimizer}")
Behind the scenes, when momentum is specified, optimizer.step() maintains a velocity vector v for each parameter and applies the Momentum update rule discussed previously:
v←γv+η∇L(w)
w←w−v
Here, γ is the momentum coefficient you provided. (Strictly speaking, PyTorch applies the learning rate outside the velocity accumulation, i.e., v←γv+∇L(w) followed by w←w−ηv, which is equivalent to the formulation above when the learning rate is constant.) The training loop code itself doesn't change, but the parameter updates performed by optimizer.step() are now accelerated.
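As a rough illustration, the sketch below performs one Momentum step by hand, keeping an explicit velocity buffer per parameter. It follows the convention described above, with the learning rate applied at the parameter update; the model and data are placeholders, and the real logic lives inside optim.SGD.

# Illustrative manual Momentum step: v <- gamma*v + grad, then w <- w - lr*v
import torch
import torch.nn as nn

model = nn.Linear(10, 2)          # placeholder model
learning_rate, gamma = 0.01, 0.9

# One velocity buffer per parameter, initialized to zero
velocities = [torch.zeros_like(p) for p in model.parameters()]

inputs, targets = torch.randn(4, 10), torch.randn(4, 2)
loss = nn.MSELoss()(model(inputs), targets)
loss.backward()

with torch.no_grad():
    for param, v in zip(model.parameters(), velocities):
        v.mul_(gamma).add_(param.grad)   # v <- gamma*v + grad
        param -= learning_rate * v       # w <- w - lr*v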
Nesterov Accelerated Gradient (NAG) is also available within the torch.optim.SGD class. To use it, you must provide a non-zero momentum value (just as with standard Momentum) and additionally set the boolean nesterov argument to True.
import torch
import torch.optim as optim
import torch.nn as nn

# Define your neural network; a small linear layer serves as a placeholder here
model = nn.Linear(10, 2)

# Instantiate the SGD optimizer with Nesterov Momentum
learning_rate = 0.01
momentum_coefficient = 0.9
optimizer = optim.SGD(model.parameters(), lr=learning_rate, momentum=momentum_coefficient, nesterov=True)

# The training loop structure remains the same;
# optimizer.step() now applies the NAG update rule.
print(f"Optimizer instantiated: {optimizer}")
Setting nesterov=True modifies the update step to incorporate the "lookahead" gradient calculation characteristic of NAG. Remember that NAG requires the momentum parameter to be greater than 0; setting nesterov=True while leaving momentum at 0 raises an error rather than silently falling back to standard SGD.
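For intuition about what changes, the sketch below adapts the manual Momentum example to the Nesterov form used by optim.SGD, where the parameter step uses the current gradient plus the momentum-scaled buffer. Again, this is an illustration with placeholder model and data; in practice you let optimizer.step() do the work.

# Illustrative manual Nesterov step:
#   b <- gamma*b + grad
#   w <- w - lr * (grad + gamma*b)
import torch
import torch.nn as nn

model = nn.Linear(10, 2)          # placeholder model
learning_rate, gamma = 0.01, 0.9
buffers = [torch.zeros_like(p) for p in model.parameters()]

inputs, targets = torch.randn(4, 10), torch.randn(4, 2)
loss = nn.MSELoss()(model(inputs), targets)
loss.backward()

with torch.no_grad():
    for param, b in zip(model.parameters(), buffers):
        b.mul_(gamma).add_(param.grad)                     # b <- gamma*b + grad
        param -= learning_rate * (param.grad + gamma * b)  # lookahead-corrected step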
While the implementation is straightforward thanks to libraries like PyTorch, selecting the right optimizer variant and tuning its hyperparameters (like lr and momentum) is important for achieving good model performance and efficient training. Momentum introduces the momentum hyperparameter, which needs tuning alongside the learning rate, and NAG uses the same hyperparameters (lr, momentum) as standard Momentum.
Experimentation is often necessary. A common practice is to start with SGD with Momentum (or NAG) using typical hyperparameter values (e.g., lr=0.01, momentum=0.9) and then tune them based on the observed training dynamics, such as the learning curve. Visualizing the training loss curves for different optimizers or hyperparameter settings can provide valuable insight.
Chart: Illustrative comparison of training loss convergence for SGD, Momentum, and NAG on a hypothetical task. Momentum and NAG typically lead to faster decreases in loss compared to standard SGD; the exact behavior depends heavily on the dataset, model architecture, initialization, and hyperparameter settings.
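As one way to run such a comparison yourself, the sketch below trains the same small model under each of the three configurations on a synthetic regression task and records the loss per epoch. The model, data, epoch count, and hyperparameter values are placeholders to adapt to your own setup.

# Compare SGD, Momentum, and NAG on a synthetic regression task (illustrative only)
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
X, y = torch.randn(256, 10), torch.randn(256, 2)   # synthetic data
criterion = nn.MSELoss()

configs = {
    "SGD":      dict(lr=0.01),
    "Momentum": dict(lr=0.01, momentum=0.9),
    "NAG":      dict(lr=0.01, momentum=0.9, nesterov=True),
}

loss_curves = {}
for name, kwargs in configs.items():
    torch.manual_seed(0)                 # identical initialization for a fair comparison
    model = nn.Linear(10, 2)
    optimizer = optim.SGD(model.parameters(), **kwargs)
    losses = []
    for epoch in range(100):
        optimizer.zero_grad()
        loss = criterion(model(X), y)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    loss_curves[name] = losses
    print(f"{name:8s} final loss: {losses[-1]:.4f}")

# loss_curves can now be plotted (e.g., with matplotlib) to compare convergence.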
These foundational optimizers, particularly SGD with Momentum/NAG, are still widely used and form the basis for understanding more complex adaptive methods. In the next chapter, we will explore algorithms like AdaGrad, RMSprop, and Adam, which introduce mechanisms to automatically adapt the learning rate during training.