Implementing Stochastic Gradient Descent (SGD), Momentum, and Nesterov Accelerated Gradient (NAG) in a common deep learning framework such as PyTorch demonstrates how to apply these optimization algorithms. Deep learning frameworks offer convenient implementations, enabling practitioners to focus on selecting and configuring the optimizer rather than developing update rules from scratch.
PyTorch's torch.optim module contains implementations of various optimization algorithms. The foundational optimizers we've discussed, including standard SGD, Momentum, and NAG, are all accessible through the torch.optim.SGD class.
To use standard SGD, you instantiate the SGD optimizer, passing the model's parameters (obtained via model.parameters()) and the desired learning rate (lr).
import torch
import torch.optim as optim
import torch.nn as nn
# Define an example model; in practice, substitute your own network
model = nn.Linear(10, 2)
# Instantiate the SGD optimizer
learning_rate = 0.01
optimizer = optim.SGD(model.parameters(), lr=learning_rate)
# --- Inside a typical training loop ---
# Assume 'data' and 'targets' are batches from your dataloader
# Assume 'criterion' is your loss function (e.g., nn.CrossEntropyLoss())
# for data, targets in dataloader: # Example loop start
#     # 1. Zero the gradients from the previous step
#     optimizer.zero_grad()
#
#     # 2. Perform the forward pass through the model
#     outputs = model(data)
#
#     # 3. Calculate the loss
#     loss = criterion(outputs, targets)
#
#     # 4. Perform the backward pass to compute gradients
#     loss.backward()
#
#     # 5. Update the model parameters using the optimizer
#     optimizer.step()
# --- End of training loop snippet ---
print(f"Optimizer instantiated: {optimizer}")
The core training loop structure remains consistent regardless of which optimizer from this family you use: zero gradients, forward pass, compute loss, backward pass, and then optimizer.step(). The optimizer.step() call is where the parameter update actually happens; it applies the specific update rule defined by the optimizer instance. For standard SGD, this corresponds to the update w←w−η∇L(w), where w represents the parameters, η is the learning rate, and ∇L(w) is the gradient of the loss.
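To make the connection between optimizer.step() and this update rule concrete, the following sketch performs the same plain-SGD update manually inside torch.no_grad(). The toy model, random data, and learning rate are assumptions chosen only for illustration.
import torch
import torch.nn as nn
# Toy model and data, assumed purely for illustration
model = nn.Linear(10, 2)
data = torch.randn(4, 10)
targets = torch.randint(0, 2, (4,))
criterion = nn.CrossEntropyLoss()
learning_rate = 0.01
# Forward and backward pass, as in the loop above
loss = criterion(model(data), targets)
loss.backward()
# Manual equivalent of optimizer.step() for plain SGD: w <- w - lr * grad
with torch.no_grad():
    for param in model.parameters():
        param -= learning_rate * param.grad
# Clear gradients afterwards, mirroring optimizer.zero_grad()
model.zero_grad()
Running optim.SGD with the same learning rate on the same gradients should produce the same parameter values; the optimizer simply packages this loop for you.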
To incorporate Momentum, you simply specify the momentum hyperparameter when creating the SGD optimizer. This parameter typically takes values between 0.5 and 0.99, with 0.9 being a common starting point.
import torch
import torch.optim as optim
import torch.nn as nn
# Define an example model; in practice, substitute your own network
model = nn.Linear(10, 2)
# Instantiate the SGD optimizer with Momentum
learning_rate = 0.01
momentum_coefficient = 0.9
optimizer = optim.SGD(model.parameters(), lr=learning_rate, momentum=momentum_coefficient)
# The training loop structure remains the same as standard SGD.
# optimizer.step() now applies the Momentum update rule.
print(f"Optimizer instantiated: {optimizer}")
Behind the scenes, when momentum is specified, optimizer.step() maintains a velocity vector v for each parameter and applies the Momentum update rule discussed previously:
v←γv+η∇L(w)
w←w−v
Here, γ is the momentum coefficient you provided. (PyTorch folds the learning rate in slightly differently, computing v←γv+∇L(w) and then w←w−ηv, which is equivalent to the rule above up to a rescaling of the velocity when the learning rate is constant.) The training loop code itself doesn't change, but the parameter updates performed by optimizer.step() are now accelerated.
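If you want to verify that these velocity vectors exist, you can inspect the optimizer's per-parameter state after a training step, where PyTorch stores them under the key momentum_buffer. Below is a minimal sketch; the toy linear model, random data, and hyperparameter values are assumptions for illustration.
import torch
import torch.nn as nn
import torch.optim as optim
# Toy model and data, assumed purely for illustration
model = nn.Linear(10, 2)
data = torch.randn(4, 10)
targets = torch.randint(0, 2, (4,))
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# One training step so the optimizer creates its velocity buffers
optimizer.zero_grad()
loss = criterion(model(data), targets)
loss.backward()
optimizer.step()
# Each parameter now has a velocity tensor of matching shape in the optimizer state
for param in model.parameters():
    velocity = optimizer.state[param]["momentum_buffer"]
    print(param.shape, velocity.shape)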
Nesterov Accelerated Gradient (NAG) is also available within the torch.optim.SGD class. To use it, you must provide a non-zero momentum value (similar to standard Momentum) and additionally set the boolean nesterov argument to True.
import torch
import torch.optim as optim
import torch.nn as nn
# Define an example model; in practice, substitute your own network
model = nn.Linear(10, 2)
# Instantiate the SGD optimizer with Nesterov Momentum
learning_rate = 0.01
momentum_coefficient = 0.9
optimizer = optim.SGD(model.parameters(), lr=learning_rate, momentum=momentum_coefficient, nesterov=True)
# The training loop structure remains the same.
# optimizer.step() now applies the NAG update rule.
print(f"Optimizer instantiated: {optimizer}")
Setting nesterov=True modifies the update step to incorporate the "lookahead" gradient calculation characteristic of NAG. Remember that NAG requires a positive momentum value: PyTorch rejects nesterov=True when momentum is zero (or when dampening is non-zero), raising an error rather than silently falling back to standard SGD.
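To make the lookahead idea concrete, the sketch below applies the classical NAG update by hand to a toy quadratic loss f(w) = 0.5‖w‖², whose gradient is simply w. The starting point and hyperparameters are illustrative assumptions, and the code mirrors the textbook formulation rather than PyTorch's internally rearranged computation.
import torch
# Classical NAG on a toy quadratic loss; all values are illustrative assumptions
w = torch.tensor([5.0, -3.0])
v = torch.zeros_like(w)
lr, gamma = 0.1, 0.9
for step in range(5):
    # 1. Look ahead to where the accumulated velocity is about to carry the parameters
    lookahead = w - gamma * v
    # 2. Evaluate the gradient at the lookahead point (the gradient of 0.5*||w||^2 is w)
    grad = lookahead
    # 3. Update the velocity, then the parameters
    v = gamma * v + lr * grad
    w = w - v
    print(f"step {step}: w = {w.tolist()}")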
While the implementation is straightforward thanks to libraries like PyTorch, selecting the right optimizer variant and tuning its hyperparameters (like lr and momentum) is important for achieving good model performance and efficient training.
In terms of hyperparameters, Momentum introduces the momentum coefficient, which needs tuning alongside the learning rate, and NAG uses the same hyperparameters (lr, momentum) as standard Momentum. Experimentation is often necessary. A common practice is to start with SGD with Momentum (or NAG) using typical hyperparameter values (e.g., lr=0.01, momentum=0.9) and then tune them based on the observed training dynamics, such as the learning curve. Visualizing the training loss curves for different optimizers or hyperparameter settings can provide valuable insight.
Illustrative comparison of training loss convergence for SGD, Momentum, and NAG on a task. Note how Momentum and NAG typically lead to faster decreases in loss compared to standard SGD. The exact behavior depends heavily on the dataset, model architecture, initialization, and hyperparameter settings.
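To produce a comparison like this yourself, you can train identical copies of a small model with each optimizer configuration and record the loss at every epoch. The sketch below uses a toy regression problem and illustrative hyperparameters, all of which are assumptions; the recorded histories can then be plotted with any plotting library.
import torch
import torch.nn as nn
import torch.optim as optim
# Toy regression data, assumed purely for illustration
torch.manual_seed(0)
X = torch.randn(256, 10)
true_w = torch.randn(10, 1)
y = X @ true_w + 0.1 * torch.randn(256, 1)
configs = {
    "SGD": dict(lr=0.01),
    "Momentum": dict(lr=0.01, momentum=0.9),
    "NAG": dict(lr=0.01, momentum=0.9, nesterov=True),
}
criterion = nn.MSELoss()
histories = {}
for name, kwargs in configs.items():
    torch.manual_seed(0)  # identical initialization for a fair comparison
    model = nn.Linear(10, 1)
    optimizer = optim.SGD(model.parameters(), **kwargs)
    losses = []
    for epoch in range(50):
        optimizer.zero_grad()
        loss = criterion(model(X), y)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    histories[name] = losses
    print(f"{name}: final loss = {losses[-1]:.4f}")
# 'histories' maps each optimizer name to its loss curve, ready for plotting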
These foundational optimizers, particularly SGD with Momentum/NAG, are still widely used and form the basis for understanding more complex adaptive methods. In the next chapter, we will explore algorithms like AdaGrad, RMSprop, and Adam, which introduce mechanisms to automatically adapt the learning rate during training.