Once you have defined your neural network architecture using torch.nn.Module and chosen an appropriate loss function to measure the difference between your model's predictions and the actual targets, the next step is to update the model's parameters (weights and biases) to minimize this loss. This is where optimizers come into play. The torch.optim package provides implementations of various optimization algorithms commonly used in deep learning.
Recall from the previous chapter on Autograd that calling loss.backward() computes the gradients of the loss with respect to all model parameters that have requires_grad=True. These gradients indicate the direction and magnitude of the change needed for each parameter to reduce the loss. However, simply computing gradients isn't enough; we need a mechanism to apply the corresponding parameter updates. Optimizers provide this mechanism.
At its core, training a neural network is an optimization problem. We want to find the set of parameters (weights w and biases b) that minimize the loss function L. Gradient descent is the foundational algorithm for this. The basic idea is to iteratively adjust the parameters in the direction opposite to the gradient:
θ_new = θ_old − η ∇_θ L

Here, θ represents a parameter (like a weight or bias), ∇_θ L is the gradient of the loss L with respect to θ, and η (eta) is the learning rate, a hyperparameter that controls the step size.
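To make the update rule concrete, here is a minimal sketch of one gradient descent step performed by hand, without torch.optim. The single parameter w, the toy loss, and the learning rate value are illustrative choices, not part of any particular model.

import torch

# A single learnable parameter and a toy loss: L = (w - 3)^2
w = torch.tensor(10.0, requires_grad=True)
lr = 0.1  # learning rate (eta)

loss = (w - 3) ** 2
loss.backward()          # populates w.grad with dL/dw

with torch.no_grad():    # the update itself must not be tracked by autograd
    w -= lr * w.grad     # theta_new = theta_old - eta * gradient
w.grad.zero_()           # clear the gradient before the next step

print(w.item())          # moved from 10.0 toward the minimum at 3.0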
PyTorch's torch.optim package implements this core idea along with several more sophisticated variations designed to improve convergence speed and stability.
To use an optimizer from the torch.optim package, you first need to import it:
import torch.optim as optim
Next, you instantiate an optimizer object. When creating it, you must tell the optimizer which parameters it should manage. Typically, you pass your model's parameters using the model.parameters() method. You also need to specify the learning rate (lr) and potentially other algorithm-specific hyperparameters.
# Assume 'model' is an instance of your nn.Module subclass
# Example: Using Stochastic Gradient Descent (SGD)
optimizer = optim.SGD(model.parameters(), lr=0.01)
# Example: Using Adam optimizer
optimizer = optim.Adam(model.parameters(), lr=0.001)
The model.parameters() call returns an iterator over all the learnable parameters within your defined model. The optimizer holds references to these tensors and knows how to update them based on their .grad attribute, which gets populated during the loss.backward() call.
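As a quick illustration of this relationship, the sketch below uses a small hypothetical nn.Linear model to show that .grad is empty until loss.backward() runs, and that optimizer.step() then modifies the parameters in place. The layer sizes and dummy data are arbitrary.

import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(4, 2)                       # a tiny example model
optimizer = optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(8, 4)                         # dummy input batch
target = torch.randn(8, 2)                    # dummy targets

weight_before = model.weight.clone()
print(model.weight.grad)                      # None: no gradients computed yet

loss = nn.functional.mse_loss(model(x), target)
loss.backward()
print(model.weight.grad.shape)                # torch.Size([2, 4]): filled by backward()

optimizer.step()                              # uses .grad to update the parameters
print(torch.allclose(weight_before, model.weight))  # typically False: weights changed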
While torch.optim offers many algorithms, Stochastic Gradient Descent (SGD) and Adam are two of the most frequently used starting points.
SGD is a classic optimization algorithm. In its PyTorch implementation, it can operate on mini-batches of data (which is standard practice) rather than single examples. It updates parameters based on the gradient computed for the current mini-batch.
The optim.SGD optimizer has several important arguments:

- params: The iterable of parameters to optimize (e.g., model.parameters()).
- lr: The learning rate (η). This is a critical hyperparameter. Choosing a value that's too small can lead to slow convergence, while a value that's too large can cause instability or divergence.
- momentum: A technique that helps accelerate SGD in the relevant direction and dampens oscillations. It adds a fraction of the previous update vector to the current one. A typical value is 0.9 (a simplified sketch of this update follows the code example below).
- weight_decay: Adds L2 regularization (a penalty on large weights) implicitly during the update step. This can help prevent overfitting.

# SGD with momentum and weight decay
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)
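To show roughly what the momentum term does, here is a simplified sketch of a momentum update written by hand. It mirrors the general idea, a velocity buffer that accumulates a decayed sum of past gradients, rather than claiming to reproduce optim.SGD's exact internals; the tensor and values are only illustrative.

import torch

w = torch.tensor(10.0, requires_grad=True)
lr, momentum = 0.1, 0.9
velocity = torch.zeros_like(w)

for _ in range(3):
    loss = (w - 3) ** 2
    loss.backward()
    with torch.no_grad():
        # the velocity keeps a fraction of the previous update direction
        velocity = momentum * velocity + w.grad
        w -= lr * velocity
    w.grad.zero_()
    print(w.item())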
Adam is an adaptive learning rate optimization algorithm, meaning it computes individual learning rates for different parameters. It combines ideas from RMSprop (which adapts learning rates based on the average of recent squared gradients) and Momentum. Adam often converges faster than SGD and is relatively robust to the choice of hyperparameters, often working well with default settings.
Key arguments for optim.Adam:

- params: The parameters to optimize.
- lr: The initial learning rate (Adam adapts it internally). A common starting point is 1e-3 (that is, 0.001).
- betas: A tuple (beta1, beta2) controlling the exponential decay rates for the moment estimates (usually (0.9, 0.999)).
- eps: A small term added to the denominator for numerical stability (typically 1e-8).
- weight_decay: Adds L2 regularization.

# Adam with default betas and specified learning rate
optimizer = optim.Adam(model.parameters(), lr=0.001)
# Adam with custom betas and weight decay
optimizer = optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999), weight_decay=1e-5)
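For intuition about how Adam adapts its step sizes, the sketch below implements a simplified Adam-style update by hand for a single tensor, following the commonly published form of the algorithm (first and second moment estimates with bias correction). Treat it as an illustration of the idea rather than a line-for-line copy of PyTorch's implementation; the tensor and constants are arbitrary.

import torch

w = torch.tensor(10.0, requires_grad=True)
lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
m = torch.zeros_like(w)  # first moment: running mean of gradients
v = torch.zeros_like(w)  # second moment: running mean of squared gradients

for t in range(1, 4):
    loss = (w - 3) ** 2
    loss.backward()
    with torch.no_grad():
        g = w.grad
        m = beta1 * m + (1 - beta1) * g          # update first moment
        v = beta2 * v + (1 - beta2) * g ** 2     # update second moment
        m_hat = m / (1 - beta1 ** t)             # bias correction
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (v_hat.sqrt() + eps)   # per-parameter adaptive step
    w.grad.zero_()
    print(w.item())

Because each gradient is divided by the square root of its own running magnitude, the effective step size stays on the order of lr, which is part of why Adam tends to be robust to the scale of the gradients.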
Other popular optimizers like RMSprop, Adagrad, and AdamW (Adam with improved weight decay handling) are also available in torch.optim. The choice often depends on the specific problem and empirical performance.
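As an example, AdamW is created the same way as the other optimizers; the learning rate and weight_decay values below are just typical illustrative choices.

# AdamW decouples weight decay from the gradient-based update
# (assumes 'model' is defined as in the earlier examples)
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)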
Using an optimizer within your training loop involves two primary calls per iteration: one before the backward pass to clear old gradients, and one after it to apply the parameter update.

- optimizer.zero_grad(): Before computing gradients for the current mini-batch (via loss.backward()), you must clear the gradients accumulated from the previous iteration. PyTorch accumulates gradients by default whenever backward() is called. If you forget to zero them out, gradients from multiple batches will mix, leading to incorrect updates. This call is usually placed at the beginning of the loop or just before the backward() call (the short sketch after this list shows the effect of skipping it).
- optimizer.step(): After computing the gradients with loss.backward(), calling optimizer.step() updates all the parameters registered with the optimizer. It applies the specific optimization algorithm (like SGD or Adam) using the computed gradients (stored in each parameter's .grad attribute) and the learning rate.
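The sketch below illustrates the accumulation behavior mentioned above: calling backward() twice without zeroing the gradients adds the two gradients together. The tensor and loss are arbitrary examples.

import torch

w = torch.tensor(2.0, requires_grad=True)

loss = (w * 3).sum()
loss.backward()
print(w.grad)      # tensor(3.)

loss = (w * 3).sum()
loss.backward()
print(w.grad)      # tensor(6.): the new gradient was added, not replaced

w.grad.zero_()     # manual reset; optimizer.zero_grad() clears every registered parameter
print(w.grad)      # tensor(0.)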
Here's a simplified structure of a training iteration incorporating the optimizer:
# Assume model, criterion (loss function), and optimizer are defined
# Assume data_loader provides batches of inputs and targets

model.train()  # Set model to training mode

for inputs, targets in data_loader:
    # 1. Zero out gradients from the previous iteration
    optimizer.zero_grad()

    # 2. Forward pass: compute model predictions
    outputs = model(inputs)

    # 3. Calculate the loss
    loss = criterion(outputs, targets)

    # 4. Backward pass: compute gradients
    loss.backward()

    # 5. Update weights
    optimizer.step()

    # (Optional: logging, metrics calculation, etc.)
To summarize the optimizer's role in the standard training cycle: the optimizer uses the gradients computed by loss.backward() to update the model's parameters via optimizer.step(), after ensuring previous gradients are cleared with optimizer.zero_grad().
Sometimes, it's beneficial to adjust the learning rate during training. For instance, you might want to start with a larger learning rate for faster initial progress and decrease it later to fine-tune the parameters more carefully. PyTorch provides learning rate schedulers in torch.optim.lr_scheduler for this purpose. These schedulers adjust the learning rate associated with an optimizer based on predefined rules (e.g., reducing it every few epochs or when validation performance plateaus). While powerful, detailed usage of schedulers is typically covered in more advanced contexts.
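As a brief illustration, the sketch below attaches a StepLR scheduler to an optimizer so that the learning rate is multiplied by 0.1 every 10 epochs. The specific numbers are arbitrary, and scheduler.step() is typically called once per epoch, after that epoch's optimizer.step() calls.

from torch.optim.lr_scheduler import StepLR

# assumes 'model' is defined as in the earlier examples
optimizer = optim.SGD(model.parameters(), lr=0.1)
scheduler = StepLR(optimizer, step_size=10, gamma=0.1)  # lr *= 0.1 every 10 epochs

for epoch in range(30):
    # ... run one epoch of training with optimizer.zero_grad() / loss.backward() / optimizer.step() ...
    scheduler.step()                    # advance the schedule once per epoch
    print(epoch, scheduler.get_last_lr())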
In summary, torch.optim is an indispensable tool for training neural networks in PyTorch. By choosing an appropriate optimizer, configuring its hyperparameters like the learning rate, and correctly integrating optimizer.zero_grad() and optimizer.step() into your training loop, you provide the mechanism for your model to learn from data and minimize the loss function. Experimenting with different optimizers and learning rates is a standard part of developing effective deep learning models.