When you trained models in TensorFlow using Keras, you likely specified your optimization algorithm within the model.compile() step, selecting from choices like tf.keras.optimizers.Adam or tf.keras.optimizers.SGD. PyTorch organizes its optimization algorithms within the torch.optim package. These optimizers are the engines that drive learning, adjusting your model's parameters (weights and biases) to minimize the loss function based on the computed gradients.
In PyTorch, you instantiate an optimizer and explicitly link it to your model's parameters. This is a slight departure from Keras, where the optimizer is associated with the model object more broadly during compilation.
To use an optimizer from torch.optim, you first create an instance of your model (which is an nn.Module subclass). Then, you pass the model's parameters to the optimizer's constructor, along with any algorithm-specific hyperparameters like the learning rate.
import torch
import torch.nn as nn
import torch.optim as optim
# Assume 'model' is an instance of your nn.Module subclass
# For example:
# class SimpleNet(nn.Module):
#     def __init__(self):
#         super(SimpleNet, self).__init__()
#         self.fc1 = nn.Linear(784, 128)
#         self.relu = nn.ReLU()
#         self.fc2 = nn.Linear(128, 10)
#
#     def forward(self, x):
#         x = self.fc1(x)
#         x = self.relu(x)
#         x = self.fc2(x)
#         return x
#
# model = SimpleNet()
# A placeholder model for optimizer instantiation examples
model = nn.Linear(10, 2) # A simple model with learnable parameters
# Instantiate an SGD optimizer
learning_rate_sgd = 0.01
momentum_sgd = 0.9
optimizer_sgd = optim.SGD(model.parameters(), lr=learning_rate_sgd, momentum=momentum_sgd)
# Instantiate an Adam optimizer
learning_rate_adam = 0.001
optimizer_adam = optim.Adam(model.parameters(), lr=learning_rate_adam)
The model.parameters() method returns an iterator over all learnable parameters in your model. This tells the optimizer which tensors it is responsible for updating. This is a key step that connects your defined model structure with the optimization process.
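If you want to see exactly which tensors the optimizer will manage, you can iterate over them yourself. A small sketch using the placeholder nn.Linear model defined above:
# Inspect the tensors that model.parameters() exposes to the optimizer
for name, param in model.named_parameters():
    print(name, tuple(param.shape), param.requires_grad)
# weight (2, 10) True
# bias (2,) True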
As we've established, PyTorch requires you to write the training loop explicitly. The optimizer plays a central role in this loop, typically involving three distinct steps for each batch of data:
1. optimizer.zero_grad(): This call resets the gradients of all model parameters that the optimizer is managing. It's crucial to do this at the start of each iteration (or before the loss.backward() call for the current batch). PyTorch accumulates gradients by default when loss.backward() is called multiple times, so if you don't zero them out, gradients from previous batches are added to the current batch's gradients, leading to incorrect updates. TensorFlow Keras's model.fit() handles gradient resetting automatically behind the scenes.
2. loss.backward(): After computing the loss for the current batch, calling backward() on the loss tensor computes the gradients of the loss with respect to all model parameters that have requires_grad=True and were involved in the loss computation. These gradients are stored in the .grad attribute of each parameter tensor.
3. optimizer.step(): This method updates the values of the model parameters. It uses the gradients stored in each parameter's .grad attribute (computed by loss.backward()) and applies the specific optimization algorithm's update rule (e.g., SGD with momentum, Adam's update). If you've written custom training loops in TensorFlow, this is analogous to optimizer.apply_gradients().
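Putting these three calls together, here is a minimal sketch of the per-batch cycle. It reuses the imports, model, and optimizer_sgd defined earlier; the loss function and the dummy batch below are placeholders standing in for a real DataLoader pipeline.
criterion = nn.CrossEntropyLoss()
inputs = torch.randn(32, 10)            # a dummy batch: 32 samples, 10 features (matches nn.Linear(10, 2))
targets = torch.randint(0, 2, (32,))    # dummy integer class labels for the 2 output classes
for step in range(5):                   # stands in for iterating over batches from a DataLoader
    optimizer_sgd.zero_grad()           # 1. clear any gradients accumulated so far
    outputs = model(inputs)             # forward pass
    loss = criterion(outputs, targets)  # compute the loss for this batch
    loss.backward()                     # 2. populate each parameter's .grad attribute
    optimizer_sgd.step()                # 3. apply the SGD update rule using those gradients
    print(step, loss.item())            # the loss should trend downward over these steps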
Here's a diagram illustrating the optimizer's actions within a single training iteration:
The optimizer's cycle: zeroing gradients, allowing new gradients to be computed from the loss, and then applying updates to the model's parameters.
Most optimization algorithms you're familiar with from TensorFlow Keras have direct counterparts in torch.optim. Their underlying mathematical principles are generally the same, though default hyperparameter values or naming conventions might sometimes differ slightly.
SGD is a foundational optimizer. In PyTorch, it's available as torch.optim.SGD.
# PyTorch signature
optim.SGD(params, lr=<required>, momentum=0, dampening=0, weight_decay=0, nesterov=False)
# TensorFlow Keras signature
tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.0, nesterov=False, ...)
# PyTorch SGD
optimizer_sgd_pytorch = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
The weight_decay parameter in PyTorch's SGD implements L2 regularization: at each step it adds weight_decay * parameter to that parameter's gradient, which is equivalent to adding an L2 penalty term to the loss and effectively penalizes large weights.
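To see this mechanism concretely, the short sketch below (with arbitrary values) uses a loss whose gradient is zero, so the only change to the parameter comes from weight decay: SGD subtracts lr * weight_decay * param, exactly what an explicit L2 penalty on the loss would produce.
w = nn.Parameter(torch.ones(3))                  # a standalone parameter just for demonstration
opt = optim.SGD([w], lr=0.1, weight_decay=0.5)
loss = (w * 0.0).sum()                           # a loss whose gradient w.r.t. w is zero
loss.backward()
opt.step()
print(w.data)                                    # tensor([0.9500, 0.9500, 0.9500]): 1.0 - 0.1 * 0.5 * 1.0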
Adam is a popular adaptive learning rate optimization algorithm.
# PyTorch signature
optim.Adam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-8, weight_decay=0, amsgrad=False)
# TensorFlow Keras signature
tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-7, amsgrad=False, ...)
# PyTorch Adam
optimizer_adam_pytorch = optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999), eps=1e-8)
Notice that PyTorch uses a tuple betas=(beta1, beta2) while TensorFlow Keras takes beta_1 and beta_2 as separate arguments. The default epsilon values also differ slightly (1e-8 in PyTorch versus 1e-7 in Keras). PyTorch's Adam also includes a weight_decay parameter for L2 regularization, which is distinct from the decoupled weight decay in AdamW.
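If you are reproducing a Keras Adam configuration exactly, map the arguments explicitly rather than relying on PyTorch's defaults. A sketch, mirroring Keras's default values:
# Matching tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-7)
optimizer_adam_ported = optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999), eps=1e-7)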
AdamW modifies Adam by decoupling the weight decay from the gradient-based updates. This often leads to better performance and generalization than applying L2 regularization directly with Adam.
# PyTorch signature
optim.AdamW(params, lr=0.001, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.01, amsgrad=False)
# TensorFlow Keras signature
tf.keras.optimizers.AdamW(learning_rate=0.001, weight_decay=0.004, beta_1=0.9, beta_2=0.999, epsilon=1e-7, ...)
# PyTorch AdamW
optimizer_adamw_pytorch = optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)
Both frameworks provide this improved version of Adam. The default weight_decay values might differ, so always check the documentation if you are porting exact hyperparameters.
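For example, PyTorch's AdamW defaults to weight_decay=0.01 while Keras's AdamW defaults to 0.004, so replicating a Keras setup means passing the values explicitly (a sketch):
# Matching tf.keras.optimizers.AdamW's defaults for weight decay and epsilon
optimizer_adamw_ported = optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.004, eps=1e-7)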
RMSprop is another adaptive learning rate algorithm.
# PyTorch signature
optim.RMSprop(params, lr=0.01, alpha=0.99, eps=1e-8, weight_decay=0, momentum=0, centered=False)
# TensorFlow Keras signature
tf.keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9, momentum=0.0, epsilon=1e-7, centered=False, ...)
# PyTorch RMSprop
optimizer_rmsprop_pytorch = optim.RMSprop(model.parameters(), lr=0.01, alpha=0.99, momentum=0.0)
A key parameter name difference is alpha in PyTorch versus rho in TensorFlow for the smoothing constant.
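The default values also differ (alpha=0.99 and lr=0.01 in PyTorch versus rho=0.9 and learning_rate=0.001 in Keras), so a direct port again sets them explicitly (a sketch):
# Matching tf.keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9, epsilon=1e-7)
optimizer_rmsprop_ported = optim.RMSprop(model.parameters(), lr=0.001, alpha=0.9, eps=1e-7)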
The following table summarizes these common optimizers and their usage patterns:
| Feature/Optimizer | TensorFlow Keras (tf.keras.optimizers) | PyTorch (torch.optim) | Notes |
|---|---|---|---|
| Instantiation | optimizer = SGD(learning_rate=0.01) | optimizer = optim.SGD(model.parameters(), lr=0.01) | PyTorch optimizers require model.parameters() at initialization. |
| SGD | SGD(...) | SGD(...) | lr, momentum, nesterov are generally consistent. PyTorch adds weight_decay. |
| Adam | Adam(beta_1, beta_2, epsilon) | Adam(betas=(b1, b2), eps) | Naming and some defaults for the betas and epsilon may vary. PyTorch adds weight_decay. |
| AdamW | AdamW(weight_decay, ...) | AdamW(weight_decay, ...) | Decoupled weight decay. Recommended over Adam with L2 regularization. Default weight_decay may differ. |
| RMSprop | RMSprop(rho, ...) | RMSprop(alpha, ...) | rho (TF) vs alpha (PyTorch) for the smoothing constant. Default lr and epsilon may differ. |
| Gradient zeroing | Automatic in model.fit(); manual in custom loops | Explicit optimizer.zero_grad() must be called | PyTorch accumulates gradients by default if not cleared. |
| Weight update | Automatic in model.fit(); opt.apply_gradients() | Explicit optimizer.step() must be called | Updates parameters based on their .grad attribute. |
A powerful feature of PyTorch optimizers is the ability to specify different hyperparameters for different groups of model parameters. This is done by passing a list of dictionaries to the optimizer, where each dictionary defines a parameter group and its specific options. This is particularly useful for fine-tuning, where you might want a pre-trained base model's layers to have a much smaller learning rate than a newly added classification head.
# model.feature_extractor = nn.Sequential(...)
# model.classifier = nn.Linear(...)
optimizer_param_groups = optim.SGD([
    {'params': model.feature_extractor.parameters(), 'lr': 1e-4},  # Smaller LR for base
    {'params': model.classifier.parameters(), 'lr': 1e-2}          # Larger LR for new layers
], momentum=0.9)
If a hyperparameter is not specified in a group dictionary, it defaults to the value passed to the optimizer's constructor (e.g., momentum=0.9 in the example above applies to all groups).
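You can verify how the options were distributed by inspecting the optimizer's param_groups attribute; each group carries its own copy of every hyperparameter. A sketch continuing the example above (which assumes a model with feature_extractor and classifier attributes):
# Each parameter group stores its own hyperparameters
for i, group in enumerate(optimizer_param_groups.param_groups):
    print(i, group['lr'], group['momentum'])
# 0 0.0001 0.9
# 1 0.01 0.9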
Adjusting the learning rate during training is a common technique to improve convergence and final model performance. In Keras, you might use callbacks like ReduceLROnPlateau or LearningRateScheduler. PyTorch provides a similar mechanism through its torch.optim.lr_scheduler module.
Schedulers are instantiated by passing an optimizer instance and scheduler-specific arguments. You then typically call scheduler.step() after each epoch (or sometimes after each batch, depending on the scheduler type and your strategy).
from torch.optim.lr_scheduler import StepLR, ReduceLROnPlateau
# Example: Using StepLR to decay learning rate by a factor of gamma every step_size epochs
# optimizer = optim.SGD(model.parameters(), lr=0.1) # Assume optimizer is already defined
scheduler_steplr = StepLR(optimizer_sgd, step_size=30, gamma=0.1)
# Example: Using ReduceLROnPlateau to reduce LR when a metric has stopped improving
# val_loss would be tracked during validation
scheduler_plateau = ReduceLROnPlateau(optimizer_adam, 'min', patience=5, factor=0.5, verbose=True)
# In your training loop, after optimizer.step():

# For epoch-based schedulers like StepLR:
# for epoch in range(num_epochs):
#     # ... training phase ...
#     scheduler_steplr.step()

# For metric-based schedulers like ReduceLROnPlateau:
# for epoch in range(num_epochs):
#     # ... training phase ...
#     val_loss = ...  # obtained from the validation phase
#     scheduler_plateau.step(val_loss)
Various schedulers are available, including those for exponential decay, cosine annealing, and more complex strategies.
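For instance, cosine annealing is available as CosineAnnealingLR, and you can monitor any schedule through scheduler.get_last_lr() or the optimizer's param_groups. A sketch reusing optimizer_sgd from earlier:
from torch.optim.lr_scheduler import CosineAnnealingLR
# Anneal the learning rate from its initial value down to eta_min over T_max epochs
scheduler_cosine = CosineAnnealingLR(optimizer_sgd, T_max=50, eta_min=1e-5)
# for epoch in range(num_epochs):
#     # ... training phase ...
#     scheduler_cosine.step()
#     print(epoch, scheduler_cosine.get_last_lr())  # the learning rate(s) currently in use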
Transitioning from TensorFlow's optimizer handling within model.compile() to PyTorch's explicit torch.optim usage requires a more hands-on approach. However, this explicitness gives you finer control over the training process and a clearer view of how parameter updates occur. By understanding how to instantiate optimizers, manage their workflow with zero_grad(), backward(), and step(), and leverage features like parameter groups and learning rate schedulers, you can effectively train your PyTorch models.