Once your model architecture is defined, the loss function chosen, and an optimizer configured, the next step in the training process is to actually compute the gradients and use them to update your model's weights. In TensorFlow, you're likely accustomed to using tf.GradientTape to record operations, then tape.gradient() to calculate gradients, followed by optimizer.apply_gradients(). PyTorch offers a similarly explicit, yet slightly different, workflow that revolves around the loss.backward() method and the optimizer.step() call. Understanding this process is fundamental to writing effective PyTorch training loops.

This section covers the specifics of how PyTorch calculates gradients using its automatic differentiation engine, autograd, and how optimizers apply these gradients to adjust model parameters.
At the heart of every training iteration in PyTorch, after you've computed the loss, are two primary operations related to gradients and weight updates:

1. loss.backward(): This computes the gradients of the loss with respect to all model parameters that have requires_grad=True and were part of the computation graph leading to the loss. These gradients are stored in the .grad attribute of each respective parameter tensor.
2. optimizer.step(): This updates the values of the model parameters using the gradients computed in the backward pass, according to the optimization algorithm (e.g., SGD, Adam) you've chosen.

However, there's a small but very important preparatory step:

optimizer.zero_grad(): Before calculating new gradients for the current batch, you must clear any gradients accumulated from previous batches.

Let's explore each of these in detail.
optimizer.zero_grad()
In PyTorch, gradients are accumulated by default. This means that when loss.backward() is called, the newly computed gradients are added to the existing values in the .grad attribute of each parameter. While this accumulation can be useful in advanced scenarios (like implementing gradient accumulation for large batches that don't fit in memory, sketched at the end of this section), for typical training you want to calculate gradients based solely on the current batch.

Therefore, the very first step inside your training loop for each batch, before the forward pass or immediately after the previous optimizer step, is to zero out the gradients:
# Assuming optimizer is an instance of torch.optim.Optimizer
optimizer.zero_grad()
If you forget this step, your gradients will be a mix from current and previous batches, leading to incorrect updates and likely preventing your model from learning effectively. Think of it as clearing the slate before each new calculation. This contrasts slightly with TensorFlow's tf.GradientTape, where gradients are typically computed fresh for each call to tape.gradient(), unless the tape is made persistent.
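To see this accumulation in action, here is a minimal sketch using a single hypothetical scalar parameter w; the values in the comments follow from d(3w)/dw = 3:

import torch

w = torch.tensor(2.0, requires_grad=True)  # a stand-in for a model parameter
optimizer = torch.optim.SGD([w], lr=0.1)

loss = 3 * w
loss.backward()
print(w.grad)  # tensor(3.)

loss = 3 * w   # a fresh graph is built for each forward computation
loss.backward()
print(w.grad)  # tensor(6.) -- the new gradient was added, not assigned

optimizer.zero_grad()
print(w.grad)  # None on recent PyTorch (grads are set to None), or tensor(0.) on older versions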
loss.backward()
After you've performed a forward pass through your model to get predictions and then calculated the loss (e.g., loss = criterion(outputs, targets)), the next important command is:
loss.backward()
This single line triggers PyTorch's autograd engine. Here's what happens under the hood:
1. Graph construction: During the forward pass, autograd records a dynamic computation graph. Any tensor that has its requires_grad attribute set to True (which is the default for model parameters) and participates in an operation becomes part of this graph. The loss tensor is typically the root of this graph for the backward pass.
2. Backward traversal: loss.backward() traverses this graph backward from the loss tensor. It applies the chain rule to compute the derivative of the loss with respect to every tensor in the graph that has requires_grad=True and was an input to an operation leading to the loss.
3. Gradient storage: For model parameters (torch.nn.Parameter instances), the computed gradient dL/dp (where L is the loss and p is the parameter) is accumulated into its .grad attribute. For example, if layer.weight is a parameter, then after loss.backward(), layer.weight.grad will hold the gradient of the loss with respect to layer.weight.

If you're familiar with TensorFlow's tf.GradientTape, loss.backward() is analogous to calling tape.gradient(loss, model.trainable_variables). The key difference is that loss.backward() directly populates the .grad attributes of the parameters themselves, rather than returning a list of gradient tensors.
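As a concrete illustration, the following minimal sketch (the layer, shapes, and random data are illustrative) shows .grad starting out as None and then being populated by loss.backward():

import torch
import torch.nn as nn

layer = nn.Linear(4, 2)          # small example layer
inputs = torch.randn(8, 4)       # a batch of 8 random samples
targets = torch.randn(8, 2)
criterion = nn.MSELoss()

print(layer.weight.grad)         # None -- no backward pass has run yet

loss = criterion(layer(inputs), targets)
loss.backward()

print(layer.weight.grad.shape)   # torch.Size([2, 4]) -- same shape as layer.weight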
optimizer.step()
Once loss.backward() has populated the .grad attributes of your model's parameters, the optimizer has all the information it needs to update the weights. You trigger this update with:
optimizer.step()
The optimizer (e.g., an instance of torch.optim.SGD or torch.optim.Adam) was initialized with the model's parameters (e.g., optimizer = torch.optim.Adam(model.parameters(), lr=0.001)). When optimizer.step() is called, it iterates through all the parameters it was given. For each parameter p, it uses p.grad and its specific update rule to modify p.data (the actual tensor values).
For example, a simple SGD optimizer would perform an update like p.data = p.data - learning_rate * p.grad for each parameter.
This is similar to optimizer.apply_gradients(zip(grads, model.trainable_variables)) in TensorFlow, where grads would be the gradients obtained from tape.gradient().
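To make the update rule concrete, here is a rough sketch of what optimizer.step() amounts to for vanilla SGD (no momentum or weight decay), written by hand; the model and loss here are illustrative stand-ins:

import torch
import torch.nn as nn

model = nn.Linear(4, 2)          # stand-in for any nn.Module
learning_rate = 0.01

# A forward pass and backward pass to populate the .grad attributes
loss = (model(torch.randn(8, 4)) ** 2).mean()
loss.backward()

# Roughly equivalent to optimizer.step() for plain SGD:
# one in-place update per parameter, performed without tracking gradients
with torch.no_grad():
    for p in model.parameters():
        if p.grad is not None:
            p -= learning_rate * p.grad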
Here's how these steps fit into a typical training loop iteration:
# Assume:
# model: your torch.nn.Module instance
# data_loader: provides batches of (inputs, targets)
# criterion: your loss function (e.g., nn.CrossEntropyLoss())
# optimizer: your optimizer (e.g., optim.Adam(model.parameters()))
model.train() # Set the model to training mode
for inputs, targets in data_loader:
    # 1. Zero out the gradients from the previous iteration
    optimizer.zero_grad()
    # 2. Perform a forward pass: get predictions
    outputs = model(inputs)
    # 3. Calculate the loss
    loss = criterion(outputs, targets)
    # 4. Perform a backward pass: compute gradients of the loss with respect to model parameters
    loss.backward()
    # 5. Update model parameters using the computed gradients
    optimizer.step()
    # (Optional: logging, metrics calculation, etc.)
This loop structure is the standard way to train models in PyTorch, offering a clear view of each stage of the learning process.
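As an aside, the accumulation behavior discussed earlier enables the advanced scenario mentioned at the start of this section: simulating a larger batch by stepping only every few mini-batches. A sketch, reusing the same assumed model, data_loader, criterion, and optimizer as above, with accumulation_steps as an illustrative setting:

accumulation_steps = 4  # illustrative: effective batch = 4 x the loader's batch size

model.train()
optimizer.zero_grad()
for i, (inputs, targets) in enumerate(data_loader):
    outputs = model(inputs)
    # Scale so the accumulated gradient averages over the effective batch
    loss = criterion(outputs, targets) / accumulation_steps
    loss.backward()            # gradients pile up in .grad across iterations
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()       # apply the accumulated gradients
        optimizer.zero_grad()  # clear before the next accumulation window
# (A final partial window is dropped here for brevity.)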
Figure: the sequence of operations for gradient calculation and weight updates within a single PyTorch training iteration. Each step prepares for or executes a part of the learning process, from clearing old gradients to applying new updates.
Transitioning from TensorFlow's GradientTape to PyTorch's loss.backward() and optimizer.step() involves a few shifts in perspective:

- Explicit zeroing: You must remember to call optimizer.zero_grad() each iteration. PyTorch's default accumulation behavior is different from the typical GradientTape usage.
- In-place gradient storage: loss.backward() populates .grad attributes directly on parameter tensors. You don't typically handle a list of gradient tensors explicitly before passing them to the optimizer, as you might with tape.gradient().
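For readers cross-referencing the two frameworks, here is a minimal, self-contained sketch of the TensorFlow side of this comparison (the model, shapes, and hyperparameters are illustrative); the PyTorch equivalent is the five-step loop shown earlier:

import tensorflow as tf

# Illustrative setup
model = tf.keras.Sequential([tf.keras.layers.Dense(2)])
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
loss_fn = tf.keras.losses.MeanSquaredError()
inputs = tf.random.normal((8, 4))
targets = tf.random.normal((8, 2))

# One training step, GradientTape style
with tf.GradientTape() as tape:
    predictions = model(inputs, training=True)
    loss = loss_fn(targets, predictions)
grads = tape.gradient(loss, model.trainable_variables)          # explicit list of gradients
optimizer.apply_gradients(zip(grads, model.trainable_variables))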
By mastering these three core components (optimizer.zero_grad(), loss.backward(), and optimizer.step()), you gain fine-grained control over the training process, which is one of PyTorch's characteristic strengths. This explicit control allows for easier debugging and implementation of more complex training schemes.