You've seen how calling `loss.backward()` computes the gradients of the loss with respect to your model's parameters (those tensors with `requires_grad=True`). These gradients, representing the direction and magnitude of change needed to reduce the loss, are stored in the `.grad` attribute of each parameter tensor. Next, `optimizer.step()` uses these stored gradients to update the parameter values according to the chosen optimization algorithm (such as SGD or Adam).

However, there's a subtle but significant detail about how PyTorch handles gradients during the backward pass: PyTorch accumulates gradients. When you call `loss.backward()`, the newly computed gradients for each parameter are added to whatever value is already present in that parameter's `.grad` attribute.
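You can observe this accumulation directly with a toy scalar parameter. The sketch below calls `backward()` twice without resetting the gradient in between; the names and values are illustrative:

```python
import torch

# A single scalar parameter makes the accumulation easy to see.
w = torch.tensor(2.0, requires_grad=True)

loss = 3 * w          # d(loss)/dw = 3
loss.backward()
print(w.grad)         # tensor(3.)

loss = 3 * w          # recompute the same loss
loss.backward()       # the new gradient is *added* to the old one
print(w.grad)         # tensor(6.), not tensor(3.)
```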
Consider what happens over multiple iterations of the training loop if you don't address this:
1. **Iteration 1:** Call `loss_1.backward()`. Gradients $\nabla_{\theta} L_1$ are computed and stored in `param.grad`. Call `optimizer.step()`.
2. **Iteration 2:** Call `loss_2.backward()`. New gradients $\nabla_{\theta} L_2$ are computed. PyTorch *adds* these to the existing gradients, so `param.grad` now holds $\nabla_{\theta} L_1 + \nabla_{\theta} L_2$. Call `optimizer.step()`.

The optimizer step in iteration 2 uses incorrect gradient information: a mix of gradients from the current batch and the previous batch. This prevents the model from learning effectively, as the weight updates are based on stale, combined gradient signals from different data points. The sketch below reproduces this scenario.
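Here is a minimal reproduction of the two-iteration scenario, with an arbitrary scalar parameter and loss chosen so the numbers are easy to follow:

```python
import torch

# Illustrative scalar parameter and target.
w = torch.tensor(1.0, requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

for step in (1, 2):
    loss = (w - 3.0) ** 2  # stand-in for a per-batch loss
    loss.backward()
    # In iteration 2 this prints the *sum* of both iterations' gradients,
    # because nothing cleared w.grad in between.
    print(f"step {step}: gradient seen by the optimizer = {w.grad.item():.2f}")
    optimizer.step()
```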
## optimizer.zero_grad()
To prevent this accumulation and ensure that the optimizer updates the weights based only on the gradients from the current batch, you must manually reset the gradients before calculating them for the next iteration. This is precisely what the `optimizer.zero_grad()` method does.

Calling `optimizer.zero_grad()` iterates over all the parameters ($\theta$) the optimizer was configured to manage and resets their `.grad` attribute to zero (or `None`).
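A quick sketch of the reset itself. Note that in recent PyTorch versions, `zero_grad()` sets `.grad` to `None` by default via its `set_to_none=True` argument, which is slightly more memory efficient; pass `set_to_none=False` to get zero-filled tensors instead:

```python
import torch

w = torch.tensor(1.0, requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

loss = (w - 3.0) ** 2
loss.backward()
print(w.grad)          # tensor(-4.)

optimizer.zero_grad()  # resets the gradient before the next backward pass
print(w.grad)          # None with the default set_to_none=True
```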
## When to Call optimizer.zero_grad()
You need to call `optimizer.zero_grad()` once per training iteration. The most common and recommended practice is to call it at the beginning of the loop, before processing the next batch:
```python
# Example Training Loop Snippet
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(num_epochs):
    for inputs, labels in dataloader:
        # 1. Zero the gradients from the previous iteration
        optimizer.zero_grad()

        # 2. Forward pass: compute predicted outputs
        outputs = model(inputs)

        # 3. Calculate the loss
        loss = criterion(outputs, labels)

        # 4. Backward pass: compute gradient of the loss w.r.t. parameters
        loss.backward()

        # 5. Perform a single optimization step (parameter update)
        optimizer.step()

    # ... (Evaluation, logging, etc.) ...
```
Alternatively, you could call `optimizer.zero_grad()` immediately after `optimizer.step()`; the functional outcome in a standard loop is the same. Placing it at the start clearly marks the beginning of processing for a new batch. The essential point is that it must be called before the next `loss.backward()` call to avoid accumulating gradients across iterations.
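If you prefer the alternative placement, the loop body simply reorders the same calls (using the same assumed `model`, `criterion`, and `dataloader` as above):

```python
for inputs, labels in dataloader:
    outputs = model(inputs)
    loss = criterion(outputs, labels)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()  # reset right after the update, so the next
                           # loss.backward() starts from clean gradients
```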
Forgetting to zero the gradients is a common source of bugs in PyTorch training loops, often leading to models that fail to converge or exhibit erratic learning behavior. Always ensure `optimizer.zero_grad()` is correctly placed within your training iteration.
While gradient accumulation is usually undesirable, it can be used intentionally to simulate larger batch sizes when GPU memory is limited. In that case, you perform the forward and backward passes for several mini-batches, accumulating gradients by *not* calling `optimizer.zero_grad()` after each `loss.backward()`, and only call `optimizer.step()` and `optimizer.zero_grad()` after processing the desired number of mini-batches. For typical training, however, zeroing the gradients every iteration is the correct procedure.
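A minimal sketch of that pattern follows. `accumulation_steps` is an illustrative name, and the loss is divided by it so the accumulated gradient approximates the average over the larger effective batch (assuming a mean-reduced criterion):

```python
accumulation_steps = 4  # illustrative: number of mini-batches to accumulate

optimizer.zero_grad()
for i, (inputs, labels) in enumerate(dataloader):
    outputs = model(inputs)
    # Scale so the summed gradients average over the effective batch.
    loss = criterion(outputs, labels) / accumulation_steps
    loss.backward()  # gradients keep accumulating in .grad

    if (i + 1) % accumulation_steps == 0:
        optimizer.step()       # update with the accumulated gradients
        optimizer.zero_grad()  # start a fresh accumulation window
```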