You've successfully calculated the loss, which quantifies how far your model's predictions are from the actual target values. The next essential step is to understand how to adjust the model's parameters (weights and biases) to reduce this loss. This is where backpropagation comes into play.
In PyTorch, the magic happens when you call the .backward() method on the loss tensor.
# Assuming 'loss' is the scalar output from your loss function
loss.backward()
Calling loss.backward() triggers PyTorch's automatic differentiation engine, Autograd. Remember from Chapter 3 how PyTorch builds a computation graph dynamically as operations are performed during the forward pass? This graph records the sequence of operations that led from the input data and model parameters to the final loss value.
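As a small illustration (using made-up scalar tensors rather than a real model), each tensor produced by an operation on a tracked tensor carries a grad_fn reference to the operation that created it; this is the record Autograd will later walk backward through:
import torch

w = torch.tensor(2.0, requires_grad=True)  # tracked, like a model parameter
x = torch.tensor(3.0)                      # plain input data, not tracked
y = w * x                                  # recorded as a multiplication node
loss = (y - 1.0) ** 2                      # recorded as subtraction and power nodes
print(loss.grad_fn)                        # e.g. <PowBackward0 object at 0x...>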
The backward() call initiates a traversal backward through this graph, starting from the loss tensor itself. Using the chain rule of calculus, Autograd efficiently calculates the gradient of the loss with respect to each tensor in the graph that has its requires_grad attribute set to True.
Specifically, for every parameter θ in your model (like the weights and biases in nn.Linear or nn.Conv2d layers) that contributed to the computation of the loss, loss.backward() computes the partial derivative:

∂L/∂θ

This gradient, ∂L/∂θ, represents the sensitivity of the loss L to small changes in the parameter θ. Its magnitude tells us how strongly a change in θ affects L, and moving θ against the gradient is what decreases the loss.
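To make this concrete, here is a small hand-checkable sketch with a single made-up parameter w rather than a full model: for L = (w·x − t)², the chain rule gives ∂L/∂w = 2(w·x − t)·x, and Autograd reproduces exactly that value.
import torch

w = torch.tensor(2.0, requires_grad=True)  # a single "parameter"
x = torch.tensor(3.0)                      # input
t = torch.tensor(5.0)                      # target
loss = (w * x - t) ** 2                    # L = (w*x - t)^2
loss.backward()
print(w.grad)                # tensor(6.)
print(2 * (w * x - t) * x)   # manual chain-rule result: 2*(2*3 - 5)*3 = 6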
Where do these computed gradients go? PyTorch conveniently stores them directly within the .grad attribute of the corresponding parameter tensors.
import torch
import torch.nn as nn

# Example: accessing a linear layer's gradients after backpropagation
model = nn.Linear(10, 1)
inputs = torch.randn(4, 10)
targets = torch.randn(4, 1)
loss = nn.MSELoss()(model(inputs), targets)
loss.backward()
# Now you can inspect the gradients stored on the parameters
print(model.weight.grad)  # same shape as model.weight: (1, 10)
print(model.bias.grad)    # same shape as model.bias: (1,)
You'll typically find non-None values in the .grad attributes of your model's parameters after calling loss.backward(). Gradients for intermediate tensors within the computation graph are usually not retained to save memory, although this behavior can be modified if needed for debugging or advanced techniques. Model parameters defined via nn.Module are considered "leaf" nodes in the graph, and their gradients are preserved.
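If you do need an intermediate tensor's gradient, you can opt in with retain_grad() before calling backward(). The sketch below, using made-up tensors, shows the difference between a leaf tensor and an intermediate one:
import torch

x = torch.randn(3, requires_grad=True)  # leaf tensor: gradient is kept automatically
y = x * 2                               # intermediate tensor: gradient dropped by default
y.retain_grad()                         # explicitly ask PyTorch to keep y's gradient
loss = y.sum()
loss.backward()
print(x.is_leaf, y.is_leaf)  # True False
print(x.grad)                # tensor([2., 2., 2.])
print(y.grad)                # tensor([1., 1., 1.]), only because of retain_grad()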
The forward pass builds the computation graph (solid lines). Calling loss.backward() initiates the backward pass (dotted lines), calculating gradients starting from the loss and storing them in the .grad attribute of tensors like W and B.
A significant behavior to understand is that PyTorch accumulates gradients. When you call loss.backward(), the newly computed gradients are added to whatever value is already present in the .grad attribute of the parameters. They do not overwrite the previous values.
This accumulation is intentional and useful in certain scenarios, such as simulating larger batch sizes or training Recurrent Neural Networks (RNNs). However, in a standard training loop where you process one batch at a time, you typically want to calculate gradients based only on the current batch. If you don't clear the gradients from the previous iteration, you'll be accumulating gradients from multiple batches, leading to incorrect parameter updates.
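The sketch below, using a small made-up model and batch, shows the accumulation directly: calling backward() twice without zeroing doubles the stored gradient.
import torch
import torch.nn as nn

model = nn.Linear(4, 1)
inputs = torch.randn(8, 4)
targets = torch.randn(8, 1)
loss_fn = nn.MSELoss()

loss_fn(model(inputs), targets).backward()
first = model.weight.grad.clone()           # gradient from the first backward pass

loss_fn(model(inputs), targets).backward()  # same batch again, no zeroing in between
print(torch.allclose(model.weight.grad, 2 * first))  # True: gradients were added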
This is why, as mentioned in the chapter introduction and detailed later, you must explicitly zero out the gradients at the beginning of each training iteration, usually before the forward pass or just before calling loss.backward(). The standard way to do this involves the optimizer, using optimizer.zero_grad().
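Here is a minimal sketch of that per-iteration pattern, assuming a hypothetical setup with a small model, loss function, optimizer, and a few random batches:
import torch
import torch.nn as nn

# Hypothetical setup so the loop below runs end to end
model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
batches = [(torch.randn(16, 10), torch.randn(16, 1)) for _ in range(3)]

for inputs, targets in batches:
    optimizer.zero_grad()                   # clear gradients from the previous iteration
    loss = loss_fn(model(inputs), targets)  # forward pass and loss
    loss.backward()                         # gradients for this batch only
    optimizer.step()                        # parameter update (covered next)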
With the gradients computed and stored in the parameters' .grad attributes, the next logical step is to use these gradients to update the model's parameters, which is handled by the optimizer's step() method.