Building upon our understanding of dynamic computational graphs, let's examine the engine that brings them to life for gradient computation: autograd. This system is the foundation of PyTorch's ability to automatically calculate gradients, which is essential for training neural networks via backpropagation.
At its core, autograd performs reverse-mode automatic differentiation. When you perform operations on tensors where requires_grad is set to True, PyTorch constructs a graph representing the sequence of operations. This graph is built dynamically as computations happen. The autograd engine then traverses this graph backward, starting from the final output (typically a scalar loss), to compute the gradients of that output with respect to the parameters (the leaf nodes of the graph, usually model weights and biases).
The backward() Call
The process is typically initiated by calling the .backward() method on a tensor, most commonly the scalar loss value computed at the end of a forward pass.
import torch
# Example setup
w = torch.randn(5, 3, requires_grad=True)
x = torch.randn(1, 5)
# Ensure x does not require gradients if it's just input data
# x.requires_grad_(False) # or create it without requires_grad
y = x @ w # y depends on w through matrix multiplication
z = y.mean() # z is a scalar derived from y
# Compute gradients starting from z
z.backward()
# Gradient is now populated in w.grad
# The gradient d(z)/dw is computed and stored
print(w.grad.shape)
# Output: torch.Size([5, 3])
When z.backward() is called, autograd works backward from z. Since z is a scalar, backward() implicitly uses a starting gradient of 1.0. This signifies $\frac{\partial z}{\partial z} = 1$. If you call backward() on a non-scalar tensor t, you must provide a gradient argument, which should be a tensor of the same shape as t. This input represents the gradient of some final scalar loss $L$ with respect to the tensor t, i.e., $\frac{\partial L}{\partial t}$.
Essentially, autograd computes vector-Jacobian products (VJPs). Recall the chain rule for derivatives. If we have a scalar loss $L$ which is a function of a vector $y$, $L = g(y)$, and $y$ is itself a function of another vector $x$, $y = f(x)$, then the gradient of $L$ with respect to $x$ is given by:

$$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \frac{\partial y}{\partial x}$$

Here, $\frac{\partial y}{\partial x}$ is the Jacobian matrix of $f$ with respect to $x$, and $\frac{\partial L}{\partial y}$ is a row vector representing the gradient of $L$ with respect to $y$. The VJP computed by y.backward(gradient=dL_dy) is precisely the product $v^T J$, where $v$ is the upstream gradient (represented by gradient=dL_dy) and $J$ is the Jacobian $\frac{\partial y}{\partial x}$. Calling z.backward() on a scalar loss $z$ corresponds to using an initial gradient vector $v = [1.0]$. This VJP approach is computationally efficient compared to explicitly forming the potentially massive Jacobian matrix.
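To make the vector-Jacobian product concrete, here is a small sketch (not part of the original example; the function f and the values of v are illustrative assumptions) comparing the gradient autograd produces via backward(gradient=v) with one computed from an explicitly formed Jacobian using torch.autograd.functional.jacobian:

import torch
from torch.autograd.functional import jacobian

x = torch.randn(4, requires_grad=True)

def f(t):
    # Elementwise function, so the Jacobian dy/dx is diagonal
    return t ** 2 + 3 * t

y = f(x)
v = torch.tensor([1.0, 0.5, -1.0, 2.0])   # upstream gradient dL/dy

# VJP via autograd: populates x.grad with v^T J
y.backward(gradient=v)
vjp_autograd = x.grad.clone()

# Explicit Jacobian (only feasible for small problems)
J = jacobian(f, x.detach())               # shape (4, 4)
vjp_explicit = v @ J                      # v^T J

print(torch.allclose(vjp_autograd, vjp_explicit))
# Output: True

The explicit Jacobian here is only 4x4, but for a layer with millions of parameters it would be far too large to materialize, which is why autograd works with VJPs directly.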
grad_fn
Every tensor that results from an operation involving at least one tensor with requires_grad=True will have a grad_fn attribute. This attribute is a reference to the function object (like AddBackward0, MulBackward0, MmBackward0, etc.) that created the tensor during the forward pass. Crucially, this function object stores references to its inputs and contains the implementation of its corresponding backward operation needed for gradient calculation.
Let's inspect the grad_fn attributes in our previous example:
# We need intermediate tensors to inspect their grad_fn
w = torch.randn(5, 3, requires_grad=True)
x = torch.randn(1, 5)
y = x @ w
z = y.mean()
print(f"y originated from: {y.grad_fn}")
# Output: y originated from: <MmBackward0 object at 0x...>
print(f"z originated from: {z.grad_fn}")
# Output: z originated from: <MeanBackward0 object at 0x...>
# Leaf tensors created by the user don't have grad_fn
print(f"w.grad_fn: {w.grad_fn}")
# Output: w.grad_fn: None
print(f"x.grad_fn: {x.grad_fn}")
# Output: x.grad_fn: None
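Continuing the same example, you can also follow these backward nodes directly: each grad_fn exposes a next_functions attribute referencing the backward nodes of its inputs. The sketch below is illustrative; the exact object addresses will differ on your machine.

# Each grad_fn links to the backward nodes of its inputs via next_functions
print(z.grad_fn.next_functions)
# Output: ((<MmBackward0 object at 0x...>, 0),)

print(z.grad_fn.next_functions[0][0].next_functions)
# Output: ((None, 0), (<AccumulateGrad object at 0x...>, 0))
# None corresponds to x (gradient tracking is off for it); AccumulateGrad is
# the node that deposits the computed gradient into w.grad for the leaf tensor w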
The grad_fn objects form a directed acyclic graph (DAG) tracing the history of computations. When z.backward() is executed:

1. The autograd engine begins at the target tensor z.
2. It looks at z.grad_fn (which is MeanBackward0). Using the incoming gradient (implicitly 1.0), MeanBackward0 computes the gradient of the mean operation with respect to its input y. Let's call this $\frac{\partial z}{\partial y}$.
3. The engine then moves to y. It uses y.grad_fn (MmBackward0) and the incoming gradient $\frac{\partial z}{\partial y}$ to compute the gradients of the matrix multiplication with respect to its inputs, x and w. This involves calculating $\frac{\partial z}{\partial y} \frac{\partial y}{\partial w}$ and $\frac{\partial z}{\partial y} \frac{\partial y}{\partial x}$.
4. Since w is a leaf tensor (created by the user) and has requires_grad=True, the computed gradient $\frac{\partial z}{\partial w}$ is accumulated into the w.grad attribute.
5. Since x has requires_grad=False, gradient computation along this path stops, and x.grad remains None.

This recursive process continues, applying the chain rule backward through the graph, until all paths leading back to leaf tensors that require gradients have been processed.
Diagram of the backward pass initiated by z.backward(). Autograd follows grad_fn pointers (represented by arrows from tensors to function nodes) backward from the output z. The computed gradient with respect to w is accumulated in w.grad. Computation stops for paths involving tensors with requires_grad=False like x.
A significant aspect of autograd's behavior is that gradients are accumulated into the .grad attribute of leaf tensors. Each time backward() is called, the newly computed gradients for a parameter are added to the value currently stored in its .grad attribute. If .grad is initially None, it is initialized with the first computed gradient.
This design choice requires explicit management during typical training loops. Before calculating the loss and performing the backward pass for a new batch of data, you must reset the gradients for all model parameters. Otherwise, the gradients from the current batch would be added to the gradients from the previous batch, leading to incorrect weight updates. The standard way to do this is using the optimizer's zero_grad() method:
# Assume model, optimizer, criterion, dataloader are defined
for inputs, targets in dataloader:
    # 1. Reset gradients from the previous iteration
    optimizer.zero_grad()

    # 2. Perform forward pass
    outputs = model(inputs)
    loss = criterion(outputs, targets)

    # 3. Perform backward pass to compute gradients
    loss.backward()

    # 4. Update model parameters using computed gradients
    optimizer.step()
The accumulation behavior can be intentionally used in scenarios like implementing gradient accumulation to simulate larger batch sizes when GPU memory is limited. In such cases, you perform multiple forward and backward passes while accumulating gradients before calling optimizer.step() and optimizer.zero_grad().
PyTorch offers fine-grained control over the autograd engine's operation:
requires_grad Flag: This fundamental tensor attribute dictates whether operations involving the tensor should be tracked for gradient computation. Tensors created directly usually default to requires_grad=False. Parameters within torch.nn.Module are automatically set to requires_grad=True. You can manually change this flag using my_tensor.requires_grad_(True) (in-place).
torch.no_grad() Context Manager: This is a widely used tool to disable gradient calculation within a specific block of code. Any operation performed inside a with torch.no_grad(): block will not be tracked by autograd, even if input tensors have requires_grad=True. This significantly reduces memory consumption and speeds up computations, making it ideal for model evaluation, inference, or any code section where gradients are unnecessary.
with torch.no_grad():
    # Operations here won't be tracked
    predictions = model(validation_data)
    # Memory usage is lower, computation is faster
torch.enable_grad() Context Manager: Conversely, this context manager re-enables gradient tracking within its scope. It is useful if you need to compute gradients for a small part of code that happens to be inside a larger torch.no_grad() block.
.detach() Method: Calling tensor.detach() creates a new tensor that shares the same underlying data storage as the original tensor but is explicitly detached from the computational graph history. The new tensor will have requires_grad=False, and gradients will not flow back through this detached tensor to the original graph. This is useful when you need to use a tensor's value without affecting gradient calculations related to its history.
A solid grasp of these autograd mechanics is invaluable for debugging training issues (like exploding or vanishing gradients), optimizing memory usage, understanding the behavior of complex models, and implementing custom operations or training loops effectively. While autograd handles the complexities of differentiation automatically, knowing how it works empowers you to use PyTorch more proficiently.