Building upon our understanding of dynamic computational graphs, let's examine the engine that brings them to life for gradient computation: autograd. This system is the foundation of PyTorch's ability to automatically calculate gradients, which is essential for training neural networks via backpropagation.

At its core, autograd performs reverse-mode automatic differentiation. When you perform operations on tensors where `requires_grad` is set to `True`, PyTorch constructs a graph representing the sequence of operations. This graph is built dynamically as the computations happen. The autograd engine then traverses the graph backward, starting from the final output (typically a scalar loss), to compute the gradients of that output with respect to the parameters (the leaf nodes of the graph, usually model weights and biases).

### The backward() Call

The process is typically initiated by calling the `.backward()` method on a tensor, most commonly the scalar loss value computed at the end of a forward pass.

```python
import torch

# Example setup
w = torch.randn(5, 3, requires_grad=True)
x = torch.randn(1, 5)  # plain input data: created without requires_grad, so it is not tracked

y = x @ w     # y depends on w through matrix multiplication
z = y.mean()  # z is a scalar derived from y

# Compute gradients starting from z
z.backward()

# The gradient d(z)/dw is computed and stored in w.grad
print(w.grad.shape)
# Output: torch.Size([5, 3])
```

When `z.backward()` is called, autograd works backward from `z`. Since `z` is a scalar, `backward()` implicitly uses a starting gradient of 1.0, i.e. $\frac{\partial z}{\partial z} = 1$. If you call `backward()` on a non-scalar tensor `t`, you must pass a `gradient` argument: a tensor of the same shape as `t` representing the gradient of some final scalar loss $L$ with respect to `t`, i.e. $\frac{\partial L}{\partial t}$.

Essentially, autograd computes vector-Jacobian products (VJPs). Recall the chain rule for derivatives. If we have a scalar loss $L$ which is a function of a vector $\mathbf{y}$, $L = g(\mathbf{y})$, and $\mathbf{y}$ is itself a function of another vector $\mathbf{x}$, $\mathbf{y} = f(\mathbf{x})$, then the gradient of $L$ with respect to $\mathbf{x}$ is given by:

$$ \frac{\partial L}{\partial \mathbf{x}} = \frac{\partial L}{\partial \mathbf{y}} \frac{\partial \mathbf{y}}{\partial \mathbf{x}} $$

Here, $\frac{\partial \mathbf{y}}{\partial \mathbf{x}}$ is the Jacobian matrix of $f$ with respect to $\mathbf{x}$, and $\frac{\partial L}{\partial \mathbf{y}}$ is a row vector representing the gradient of $L$ with respect to $\mathbf{y}$. The VJP computed by `y.backward(gradient=dL_dy)` is precisely the product $\mathbf{v}^T \mathbf{J}$, where $\mathbf{v}$ is the upstream gradient (supplied as `gradient=dL_dy`) and $\mathbf{J}$ is the Jacobian $\frac{\partial \mathbf{y}}{\partial \mathbf{x}}$. Calling `z.backward()` on a scalar loss $z$ corresponds to using an initial gradient $\mathbf{v} = [1.0]$. This VJP approach is computationally efficient compared to explicitly forming the potentially massive Jacobian matrix.
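As a minimal illustration of the `gradient` argument, the sketch below calls `backward()` on a non-scalar tensor. The upstream gradient `dL_dy` is simply a tensor of ones standing in for $\frac{\partial L}{\partial \mathbf{y}}$ that a downstream loss would normally provide; the shapes are chosen only for readability.

```python
import torch

x = torch.randn(4, requires_grad=True)
y = x * 3                   # non-scalar output; its Jacobian dy/dx is 3 * I

# Stand-in for the upstream gradient dL/dy that a later loss would supply
dL_dy = torch.ones_like(y)

# Computes the vector-Jacobian product v^T J, with v = dL_dy
y.backward(gradient=dL_dy)

print(x.grad)               # tensor([3., 3., 3., 3.])
```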
### Traversing the Graph with grad_fn

Every tensor that results from an operation involving at least one tensor with `requires_grad=True` has a `grad_fn` attribute. This attribute is a reference to the function object (such as `AddBackward0`, `MulBackward0`, or `MmBackward0`) that created the tensor during the forward pass. Crucially, this function object stores references to its inputs and contains the implementation of the corresponding backward operation needed for gradient calculation.

Let's inspect the `grad_fn` attributes in our previous example:

```python
# We need intermediate tensors to inspect their grad_fn
w = torch.randn(5, 3, requires_grad=True)
x = torch.randn(1, 5)
y = x @ w
z = y.mean()

print(f"y originated from: {y.grad_fn}")
# Output: y originated from: <MmBackward0 object at 0x...>
print(f"z originated from: {z.grad_fn}")
# Output: z originated from: <MeanBackward0 object at 0x...>

# Leaf tensors created by the user don't have a grad_fn
print(f"w.grad_fn: {w.grad_fn}")
# Output: w.grad_fn: None
print(f"x.grad_fn: {x.grad_fn}")
# Output: x.grad_fn: None
```

The `grad_fn` objects form a directed acyclic graph (DAG) tracing the history of computations. When `z.backward()` is executed:

1. The autograd engine begins at the target tensor `z`.
2. It accesses `z.grad_fn` (which is `MeanBackward0`). Using the incoming gradient (implicitly 1.0), `MeanBackward0` computes the gradient of the mean operation with respect to its input `y`. Let's call this $\frac{\partial z}{\partial y}$.
3. The engine then moves to the next node in the graph, indicated by `y`. It uses `y.grad_fn` (`MmBackward0`) and the incoming gradient $\frac{\partial z}{\partial y}$ to compute the gradients of the matrix multiplication with respect to its inputs `x` and `w`. This involves calculating $\frac{\partial z}{\partial y} \frac{\partial y}{\partial w}$ and $\frac{\partial z}{\partial y} \frac{\partial y}{\partial x}$.
4. Since `w` is a leaf tensor (created by the user) with `requires_grad=True`, the computed gradient $\frac{\partial z}{\partial w}$ is accumulated into the `w.grad` attribute.
5. Since `x` has `requires_grad=False`, gradient computation along this path stops, and `x.grad` remains `None`.

This recursive process continues, applying the chain rule backward through the graph, until all paths leading back to leaf tensors that require gradients have been processed.
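If you want to see this chain of `grad_fn` nodes for yourself, a small exploratory sketch like the one below can walk the graph through each node's `next_functions` attribute (an internal attribute of autograd graph nodes). The `walk` helper is purely illustrative and not something a training loop needs.

```python
import torch

w = torch.randn(5, 3, requires_grad=True)
x = torch.randn(1, 5)
z = (x @ w).mean()

def walk(fn, depth=0):
    """Depth-first print of the backward graph rooted at fn."""
    if fn is None:          # inputs that don't require grad appear as None
        return
    print("  " * depth + type(fn).__name__)
    for next_fn, _ in fn.next_functions:
        walk(next_fn, depth + 1)

walk(z.grad_fn)
# Typical output (AccumulateGrad marks the leaf tensor w):
# MeanBackward0
#   MmBackward0
#     AccumulateGrad
```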
```dot
digraph G {
    rankdir=LR;
    node [shape=box, style="rounded,filled", fillcolor="#e9ecef", fontname="Helvetica"];
    edge [fontname="Helvetica", color="#495057", arrowsize=0.8];

    // Nodes representing tensors
    z [label="z (Scalar)", fillcolor="#ffc9c9", tooltip="Result of mean(y)"];
    y [label="y", fillcolor="#bac8ff", tooltip="Result of x @ w"];
    w [label="w (Leaf)\nrequires_grad=True", fillcolor="#b2f2bb", shape=cylinder, margin=0.15];
    x [label="x (Leaf)\nrequires_grad=False", fillcolor="#ced4da", shape=cylinder, margin=0.15];

    // Nodes representing grad_fn objects
    MeanBackward [label="MeanBackward0", shape=ellipse, style=filled, fillcolor="#a5d8ff", tooltip="Gradient function for mean"];
    MatMulBackward [label="MmBackward0", shape=ellipse, style=filled, fillcolor="#a5d8ff", tooltip="Gradient function for matmul"];
    AccumulateGrad [label="AccumulateGrad", shape=diamond, style=filled, fillcolor="#ffec99", tooltip="Stores gradient in .grad"];

    // Forward pass relationships (greyed out)
    subgraph cluster_forward {
        label = "Forward Pass";
        style = "dashed";
        color = "#adb5bd";
        fontcolor = "#adb5bd";
        w -> y [label="@", style=dashed, color="#adb5bd", fontcolor="#adb5bd"];
        x -> y [label="@", style=dashed, color="#adb5bd", fontcolor="#adb5bd"];
        y -> z [label="mean()", style=dashed, color="#adb5bd", fontcolor="#adb5bd"];
    }

    // Backward pass flow (solid arrows)
    z -> MeanBackward [label=" grad_z=1.0", penwidth=1.5];
    MeanBackward -> y [label=" grad_y", penwidth=1.5, tooltip="Gradient w.r.t. y"];
    y -> MatMulBackward [label=" grad_y", penwidth=1.5];
    MatMulBackward -> w [label=" grad_w", penwidth=1.5, style=dashed, tooltip="Gradient w.r.t. w"];
    MatMulBackward -> x [label=" grad_x (ignored)", style=dotted, color="#adb5bd", tooltip="Gradient computation stops here"];
    w -> AccumulateGrad [label=" grad_w", style=dashed, penwidth=1.5];
    AccumulateGrad -> w [label=" store in w.grad", color="#f76707", fontcolor="#f76707", penwidth=1.5, constraint=false, tooltip="Gradient is summed into w.grad"];

    // Link tensors to their grad_fn (invisible edges for layout)
    z -> MeanBackward [style=invis];
    y -> MatMulBackward [style=invis];
    {rank=same; y; MatMulBackward;}
    {rank=same; z; MeanBackward;}
}
```

Diagram of the backward pass initiated by `z.backward()`. Autograd follows `grad_fn` pointers (arrows from tensors to function nodes) backward from the output `z`. The computed gradient with respect to `w` is accumulated in `w.grad`. Computation stops along paths involving tensors with `requires_grad=False`, such as `x`.

### Gradient Accumulation

A significant aspect of autograd's behavior is that gradients are accumulated into the `.grad` attribute of leaf tensors. Each time `backward()` is called, the newly computed gradients for a parameter are added to the value currently stored in its `.grad` attribute. If `.grad` is initially `None`, it is initialized with the first computed gradient.

This design choice requires explicit management during typical training loops. Before calculating the loss and performing the backward pass for a new batch of data, you must reset the gradients of all model parameters. Otherwise, the gradients from the current batch would be added to those from the previous batch, leading to incorrect weight updates. The standard way to do this is the optimizer's `zero_grad()` method:

```python
# Assume model, optimizer, criterion, dataloader are defined
for inputs, targets in dataloader:
    # 1. Reset gradients from the previous iteration
    optimizer.zero_grad()

    # 2. Perform the forward pass
    outputs = model(inputs)
    loss = criterion(outputs, targets)

    # 3. Perform the backward pass to compute gradients
    loss.backward()

    # 4. Update model parameters using the computed gradients
    optimizer.step()
```

The accumulation behavior can also be used intentionally, for example to implement gradient accumulation that simulates larger batch sizes when GPU memory is limited. In that case, you perform multiple forward and backward passes, accumulating gradients, before calling `optimizer.step()` and `optimizer.zero_grad()`, as sketched below.
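Here is a rough sketch of that pattern, assuming the same `model`, `optimizer`, `criterion`, and `dataloader` as above. The `accumulation_steps` value is illustrative, and dividing the loss by it is one common convention for keeping the accumulated gradient an average over the effective (larger) batch.

```python
accumulation_steps = 4  # illustrative: number of mini-batches per effective batch

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(dataloader):
    outputs = model(inputs)
    loss = criterion(outputs, targets)

    # backward() adds into .grad, so gradients accumulate across mini-batches
    (loss / accumulation_steps).backward()

    if (step + 1) % accumulation_steps == 0:
        optimizer.step()       # update using gradients accumulated over several batches
        optimizer.zero_grad()  # reset before the next accumulation cycle
```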
### Controlling Gradient Computation

PyTorch offers fine-grained control over the autograd engine's operation:

- **`requires_grad` flag:** This fundamental tensor attribute dictates whether operations involving the tensor should be tracked for gradient computation. Tensors created directly usually default to `requires_grad=False`, while parameters within a `torch.nn.Module` are automatically set to `requires_grad=True`. You can change the flag in place with `my_tensor.requires_grad_(True)`.
- **`torch.no_grad()` context manager:** This widely used tool disables gradient calculation within a specific block of code. Any operation performed inside a `with torch.no_grad():` block is not tracked by autograd, even if input tensors have `requires_grad=True`. This significantly reduces memory consumption and speeds up computation, making it ideal for model evaluation, inference, or any code section where gradients are unnecessary.

  ```python
  with torch.no_grad():
      # Operations here won't be tracked by autograd
      predictions = model(validation_data)
      # Memory usage is lower and computation is faster
  ```

- **`torch.enable_grad()` context manager:** Conversely, this context manager re-enables gradient tracking within its scope. It is useful when you need to compute gradients for a small piece of code that happens to run inside a larger `torch.no_grad()` block.
- **`.detach()` method:** Calling `tensor.detach()` creates a new tensor that shares the same underlying data storage as the original but is explicitly detached from the computational graph history. The new tensor has `requires_grad=False`, and gradients will not flow back through it to the original graph. This is useful when you need to use a tensor's value without affecting gradient calculations related to its history (a short sketch at the end of this section illustrates this alongside `torch.no_grad()`).

A solid grasp of these autograd mechanics is invaluable for debugging training issues (such as exploding or vanishing gradients), optimizing memory usage, understanding the behavior of complex models, and implementing custom operations or training loops effectively. While autograd handles the complexities of differentiation automatically, knowing how it works helps you use PyTorch more proficiently.
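As a final, minimal sketch of the controls described above (the tensor names are illustrative), the following shows that both `torch.no_grad()` and `.detach()` produce tensors with no graph history, so gradients cannot flow back through them:

```python
import torch

w = torch.randn(3, requires_grad=True)

# 1. Normal tracking: gradients flow back to w
loss = (w * 2).pow(2).sum()
loss.backward()
print(w.grad is not None)          # True

w.grad = None                      # reset for the next demonstrations

# 2. Inside torch.no_grad(), nothing is tracked
with torch.no_grad():
    y = w * 2
print(y.requires_grad, y.grad_fn)  # False None

# 3. detach() shares data but drops the graph history
z = (w * 2).detach()
print(z.requires_grad, z.grad_fn)  # False None

# A loss built only from z cannot reach w:
loss2 = (z ** 2).sum()
print(loss2.requires_grad)         # False -- calling loss2.backward() would raise an error
```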