As mentioned in the chapter introduction, training neural networks involves iteratively adjusting model parameters (weights and biases) to minimize a loss function. This adjustment relies on knowing how a small change in each parameter affects the final loss value. Mathematically, this sensitivity is captured by the gradient of the loss function with respect to each parameter. For a loss L and a parameter w, we need to compute ∂L/∂w.
Calculating these gradients manually using calculus rules is feasible for very simple models, but it quickly becomes incredibly complex and error-prone for the deep, multi-layered networks common today. Imagine deriving derivatives for a model with millions of parameters! This is where Automatic Differentiation (AD) comes into play.
AD is a set of techniques to numerically evaluate the derivative of a function specified by a computer program. Unlike symbolic differentiation (which manipulates mathematical expressions, often leading to complex and inefficient formulas) or numerical differentiation (which approximates derivatives using finite differences, potentially suffering from truncation and round-off errors), AD calculates exact gradients efficiently by systematically applying the chain rule of calculus at the level of elementary operations (addition, multiplication, trigonometric functions, etc.) that make up the overall computation.
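To make the contrast concrete, the short sketch below compares a central finite-difference approximation against the exact gradient computed by PyTorch's autograd. The function f and the step size h are illustrative choices, not taken from the text.

```python
import torch

# An example differentiable function (illustrative choice): f(x) = x^2 * sin(x)
def f(x):
    return x**2 * torch.sin(x)

x = torch.tensor(1.5, requires_grad=True)

# Exact gradient via automatic differentiation.
y = f(x)
y.backward()
exact = x.grad.item()

# Numerical differentiation: central finite difference with step h.
h = 1e-4
with torch.no_grad():
    approx = ((f(x + h) - f(x - h)) / (2 * h)).item()

print(f"autograd gradient:        {exact:.8f}")
print(f"finite-difference approx: {approx:.8f}")
```

The two values agree to several decimal places, but the finite-difference result depends on the choice of h and accumulates round-off error, while autograd returns the derivative exactly (up to floating-point precision) at roughly the cost of one extra pass through the computation.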
At its heart, AD relies on the chain rule. If you have a sequence of functions, say y=f(x) and z=g(y), the chain rule tells us how to find the derivative of the composite function z=g(f(x)) with respect to x:
dz/dx = dz/dy ⋅ dy/dx

AD breaks down complex computations into a sequence of these elementary operations. It then computes the local derivatives for each small step and combines them using the chain rule to get the overall gradient.
Consider a simple example: L = (w⋅x+b)². Let y = w⋅x+b. Then L = y². To find ∂L/∂w, the chain rule gives:
∂L/∂w = ∂L/∂y ⋅ ∂y/∂w

We know ∂L/∂y = 2y and ∂y/∂w = x. Substituting y back, we get:
∂L/∂w = 2(w⋅x+b)⋅x

AD automates this process for potentially very long chains of operations.
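As a quick sanity check, here is a minimal sketch comparing this hand-derived gradient with the one PyTorch's autograd produces; the numerical values of w, x, and b are arbitrary and chosen only for illustration.

```python
import torch

# Arbitrary example values (illustration only).
w = torch.tensor(2.0, requires_grad=True)
x = torch.tensor(3.0)
b = torch.tensor(1.0)

# Forward pass: L = (w*x + b)^2
y = w * x + b
L = y**2

# Backward pass: autograd applies the chain rule for us.
L.backward()

# Hand-derived result: dL/dw = 2*(w*x + b)*x
manual = 2 * (w * x + b) * x

print(w.grad)           # gradient from autograd -> tensor(42.)
print(manual.detach())  # hand-derived gradient  -> tensor(42.)
```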
There are two main ways to apply the chain rule in AD:

- Forward mode propagates derivatives alongside the forward computation, from inputs toward outputs.
- Reverse mode first runs and records the forward computation, then propagates derivatives backward from the output toward the inputs. It is particularly efficient for functions with many inputs and a single scalar output, which is exactly the shape of neural network training: millions of parameters feeding into one loss value.
PyTorch's Autograd system uses reverse-mode automatic differentiation. When you perform operations on PyTorch tensors that have their `requires_grad` attribute set to `True`, PyTorch builds a directed acyclic graph (DAG) behind the scenes. This graph, often called the computation graph, records the sequence of operations (nodes) and the tensors (edges) involved.
Let's visualize a simple computation graph for L = (a⋅x+b)², assuming a, x, and b are input tensors (or results of prior computations) and we want ∂L/∂a, ∂L/∂x, and ∂L/∂b.
Conceptual representation of the computation graph for L = (a⋅x+b)². Solid lines show the forward pass, constructing the graph. Dashed lines indicate the conceptual flow of gradients during the backward pass, applying the chain rule.
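To see this graph being recorded in code, the small sketch below (values are arbitrary) builds the same expression and inspects the `grad_fn` attribute that each non-leaf tensor stores; `grad_fn` points to the backward function Autograd will use for that node.

```python
import torch

a = torch.tensor(2.0, requires_grad=True)
x = torch.tensor(3.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)

# Each operation adds a node to the computation graph.
y = a * x + b   # multiplication followed by addition
L = y**2        # squaring

# Non-leaf tensors remember the operation that created them.
print(y.grad_fn)  # e.g. <AddBackward0 object at ...>
print(L.grad_fn)  # e.g. <PowBackward0 object at ...>

# Leaf tensors created directly by the user have no grad_fn.
print(a.grad_fn)  # None
```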
When you call `.backward()` on the final output tensor (typically the scalar loss L), Autograd starts from that output and traverses the graph backward. At each step (node), it computes the gradients based on the gradient of the subsequent node and the local derivative of the operation performed at the current node, effectively applying the chain rule. The computed gradients with respect to each tensor that requires them (like model parameters) are then accumulated in their `.grad` attribute.
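Continuing the same example, here is a minimal sketch of the backward pass: calling `.backward()` on the scalar L fills in the `.grad` attribute of every leaf tensor that requires gradients (the printed values follow from a=2, x=3, b=1).

```python
import torch

a = torch.tensor(2.0, requires_grad=True)
x = torch.tensor(3.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)

L = (a * x + b) ** 2

# Traverse the graph backward from L, applying the chain rule at each node.
L.backward()

# Gradients are accumulated into the .grad attribute of the leaf tensors.
print(a.grad)  # dL/da = 2*(a*x + b)*x -> tensor(42.)
print(x.grad)  # dL/dx = 2*(a*x + b)*a -> tensor(28.)
print(b.grad)  # dL/db = 2*(a*x + b)   -> tensor(14.)
```

Because gradients accumulate across repeated calls to `.backward()`, training loops normally reset them before each new backward pass, for example with `optimizer.zero_grad()`.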
This mechanism allows PyTorch to automatically compute gradients for arbitrarily complex models defined by sequences of tensor operations, freeing you from the tedious and error-prone task of manual derivation. The subsequent sections will demonstrate how to practically use Autograd's features: defining tensors that require gradients, building computation graphs implicitly, triggering the backward pass, accessing gradients, and controlling gradient computation.