Calculating the gradient of the loss function with respect to every single weight in a deep, multi-layered neural network might seem like a daunting task. If we tried to derive the derivative for each weight independently, the expressions would quickly become incredibly complex and unwieldy. Fortunately, we can approach this systematically using Computational Graphs.
A computational graph is a way to represent any mathematical expression, like the ones involved in a neural network's forward pass, as a directed graph. In this graph, nodes represent input values, intermediate values, and operations, while edges show how the output of one step flows into the next.
Think of it as breaking down a complex calculation into a sequence of simple, elementary steps.
Let's consider a simple example unrelated to neural networks first. Suppose we have the expression f(x,y,z)=(x+y)∗z. We can visualize this as a graph:
A computational graph for the expression f=(x+y)∗z. Circles represent variables/values, squares represent operations. Arrows show the flow of computation.
Here, x, y, and z are input nodes. The + node takes x and y as input and outputs an intermediate value q = x+y. The * node takes q and z as input and produces the final output f.
Computational graphs clarify two distinct phases:
Forward Pass: We start with the input values and proceed through the graph, following the direction of the edges and performing the operation at each node. This computes the value of the expression step by step, ultimately yielding the final output value. For our example, if x=2, y=3, z=4, the forward pass computes q=2+3=5 and then f=5∗4=20 (see the short sketch after this list).
Backward Pass: This is where the magic happens for gradient calculation. Starting from the final output node, we move backward through the graph (against the direction of the edges). At each node, we calculate the gradient of the final output with respect to the node's output value, using the gradients calculated from the nodes that consumed its output. This process heavily relies on the chain rule from calculus.
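To see the forward phase concretely, here is a minimal sketch in plain Python for f = (x+y)∗z with x=2, y=3, z=4; the variable names simply mirror the graph and are not taken from any library.

```python
# Forward pass for f(x, y, z) = (x + y) * z, evaluated one node at a time.
x, y, z = 2.0, 3.0, 4.0

q = x + y   # "+" node: intermediate value, q = 5.0
f = q * z   # "*" node: final output,       f = 20.0

# q is kept around because the backward pass will need it.
print(q, f)  # 5.0 20.0
```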
Let's trace the backward pass for our simple example, f = q∗z, where q = x+y. We want to find ∂f/∂x, ∂f/∂y, and ∂f/∂z.
* node: This node computes f = q∗z. We need the gradients with respect to its inputs, q and z. The local gradients are ∂f/∂q = z and ∂f/∂z = q.
q node: This node receives the gradient ∂f/∂q = z from the * node. Since q was produced by the + node, this gradient is passed backward to it.
+ node: This node computes q = x+y. We need the local gradients: ∂q/∂x = 1 and ∂q/∂y = 1. Applying the chain rule, ∂f/∂x = (∂f/∂q)·(∂q/∂x) = z·1 = z, and likewise ∂f/∂y = z·1 = z.
So, we have systematically found: ∂f/∂x = z, ∂f/∂y = z, and ∂f/∂z = q = x+y. Notice how the chain rule allows us to break down the overall gradient calculation into local gradients at each operation node.
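The same bookkeeping can be written out in code. The sketch below is a hand-rolled backward pass for this one expression only, not a general algorithm; gradient names such as df_dx are purely illustrative.

```python
# Forward pass (cache the intermediate value q for reuse).
x, y, z = 2.0, 3.0, 4.0
q = x + y                    # "+" node
f = q * z                    # "*" node

# Backward pass: walk the graph in reverse, applying the chain rule.
df_df = 1.0                  # gradient of the output with respect to itself
df_dq = df_df * z            # "*" node local gradient: d(q*z)/dq = z
df_dz = df_df * q            # "*" node local gradient: d(q*z)/dz = q
df_dx = df_dq * 1.0          # "+" node local gradient: d(x+y)/dx = 1
df_dy = df_dq * 1.0          # "+" node local gradient: d(x+y)/dy = 1

print(df_dx, df_dy, df_dz)   # 4.0 4.0 5.0  (i.e. z, z, q)
```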
Now, imagine a neural network. The entire process of calculating the output (the forward pass) can be represented as a potentially very large computational graph. Each neuron's weighted sum, each activation function application, and the final loss calculation are all nodes in this graph.
A computational graph representing a single neuron's computation: a=σ(w1x1+w2x2+b). Input variables (x1,x2), parameters (w1,w2,b), intermediate values (p1,p2,s,z), and the final output (a) are shown along with the operations (*, +, σ).
The backpropagation algorithm, which we will detail next, is essentially the process of performing the backward pass on this neural network computational graph. It starts from the final loss value and efficiently computes the gradient of the loss with respect to every single parameter (weights and biases) in the network by applying the chain rule recursively through the graph structure.
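As a small illustration of this idea, the sketch below runs a forward and backward pass through the single-neuron graph from the figure, using its naming (p1, p2, s, z, a). The squared-error loss against a target t, as well as the input and parameter values, are assumptions made here only so there is a concrete quantity to differentiate.

```python
import math

# Forward pass through the single-neuron graph:
# p1 = w1*x1, p2 = w2*x2, s = p1 + p2, z = s + b, a = sigmoid(z).
x1, x2 = 1.0, -2.0            # example inputs (illustrative values)
w1, w2, b = 0.5, -0.3, 0.1    # example parameters (illustrative values)
t = 1.0                       # example target for an assumed squared-error loss

p1 = w1 * x1
p2 = w2 * x2
s = p1 + p2
z = s + b
a = 1.0 / (1.0 + math.exp(-z))   # sigmoid activation
loss = (a - t) ** 2              # assumed loss; not part of the figure

# Backward pass: apply the chain rule node by node, from the loss
# back to the parameters.
dloss_da = 2.0 * (a - t)
da_dz = a * (1.0 - a)            # derivative of the sigmoid
dloss_dz = dloss_da * da_dz
dloss_db = dloss_dz * 1.0        # z = s + b
dloss_ds = dloss_dz * 1.0
dloss_dp1 = dloss_ds * 1.0       # s = p1 + p2
dloss_dp2 = dloss_ds * 1.0
dloss_dw1 = dloss_dp1 * x1       # p1 = w1 * x1
dloss_dw2 = dloss_dp2 * x2       # p2 = w2 * x2

print(dloss_dw1, dloss_dw2, dloss_db)
```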
Modern deep learning frameworks like PyTorch and TensorFlow rely heavily on this concept. They automatically build a computational graph (often dynamically during the forward pass) and then perform the backward pass to compute gradients. This process is often called automatic differentiation or autodiff, and it frees us from having to manually derive and implement these complex gradient calculations. Understanding computational graphs provides insight into how these powerful tools work under the hood.
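For example, the (x+y)∗z expression from earlier can be handed to PyTorch's autograd in a few lines: the framework records the graph during the forward pass, and backward() fills in each leaf tensor's .grad field. This is a minimal sketch assuming PyTorch is installed.

```python
import torch

# Leaf tensors with requires_grad=True become the graph's input nodes.
x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)
z = torch.tensor(4.0, requires_grad=True)

q = x + y      # "+" node, recorded in the graph during the forward pass
f = q * z      # "*" node

f.backward()   # backward pass: autodiff applies the chain rule through the graph

print(x.grad, y.grad, z.grad)   # tensor(4.) tensor(4.) tensor(5.)
```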