Training machine learning models, especially deep neural networks, often boils down to an optimization problem: finding model parameters that minimize a loss function. The most common way to solve such optimization problems is through gradient-based methods, like gradient descent. These methods require the derivative (or gradient, in the case of multiple parameters) of the loss function with respect to the model parameters. Automatic Differentiation (AD) is a powerful technique that provides an efficient and accurate way to compute these derivatives.
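To see why derivatives matter in the first place, here is a toy sketch of gradient descent on a single made-up parameter. The loss, its hand-written derivative, the starting point, and the learning rate are all invented for illustration; in real models, AD supplies the derivative automatically.

# Toy illustration: gradient descent on a one-parameter "loss".
function toy_gradient_descent()
    loss(θ)  = (θ - 3.0)^2      # made-up loss with its minimum at θ = 3
    dloss(θ) = 2 * (θ - 3.0)    # its derivative, written by hand for this sketch
    θ, η = 0.0, 0.1             # initial parameter and learning rate
    for _ in 1:50
        θ -= η * dloss(θ)       # gradient descent update: θ ← θ - η * dL/dθ
    end
    return θ
end

println(toy_gradient_descent())  # ≈ 3.0, the minimizer of the toy loss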
AD is not the only way to get derivatives. You might be familiar with:
- Numerical differentiation, which approximates derivatives with finite differences. It is simple to implement, but it introduces truncation and rounding error and requires one extra function evaluation per input variable.
- Symbolic differentiation, which manipulates mathematical expressions to produce an exact derivative formula. It is precise, but the resulting expressions can grow very large ("expression swell"), and it does not handle arbitrary program constructs such as loops and branches well.
Automatic Differentiation, on the other hand, computes derivatives of a function specified as a computer program. It does so by breaking down the computation into a sequence of elementary arithmetic operations (addition, multiplication, etc.) and elementary functions (sin, cos, exp, log, etc.). AD then applies the chain rule of calculus repeatedly to these operations, accumulating the derivative. The main advantage is that AD computes derivatives to machine precision, just like symbolic differentiation, but it tends to be more computationally efficient for complex functions and a large number of variables, especially when using its "reverse mode."
Fundamentally, AD views any function computed by a program as a composition of elementary operations. These operations can be visualized as a computational graph, where nodes represent intermediate variables or operations, and edges represent data flow.
Consider a simple function: f(x) = x^3 + 2x. We can break this down into a sequence of operations:
- v1 = x (the input)
- v2 = v1^3
- v3 = 2 * v1
- y = v2 + v3 (the output, equal to f(x))
This sequence can be represented by the following graph:
A computational graph for f(x) = x^3 + 2x. The input node x feeds into intermediate operations, ultimately producing the output y.
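The same decomposition can be written out explicitly in Julia. This is only a sketch of the idea; each intermediate variable below corresponds to one node of the graph above.

# Evaluating f(x) = x^3 + 2x as an explicit sequence of elementary operations.
x  = 3.0
v1 = x          # input node
v2 = v1^3       # elementary operation: cube
v3 = 2 * v1     # elementary operation: multiply by a constant
y  = v2 + v3    # output node, y = f(x)
println(y)      # 33.0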
AD traverses this graph, applying the chain rule at each step. There are two primary modes for doing this: forward mode and reverse mode.
In forward mode, AD computes derivatives by propagating them forward through the computational graph, alongside the evaluation of the function itself. For each elementary operation, it computes both the value and its derivative with respect to an input variable.
If we want to compute df/dx, forward mode would track the derivative of each intermediate variable v with respect to x, denoted dv/dx. The rules follow directly from basic calculus:
- Seed: dx/dx = 1 for the input we are differentiating with respect to (and 0 for any other input).
- Sum rule: if v = a + b, then dv/dx = da/dx + db/dx.
- Product rule: if v = a * b, then dv/dx = (da/dx) * b + a * (db/dx).
- Elementary functions: if v = g(a), then dv/dx = g'(a) * (da/dx); for example, v = sin(a) gives dv/dx = cos(a) * (da/dx).
Forward mode is efficient if you have few input variables and many output variables, or if you only need the derivative with respect to one input variable at a time. To get the full gradient of a function with n inputs, you would typically need to run the forward pass n times, each time seeding the derivative of a different input variable to 1 and the others to 0.
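One way to see both the propagation rules and the seeding in action is a minimal dual number sketch. This is only an illustration of the idea, not how ForwardDiff.jl is implemented; the Dual type, its operator overloads, and the two-input function g are defined here purely for this example.

# A minimal forward-mode AD sketch using a hand-rolled dual number.
# Each Dual carries a value and its derivative with respect to the seeded input.
struct Dual
    val::Float64   # value of the intermediate variable
    der::Float64   # derivative of that variable with respect to the seeded input
end

# Propagation rules from above, implemented by overloading elementary operations.
Base.:+(a::Dual, b::Dual) = Dual(a.val + b.val, a.der + b.der)                   # sum rule
Base.:*(a::Dual, b::Dual) = Dual(a.val * b.val, a.der * b.val + a.val * b.der)   # product rule
Base.:*(c::Real, a::Dual) = Dual(c * a.val, c * a.der)                           # constant * variable
Base.:^(a::Dual, n::Integer) = Dual(a.val^n, n * a.val^(n - 1) * a.der)          # power rule

f(x) = x^3 + 2x

# Differentiate with respect to x by seeding its derivative with 1.
println(f(Dual(3.0, 1.0)))          # Dual(33.0, 29.0): f(3) = 33, f'(3) = 29

# For several inputs, one seeded pass per input gives one partial derivative each.
g(x1, x2) = x1 * x2 + x1^2          # hypothetical two-input function
println(g(Dual(2.0, 1.0), Dual(5.0, 0.0)).der)   # ∂g/∂x1 = x2 + 2*x1 = 9.0
println(g(Dual(2.0, 0.0), Dual(5.0, 1.0)).der)   # ∂g/∂x2 = x1 = 2.0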
Reverse mode AD, often called backpropagation in the context of neural networks, computes derivatives by propagating them backward from the final output through the graph. It consists of two phases:
- A forward pass, which evaluates the function and records the intermediate values (and the structure of the computation).
- A backward pass, which starts at the output and propagates derivatives (often called adjoints) back toward the inputs, applying the chain rule at each node.
Reverse mode is exceptionally efficient for functions with many input variables (like the millions of parameters in a deep neural network) and a single scalar output (like a loss function). It allows computing the entire gradient (derivatives with respect to all inputs) at roughly the same computational cost as a few evaluations of the original function, regardless of the number of inputs. This is why it's the foundation for training most deep learning models.
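To make the two phases tangible, here is a hand-written sketch of reverse mode for a small, made-up two-input function. The function, the input values, and the adjoint variable names are chosen only for this example; real systems record and replay this structure automatically.

# Reverse-mode sketch for g(x1, x2) = x1 * x2 + sin(x1), worked out by hand.
x1, x2 = 2.0, 5.0

# Phase 1: forward pass, evaluating and storing every intermediate value.
v1 = x1 * x2
v2 = sin(x1)
y  = v1 + v2

# Phase 2: backward pass, propagating adjoints (∂y/∂·) from the output backward.
y_adj  = 1.0                              # ∂y/∂y
v1_adj = y_adj * 1.0                      # y = v1 + v2  ⇒  ∂y/∂v1 = 1
v2_adj = y_adj * 1.0                      # ∂y/∂v2 = 1
x1_adj = v1_adj * x2 + v2_adj * cos(x1)   # x1 is used twice, so contributions add
x2_adj = v1_adj * x1                      # v1 = x1 * x2  ⇒  ∂v1/∂x2 = x1

println((x1_adj, x2_adj))                 # (5 + cos(2), 2) ≈ (4.58, 2.0)

Note that a single backward pass produced both partial derivatives, which is exactly the efficiency property described above.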
Julia's language features, such as its dynamic type system, multiple dispatch, and metaprogramming capabilities, make it an excellent platform for developing powerful and flexible AD systems. The compiler's ability to specialize code based on types allows AD tools to often achieve performance comparable to hand-written derivative code.
Several packages provide AD capabilities in Julia. Some prominent ones include:
- ForwardDiff.jl: forward-mode AD based on dual numbers, well suited to functions with a small number of inputs.
- ReverseDiff.jl: tape-based reverse-mode AD.
- Zygote.jl: source-to-source reverse-mode AD, and the AD engine used by Flux.jl.
- Enzyme.jl: AD performed at the LLVM level, capable of differentiating low-level and mutating code.
You don't need to implement AD yourself. These packages allow you to obtain gradients of your Julia functions, often with minimal changes to your existing code. For example, to get the derivative of f(x) = x^3 + 2x at x = 3.0 using ForwardDiff.jl, you might write:
# This is a brief example. Package installation and detailed usage
# will be covered in "Setting Up Your Julia Deep Learning Environment"
# and subsequent chapters.
# import Pkg; Pkg.add("ForwardDiff") # If not already installed
using ForwardDiff
f(x) = x^3 + 2x
x_val = 3.0
# Calculate the derivative of f at x_val
df_dx = ForwardDiff.derivative(f, x_val)
println("Function: f(x) = x^3 + 2x")
println("Value of x: $x_val")
println("Derivative f'(x) at x = $x_val: $df_dx") # Expected: 3*x^2 + 2 = 3*(3^2) + 2 = 27 + 2 = 29
This simple example demonstrates how readily AD tools can be applied. In the context of deep learning, libraries like Flux.jl integrate AD (primarily through Zygote.jl) so that gradients required for training neural networks are computed automatically behind the scenes. This allows you to focus on defining your model architecture and training process, while the AD system handles the complex calculus.
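As a brief illustration (assuming Zygote.jl is installed), the same derivative computed earlier with ForwardDiff.jl can also be obtained with Zygote's reverse mode:

# import Pkg; Pkg.add("Zygote")   # if not already installed
using Zygote

f(x) = x^3 + 2x
# Zygote.gradient returns a tuple with one entry per argument of f.
println(Zygote.gradient(f, 3.0))   # (29.0,)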
Understanding the principles of automatic differentiation is valuable as you progress in machine learning, as it helps in debugging, performance optimization, and even in designing custom model components when needed. As we move into building neural networks, you'll see AD in action, quietly and efficiently powering the learning process.