Calculus, particularly differential calculus, provides the mathematical machinery for optimizing the complex functions represented by neural networks, including large language models. As mentioned earlier, training these models involves minimizing a loss function J(θ), which measures how poorly the model performs on the training data given its current parameters θ. Gradient-based optimization methods are the standard approach for this minimization, and they rely heavily on the concepts of derivatives and gradients.
For a function of a single variable, f(x), the derivative f′(x) or df/dx measures the instantaneous rate of change of the function's output with respect to its input. It tells us how much the output changes for a tiny change in the input. Geometrically, it represents the slope of the line tangent to the function's graph at the point x.
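As a quick numerical sketch (plain Python; the function f(x) = x² and the point x = 3.0 are arbitrary choices for illustration), we can approximate a derivative with a finite difference and compare it to the known analytic slope:
# Approximate f'(x) for f(x) = x**2 at x = 3.0 with a finite difference.
# The analytic derivative is f'(x) = 2x, so we expect a value near 6.0.
def f(x):
    return x ** 2

h = 1e-6  # a tiny change in the input
x = 3.0
approx = (f(x + h) - f(x)) / h
print(approx)  # ~6.000001, close to the exact slope 6.0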
Neural network loss functions, however, depend on millions or billions of parameters (weights and biases), collectively denoted by the vector θ. Therefore, we need to understand how the loss J(θ) changes when we slightly modify one specific parameter, say θi, while keeping all other parameters constant. This is precisely what a partial derivative measures, denoted as ∂J/∂θi.
Consider a simple function f(x, y) = x²y. The partial derivative with respect to x treats y as a constant: ∂f/∂x = 2xy. The partial derivative with respect to y treats x as a constant: ∂f/∂y = x².
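A small sketch (the evaluation point (2.0, 3.0) is an arbitrary choice) that checks both partial derivatives numerically, varying one input at a time while holding the other fixed:
# f(x, y) = x**2 * y; check both partial derivatives at (2.0, 3.0).
def f(x, y):
    return x ** 2 * y

h = 1e-6
x, y = 2.0, 3.0
df_dx = (f(x + h, y) - f(x, y)) / h  # expect 2*x*y = 12.0
df_dy = (f(x, y + h) - f(x, y)) / h  # expect x**2 = 4.0
print(df_dx, df_dy)  # ~12.0 ~4.0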
The gradient of a multivariate function, such as our loss function J(θ), is the vector containing all its partial derivatives. It is denoted by ∇J(θ) or ∇θJ(θ). If θ=(θ1,θ2,...,θn), then:
∇J(θ) = (∂J/∂θ1, ∂J/∂θ2, ..., ∂J/∂θn)
The gradient vector ∇J(θ) has a fundamentally important property: it points in the direction of the steepest ascent of the function J at the point θ. Conversely, the negative gradient, −∇J(θ), points in the direction of the steepest descent. This is the core insight behind gradient descent optimization. To minimize the loss, we want to move our parameters θ in the direction opposite to the gradient.
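To make this direction property concrete, here is a sketch using the simple bowl-shaped function J(θ) = θ1² + θ2², an illustrative stand-in for a real loss. A step against the gradient lowers J; a step along it raises J:
# J(theta) = theta1**2 + theta2**2 has gradient (2*theta1, 2*theta2).
def J(t1, t2):
    return t1 ** 2 + t2 ** 2

t1, t2 = 3.0, 4.0
g1, g2 = 2 * t1, 2 * t2  # gradient at (3, 4) is (6, 8)
step = 0.1
print(J(t1, t2))                          # 25.0 at the starting point
print(J(t1 - step * g1, t2 - step * g2))  # 16.0: a step against the gradient lowers J
print(J(t1 + step * g1, t2 + step * g2))  # 36.0: a step along the gradient raises J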
Neural networks are essentially complex, nested functions. A layer's output becomes the input to the next layer. For instance, predicting a word might involve passing input embeddings through multiple Transformer layers, each performing matrix multiplications and applying activation functions, finally ending in a probability distribution over the vocabulary calculated via a softmax function.
To compute the gradient of the final loss J with respect to parameters θ deep within the network (e.g., weights in an early layer), we need the chain rule. The chain rule allows us to compute the derivative of a composite function.
If we have a variable z that depends on y, and y in turn depends on x (i.e., z=f(y) and y=g(x)), the chain rule states how a change in x affects z:
dz/dx = dz/dy ⋅ dy/dx
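A quick numerical check, using arbitrarily chosen functions y = g(x) = 3x + 1 and z = f(y) = y², so that dz/dy = 2y and dy/dx = 3:
# Verify the chain rule numerically for z = (3x + 1)**2 at x = 2.0.
def g(x):
    return 3 * x + 1

def f(y):
    return y ** 2

x = 2.0
h = 1e-6
dz_dx_numeric = (f(g(x + h)) - f(g(x))) / h
dz_dx_chain = 2 * g(x) * 3  # dz/dy * dy/dx = 2y * 3 = 42.0 at x = 2.0
print(dz_dx_numeric, dz_dx_chain)  # both ~42.0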
In a neural network context, let's consider a simplified sequence: input x, layer 1 computes h=f1(x,θ1), layer 2 computes y=f2(h,θ2), and the loss is J=L(y). To find how the loss J changes with respect to parameters θ1 in the first layer, we apply the chain rule:
∂J/∂θ1 = ∂J/∂y ⋅ ∂y/∂h ⋅ ∂h/∂θ1
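To see this product of partials concretely, take a scalar toy version of the network (all functions here are illustrative stand-ins, not real layers): h = θ1·x, y = θ2·h, and J = y². Then ∂J/∂y = 2y, ∂y/∂h = θ2, and ∂h/∂θ1 = x:
# Scalar two-layer toy: h = theta1*x, y = theta2*h, J = y**2.
x, theta1, theta2 = 2.0, 0.5, 3.0
h = theta1 * x   # layer 1 output: 1.0
y = theta2 * h   # layer 2 output: 3.0
# Chain rule: dJ/dtheta1 = (dJ/dy) * (dy/dh) * (dh/dtheta1)
dJ_dy = 2 * y        # 6.0
dy_dh = theta2       # 3.0
dh_dtheta1 = x       # 2.0
print(dJ_dy * dy_dh * dh_dtheta1)         # 36.0
# Direct check: J = (theta2*theta1*x)**2, so dJ/dtheta1 = 2*theta2**2*theta1*x**2
print(2 * theta2 ** 2 * theta1 * x ** 2)  # 36.0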
Backpropagation is essentially an efficient algorithm for applying the chain rule recursively, layer by layer, starting from the final loss and working backward through the network to compute the gradient of the loss with respect to all parameters.
Figure: A simplified view of dependencies in a two-layer network. Backpropagation computes gradients like ∂J/∂θ1 by applying the chain rule backward from J.
Once we can compute the gradient ∇θJ(θ), we can use it to update the model parameters iteratively to minimize the loss. The simplest algorithm is Gradient Descent.
Starting with an initial guess for the parameters θ0, we repeatedly update them using the following rule:
θₜ₊₁ = θₜ − η∇θJ(θₜ)
Here:
- θₜ represents the parameter values at iteration t.
- η (eta) is the learning rate, a small positive scalar that controls the size of each update step.
- ∇θJ(θₜ) is the gradient of the loss evaluated at the current parameters θₜ.
This process is repeated until the loss converges to a minimum (or at least a sufficiently low value), or for a predefined number of iterations. In practice, we don't usually compute the gradient over the entire dataset (which would be Batch Gradient Descent) because datasets for LLMs are massive. Instead, we use Stochastic Gradient Descent (SGD) or Mini-batch Gradient Descent, where the gradient is estimated using only one or a small batch of training examples at each step. This introduces noise but is much more computationally efficient and often leads to better generalization.
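A minimal sketch of the mini-batch loop (the synthetic dataset, learning rate, and batch size are all illustrative choices), with the gradient worked out by hand for a one-parameter linear model:
import random

# Synthetic data: y = 2*x exactly, so the best weight is w = 2.0.
random.seed(0)
xs = [random.uniform(-1.0, 1.0) for _ in range(100)]
ys = [2.0 * x for x in xs]

w = 0.0          # initial parameter guess
eta = 0.5        # learning rate
batch_size = 10

for step in range(100):
    batch = random.sample(range(100), batch_size)
    # Batch loss: mean of (w*x - y)**2; its derivative w.r.t. w is
    # the mean of 2*(w*x - y)*x, estimated from the mini-batch alone.
    grad = sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / batch_size
    w -= eta * grad  # the gradient descent update: w <- w - eta*grad
print(w)  # close to 2.0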
Modern deep learning frameworks like PyTorch provide automatic differentiation (autograd). This means we define the forward pass of our network (how inputs produce outputs), and the framework automatically computes the gradients needed for the backward pass using the chain rule.
Here's a minimal PyTorch example demonstrating gradient calculation:
import torch
# Define some input tensor x and parameters w, b
# requires_grad=True tells PyTorch to track operations for gradient
# computation
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=False)
w = torch.tensor([0.5, -0.1, 0.2], requires_grad=True)
b = torch.tensor(0.1, requires_grad=True)
# Define a simple linear operation (forward pass)
y = torch.dot(w, x) + b
# y = w_1*x_1 + w_2*x_2 + w_3*x_3 + b
# Define a dummy loss function (e.g., square of the output)
loss = y.square()
# Compute gradients (backward pass)
loss.backward()
# Gradients are stored in the .grad attribute of the tensors
print(f"Gradient w.r.t w: {w.grad}")
print(f"Gradient w.r.t b: {b.grad}")
# Example gradient calculation check for w_1:
# loss = (w_1*x_1 + w_2*x_2 + w_3*x_3 + b)^2
# d(loss)/dw_1 = 2 * (w_1*x_1 + w_2*x_2 + w_3*x_3 + b) * x_1
# d(loss)/dw_1 = 2 * y * x_1
y_val = (0.5 * 1.0 + (-0.1) * 2.0 + 0.2 * 3.0 + 0.1)
# 0.5 - 0.2 + 0.6 + 0.1 = 1.0
grad_w1_manual = 2 * y_val * x[0].item()
# 2 * 1.0 * 1.0 = 2.0
print(f"Manually computed gradient for w_1: {grad_w1_manual}")
# Matches w.grad[0]
This automatic differentiation capability allows engineers to focus on designing complex model architectures and loss functions, while the framework handles the intricate gradient calculations required for optimization. Understanding the underlying principles of gradients and the chain rule, however, remains important for designing effective models, debugging training issues (like vanishing or exploding gradients), and implementing more advanced optimization techniques discussed later in this course.