We've established that partial derivatives measure a function's rate of change along the axes (like moving purely east or purely north on a map) and that the gradient vector, ∇f, points in the direction of the steepest ascent (the direction to climb the hill fastest).
But what if you want to know how steep the terrain is in some other specific direction? Perhaps you want to move northeast, or along an arbitrary path represented by a vector. This is precisely what the directional derivative allows us to calculate. It quantifies the rate of change of a multivariable function f at a particular point a when moving in a direction specified by a unit vector u.
Defining the Directional Derivative
The directional derivative of a function f at a point a in the direction of a unit vector u is denoted as Duf(a). It's calculated using the dot product of the gradient at that point and the direction vector:
Duf(a)=∇f(a)⋅u
Let's break this down:
Gradient ∇f(a): As we learned, this vector contains all the partial derivatives of f evaluated at a. For a function f(x,y), ∇f(a) = ⟨∂f/∂x(a), ∂f/∂y(a)⟩. It encapsulates the function's rate-of-change information along every axis direction at the point a.
Unit Vector u: This vector specifies the direction of interest. It's important that u is a unit vector, meaning its length or magnitude is 1 (∣∣u∣∣ = 1). Why? Because we only want to capture the change due to the direction itself, not scaled by the length of the direction vector. If you have a direction specified by a vector v which is not a unit vector, you must first normalize it by dividing by its magnitude: u = v/∣∣v∣∣.
Dot Product (⋅): The dot product effectively measures how much one vector "goes in the direction of" another.
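These three ingredients translate directly into a few lines of NumPy. Below is a minimal sketch (the function name `directional_derivative` and the sample gradient ⟨3, 4⟩ are illustrative, not from the text):

```python
import numpy as np

def directional_derivative(grad, v):
    """D_u f = grad . (v / ||v||): rate of change along the direction of v."""
    v = np.asarray(v, dtype=float)
    u = v / np.linalg.norm(v)  # normalize so only the direction matters
    return float(np.dot(grad, u))

# Suppose the gradient at some point is <3, 4>; moving due "east" (<1, 0>)
# picks out just the x-component of the gradient.
rate = directional_derivative(np.array([3.0, 4.0]), [1, 0])  # 3.0
```

Note that the normalization step means passing `[2, 0]` or `[1, 0]` gives the same answer: the direction, not the length, is what matters.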
Geometric Intuition: Projection
Recall that the dot product between two vectors v1 and v2 can also be expressed as v1⋅v2=∣∣v1∣∣∣∣v2∣∣cosθ, where θ is the angle between them.
Applying this to our directional derivative formula, and knowing that ∣∣u∣∣=1, we get:
Duf(a) = ∇f(a)⋅u = ∣∣∇f(a)∣∣ ∣∣u∣∣ cos θ = ∣∣∇f(a)∣∣ cos θ
Here, θ is the angle between the gradient vector ∇f(a) and the direction vector u. This formula tells us something insightful: the directional derivative is the scalar projection of the gradient vector onto the direction vector u. It's like asking, "How much of the gradient's magnitude points in the direction u?"
The directional derivative Duf is the scalar projection of the gradient vector ∇f onto the unit direction vector u. It measures the component of the gradient acting in the direction u.
This projection view helps understand the relationship between the gradient and the directional derivative:
Maximum Change: When u points in the same direction as ∇f(a), the angle θ is 0, cosθ=1, and Duf(a)=∣∣∇f(a)∣∣. The directional derivative is maximized and equals the magnitude of the gradient. This confirms the gradient points in the direction of steepest ascent.
Minimum Change (Steepest Descent): When u points directly opposite to ∇f(a), the angle θ is π (180 degrees), cosθ=−1, and Duf(a)=−∣∣∇f(a)∣∣. This is the direction of steepest descent.
Zero Change: When u is orthogonal (perpendicular) to ∇f(a), the angle θ is π/2 (90 degrees), cosθ=0, and Duf(a)=0. Moving in this direction results in zero instantaneous change in the function's value. Geometrically, you are moving along a level curve or contour line on the function's surface.
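All three cases are easy to verify numerically. A small sketch, assuming NumPy and an arbitrary example gradient ⟨3, 4⟩ with magnitude 5:

```python
import numpy as np

grad = np.array([3.0, 4.0])  # arbitrary example gradient; ||grad|| = 5

def unit(v):
    """Normalize v to unit length."""
    return v / np.linalg.norm(v)

# theta = 0: moving with the gradient gives the maximum rate, +||grad||
same = np.dot(grad, unit(grad))

# theta = pi: moving against the gradient gives the minimum rate, -||grad||
opposite = np.dot(grad, unit(-grad))

# theta = pi/2: moving perpendicular to the gradient gives zero change
perp = np.dot(grad, unit(np.array([-4.0, 3.0])))
```

Here `same` comes out to 5, `opposite` to −5, and `perp` to 0, matching the three cases above.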
Example Calculation
Let's consider the function f(x,y)=x2+y2, which describes a parabolic bowl centered at the origin. We want to find the rate of change at the point a=(1,1) in the direction of the vector v=⟨1,2⟩.
Calculate the gradient: ∇f(x,y) = ⟨∂f/∂x, ∂f/∂y⟩ = ⟨2x, 2y⟩.
Evaluate the gradient at a=(1,1): ∇f(1,1) = ⟨2(1), 2(1)⟩ = ⟨2,2⟩. This vector points directly away from the origin, the direction of steepest ascent for this bowl shape.
Find the unit vector u for direction v=⟨1,2⟩:
Magnitude: ∣∣v∣∣ = √(1² + 2²) = √(1 + 4) = √5.
Normalize: u = v/∣∣v∣∣ = ⟨1/√5, 2/√5⟩.
Calculate the directional derivative using the dot product: Duf(1,1) = ∇f(1,1)⋅u = ⟨2,2⟩⋅⟨1/√5, 2/√5⟩ = (2)(1/√5) + (2)(2/√5) = 2/√5 + 4/√5 = 6/√5.
So, at the point (1,1), if we move in the direction ⟨1,2⟩, the function f(x,y) increases at a rate of 6/√5 ≈ 2.68 units per unit distance moved. Notice this is less than the magnitude of the gradient, ∣∣∇f(1,1)∣∣ = ∣∣⟨2,2⟩∣∣ = √(2² + 2²) = √8 ≈ 2.83, which is the rate of change in the steepest direction ⟨2,2⟩.
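As a sanity check, the dot-product value should agree with the limit definition of the directional derivative, (f(a + hu) − f(a))/h for small h. A quick numerical sketch, assuming NumPy:

```python
import numpy as np

def f(x, y):
    return x**2 + y**2  # the parabolic bowl from the example

a = np.array([1.0, 1.0])      # the point of interest
v = np.array([1.0, 2.0])      # the chosen direction
u = v / np.linalg.norm(v)     # normalize to a unit vector

# Limit definition: D_u f(a) ~ (f(a + h*u) - f(a)) / h for small h
h = 1e-6
approx = (f(*(a + h * u)) - f(*a)) / h

exact = 6 / np.sqrt(5)        # the dot-product value computed above
```

The finite-difference estimate `approx` matches `exact` (≈ 2.68) to several decimal places, confirming the gradient-based formula.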
Relevance in Machine Learning
In machine learning optimization, especially with gradient descent, we are primarily interested in the direction of steepest descent, which is −∇f. However, understanding directional derivatives provides valuable context about the loss landscape. It helps us reason about why moving in the direction of the negative gradient is the most efficient step (locally) to minimize the loss function. While not typically calculated explicitly during standard gradient descent, the concept underpins our understanding of how the function behaves in the high-dimensional parameter space we navigate during model training. It reinforces the central role of the gradient in guiding optimization.
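To make the "steepest descent" claim concrete, one can sample many unit directions and compare their directional derivatives; the most negative one falls (up to sampling resolution) along −∇f/∣∣∇f∣∣. A small illustrative sketch, reusing the bowl example:

```python
import numpy as np

grad = np.array([2.0, 2.0])  # gradient of f = x^2 + y^2 at (1, 1)

# Sample unit directions around the circle and compute D_u f = grad . u for each
angles = np.linspace(0.0, 2.0 * np.pi, 360, endpoint=False)
dirs = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # 360 unit vectors
rates = dirs @ grad

# The most negative directional derivative occurs opposite the gradient
best = dirs[np.argmin(rates)]
steepest_descent = -grad / np.linalg.norm(grad)
```

Among all 360 sampled directions, the one minimizing the directional derivative coincides with `steepest_descent`, which is exactly why gradient descent steps along −∇f.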