The derivative f′(x) measures the instantaneous rate of change of a single-variable function f(x). Many machine learning models, however, depend on multiple inputs or parameters. Consider a simple cost function J(w1,w2) that depends on two weights, w1 and w2. How do we measure how J changes? Does it change faster if we adjust w1 or w2?
This is where partial derivatives come in. When a function has multiple input variables, a partial derivative measures how the function changes when one specific input variable changes, while all other input variables are held constant.
Defining the Partial Derivative
Imagine you have a function f(x,y).
The partial derivative of f with respect to x tells us how f changes as we make a tiny adjustment to x, assuming y doesn't change at all. Similarly, the partial derivative of f with respect to y tells us how f changes as we tweak y, keeping x fixed.
Notation
We use a special symbol, ∂ (often called "del" or simply "partial"), to denote partial derivatives.
- The partial derivative of f with respect to x is written as ∂f/∂x or sometimes f_x.
- The partial derivative of f with respect to y is written as ∂f/∂y or sometimes f_y.
If the function is, say, z = f(x, y), you might also see the notation ∂z/∂x and ∂z/∂y.
How to Calculate Partial Derivatives
The calculation process is quite direct:
- Identify the variable you are differentiating with respect to (e.g., x).
- Treat all other independent variables as constants (e.g., treat y as if it were a number like 5 or c).
- Apply the standard differentiation rules you learned for single-variable functions.
Let's work through an example. Suppose we have the function:
f(x, y) = x² + 3xy + y³
Finding ∂f/∂x (the partial derivative with respect to x):
We treat y as a constant.
- The derivative of x² with respect to x is 2x.
- The derivative of 3xy with respect to x is 3y (think of 3y as a constant coefficient multiplying x).
- The derivative of y³ with respect to x is 0 (since y is treated as a constant, y³ is also a constant, and the derivative of a constant is zero).
Putting it together:
∂f/∂x = 2x + 3y + 0 = 2x + 3y
Finding ∂f/∂y (the partial derivative with respect to y):
Now, we treat x as a constant.
- The derivative of x² with respect to y is 0 (it's treated as a constant).
- The derivative of 3xy with respect to y is 3x (think of 3x as a constant coefficient multiplying y).
- The derivative of y³ with respect to y is 3y².
Putting it together:
∂f/∂y = 0 + 3x + 3y² = 3x + 3y²
Notice that the partial derivatives are, in general, still functions of both x and y. They tell you the rate of change at a specific point (x,y) in a particular direction (either the x direction or the y direction).
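To make the "hold the other variable fixed" idea concrete, here is a small Python sketch that checks the two formulas we just derived against central finite differences. The function name, evaluation point, and step size are arbitrary choices for illustration, not part of the original text.

```python
# Numerical sanity check of the partial derivatives of f(x, y) = x^2 + 3xy + y^3.
# A central finite difference nudges one variable while the other stays constant.

def f(x, y):
    return x**2 + 3*x*y + y**3

def df_dx(x, y, h=1e-6):
    # Nudge x only; y is held fixed.
    return (f(x + h, y) - f(x - h, y)) / (2 * h)

def df_dy(x, y, h=1e-6):
    # Nudge y only; x is held fixed.
    return (f(x, y + h) - f(x, y - h)) / (2 * h)

x0, y0 = 2.0, 1.0
print(df_dx(x0, y0), 2*x0 + 3*y0)      # numeric approx. 7.0 vs. analytic 2x + 3y = 7.0
print(df_dy(x0, y0), 3*x0 + 3*y0**2)   # numeric approx. 9.0 vs. analytic 3x + 3y² = 9.0
```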
Geometric Interpretation
For a function z=f(x,y), which describes a surface in 3D space, the partial derivatives have a nice geometric meaning.
- ∂f/∂x at a point (x₀, y₀) is the slope of the tangent line to the curve formed by intersecting the surface z = f(x, y) with the plane y = y₀. It's the slope along the x-direction.
- ∂f/∂y at a point (x₀, y₀) is the slope of the tangent line to the curve formed by intersecting the surface z = f(x, y) with the plane x = x₀. It's the slope along the y-direction.
The following visualization shows the surface z = 0.5x² + y² and highlights the curves formed by slicing the surface at x = 1 and y = 1, intersecting at the point (1, 1, 1.5). The partial derivative ∂z/∂x at (1, 1) gives the slope along the red curve (y = 1) at that point, and ∂z/∂y gives the slope along the blue curve (x = 1) at that point.
The partial derivative ∂f/∂x at (1, 1) represents the slope of the surface along the red curve (where y = 1) at the orange point. The partial derivative ∂f/∂y at (1, 1) represents the slope along the blue curve (where x = 1) at the orange point.
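If you would like to reproduce a figure along these lines yourself, the sketch below plots the surface, the two slice curves, and the intersection point. It assumes NumPy and Matplotlib are available; the grid range, colors, and styling are arbitrary choices.

```python
# Sketch of the surface z = 0.5*x**2 + y**2 with slice curves at y = 1 and x = 1.
import numpy as np
import matplotlib.pyplot as plt

def f(x, y):
    return 0.5 * x**2 + y**2

# Grid for the surface
x = np.linspace(-2, 2, 60)
y = np.linspace(-2, 2, 60)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)

fig = plt.figure(figsize=(7, 6))
ax = fig.add_subplot(projection="3d")
ax.plot_surface(X, Y, Z, alpha=0.4, cmap="viridis")

t = np.linspace(-2, 2, 100)
# Slice at y = 1: the curve z = f(x, 1); its slope at x = 1 is the partial dz/dx = 1
ax.plot(t, np.ones_like(t), f(t, 1.0), color="red", linewidth=2, label="slice y = 1")
# Slice at x = 1: the curve z = f(1, y); its slope at y = 1 is the partial dz/dy = 2
ax.plot(np.ones_like(t), t, f(1.0, t), color="blue", linewidth=2, label="slice x = 1")

# The point where the two slices meet: (1, 1, 1.5)
ax.scatter([1], [1], [f(1.0, 1.0)], color="orange", s=50)

ax.set_xlabel("x"); ax.set_ylabel("y"); ax.set_zlabel("z")
ax.legend()
plt.show()
```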
Relevance to Machine Learning
Why is this important for machine learning? Consider a very simple linear model's squared error cost for a single data point (x,y):
J(w, b) = (wx + b − y)²
Here, the cost J is a function of the model parameters w (weight) and b (bias). Our goal in training is usually to find the values of w and b that minimize this cost (or a sum of such costs over many data points).
To use optimization algorithms like gradient descent, we need to know how the cost J changes as we adjust w and b. We need the partial derivatives:
- Partial derivative with respect to w: Treat b, x, and y as constants. Use the chain rule:
∂J/∂w = 2(wx + b − y)¹ ⋅ ∂(wx + b − y)/∂w
∂J/∂w = 2(wx + b − y) ⋅ (x)
- Partial derivative with respect to b: Treat w, x, and y as constants. Use the chain rule:
∂J/∂b = 2(wx + b − y)¹ ⋅ ∂(wx + b − y)/∂b
∂J/∂b = 2(wx + b − y) ⋅ (1)
These partial derivatives tell us the sensitivity of the error to small changes in the weight and bias, respectively. For instance, ∂J/∂w tells us how much the squared error will increase or decrease for a tiny increase in w. This information is exactly what gradient-based optimization algorithms use to update the parameters and minimize the cost function.
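As a sketch of that update rule, the loop below runs plain gradient descent on the single-example cost, reusing J, dJ_dw, and dJ_db from the previous snippet. The learning rate and step count are arbitrary choices.

```python
# Gradient descent on the single-example cost, reusing J, dJ_dw, dJ_db from above.
learning_rate = 0.02
w, b = 0.0, 0.0
x, y = 3.0, 5.0  # the same training example

for step in range(100):
    grad_w = dJ_dw(w, b, x, y)  # sensitivity of J to w
    grad_b = dJ_db(w, b, x, y)  # sensitivity of J to b
    # Move each parameter a small step against its partial derivative
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(w, b, J(w, b, x, y))  # w*x + b is now close to 5, so the cost is near 0
```

Each step moves (w, b) a little in the direction that most quickly reduces the error for this data point, which is exactly the role the partial derivatives play in training.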
Understanding partial derivatives is the first step towards grasping the concept of the gradient, which combines all partial derivatives into a single vector that points in the direction of the steepest ascent of the function. We'll explore the gradient in the next section.