Okay, let's extend our calculus toolkit. In the previous chapter, we saw how the derivative f′(x) measures the instantaneous rate of change of a single-variable function f(x). But machine learning models often depend on many inputs or parameters. Consider a simple cost function J(w₁, w₂) that depends on two weights, w₁ and w₂. How do we measure the rate at which J changes? Does it change faster if we adjust w₁ or w₂?
This is where partial derivatives come in. When a function has multiple input variables, a partial derivative measures how the function changes when one specific input variable changes, while all other input variables are held constant.
Imagine you have a function f(x,y). The partial derivative of f with respect to x tells us how f changes as we make a tiny adjustment to x, assuming y doesn't change at all. Similarly, the partial derivative of f with respect to y tells us how f changes as we tweak y, keeping x fixed.
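To make "hold the other variable constant" concrete, here is a small numerical sketch in Python. It uses a made-up example function (f(x, y) = x·y + y², chosen only for illustration, not from this chapter) and estimates each partial derivative by nudging one input by a tiny step h while leaving the other input untouched:

```python
# Numerically estimate partial derivatives with a small finite step.
# The function f below is an illustrative choice, not from the text.

def f(x, y):
    return x * y + y ** 2

def partial_x(f, x, y, h=1e-6):
    """Approximate ∂f/∂x: nudge x by h, keep y fixed."""
    return (f(x + h, y) - f(x, y)) / h

def partial_y(f, x, y, h=1e-6):
    """Approximate ∂f/∂y: nudge y by h, keep x fixed."""
    return (f(x, y + h) - f(x, y)) / h

print(partial_x(f, 2.0, 3.0))  # ≈ 3.0, since ∂f/∂x = y
print(partial_y(f, 2.0, 3.0))  # ≈ 8.0, since ∂f/∂y = x + 2y
```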
We use a special symbol, ∂ (often called "del" or simply "partial"), to denote partial derivatives.
For a function f(x, y), these are written ∂f/∂x and ∂f/∂y. If the function is given as, say, z = f(x, y), you might also see the notation ∂z/∂x and ∂z/∂y.
The calculation process is quite direct: to take the partial derivative with respect to one variable, treat every other variable as if it were a constant and differentiate using the ordinary single-variable rules.
Let's work through an example. Suppose we have the function: f(x, y) = x² + 3xy + y³
Finding ∂f/∂x (the partial derivative with respect to x):
We treat y as a constant. The derivative of x² with respect to x is 2x; the derivative of 3xy is 3y (since 3y acts as a constant coefficient on x); and y³ is a constant on its own, so its derivative is 0.
Putting it together: ∂f/∂x = 2x + 3y + 0 = 2x + 3y
Finding ∂f/∂y (the partial derivative with respect to y):
Now, we treat x as a constant. The term x² is a constant, so its derivative is 0; the derivative of 3xy with respect to y is 3x; and the derivative of y³ is 3y².
Putting it together: ∂f/∂y = 0 + 3x + 3y² = 3x + 3y²
Notice that the partial derivatives are, in general, still functions of both x and y. They tell you the rate of change at a specific point (x,y) in a particular direction (either the x direction or the y direction).
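If you want to sanity-check the algebra above, the short Python snippet below compares the hand-computed formulas 2x + 3y and 3x + 3y² against a central-difference approximation at an arbitrarily chosen point (2, 1):

```python
# Check the hand-computed partials of f(x, y) = x**2 + 3*x*y + y**3
# against a central-difference approximation at a sample point.

def f(x, y):
    return x ** 2 + 3 * x * y + y ** 3

def df_dx(x, y):
    return 2 * x + 3 * y         # treat y as a constant

def df_dy(x, y):
    return 3 * x + 3 * y ** 2    # treat x as a constant

x0, y0, h = 2.0, 1.0, 1e-5
num_dx = (f(x0 + h, y0) - f(x0 - h, y0)) / (2 * h)
num_dy = (f(x0, y0 + h) - f(x0, y0 - h)) / (2 * h)

print(df_dx(x0, y0), num_dx)  # both ≈ 7.0
print(df_dy(x0, y0), num_dy)  # both ≈ 9.0
```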
For a function z=f(x,y), which describes a surface in 3D space, the partial derivatives have a nice geometric meaning.
The following visualization shows the surface z = 0.5x² + y² and highlights the curves formed by slicing the surface at x = 1 and y = 1, intersecting at the point (1, 1, 1.5). The partial derivative ∂z/∂x at (1, 1) gives the slope along the red curve (where y = 1) at that point, and ∂z/∂y gives the slope along the blue curve (where x = 1) at that point.
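Even without the interactive plot, you can reproduce the two slice slopes numerically: fix one variable to turn the surface into an ordinary single-variable curve, then take an ordinary derivative. The sketch below does this for z = 0.5x² + y² at the point (1, 1):

```python
# Slopes of the slice curves of the surface z = 0.5*x**2 + y**2 at (1, 1).
# Fixing y = 1 gives a 1D curve in x; fixing x = 1 gives a 1D curve in y.

def surface(x, y):
    return 0.5 * x ** 2 + y ** 2

def slope(curve, t, h=1e-6):
    """Ordinary single-variable derivative of a 1D curve."""
    return (curve(t + h) - curve(t - h)) / (2 * h)

red_curve = lambda x: surface(x, 1.0)   # slice at y = 1
blue_curve = lambda y: surface(1.0, y)  # slice at x = 1

print(slope(red_curve, 1.0))   # ≈ 1.0, matching ∂z/∂x = x at (1, 1)
print(slope(blue_curve, 1.0))  # ≈ 2.0, matching ∂z/∂y = 2y at (1, 1)
```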
Why is this important for machine learning? Consider a very simple linear model's squared-error cost for a single data point (x, y): J(w, b) = (wx + b − y)². Here, the cost J is a function of the model parameters w (weight) and b (bias). Our goal in training is usually to find the values of w and b that minimize this cost (or a sum of such costs over many data points).
To use optimization algorithms like gradient descent, we need to know how the cost J changes as we adjust w and b. We need the partial derivatives ∂J/∂w and ∂J/∂b. Applying the chain rule while holding the other parameter constant gives ∂J/∂w = 2(wx + b − y)·x and ∂J/∂b = 2(wx + b − y).
These partial derivatives tell us the sensitivity of the error to small changes in the weight and bias, respectively. For instance, ∂J/∂w tells us how much the squared error will increase or decrease for a tiny increase in w. This information is exactly what gradient-based optimization algorithms use to update the parameters and minimize the cost function.
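As a rough illustration (the data point, initial parameters, and learning rate below are invented for this sketch), here is how those two partial derivatives drive a few gradient-descent updates for a single training example:

```python
# Partial derivatives of the single-point squared-error cost
# J(w, b) = (w*x + b - y)**2, plus a few illustrative gradient-descent steps.
# The data point, initial parameters, and learning rate are made up for this sketch.

def cost(w, b, x, y):
    return (w * x + b - y) ** 2

def dJ_dw(w, b, x, y):
    return 2 * (w * x + b - y) * x   # chain rule, b held constant

def dJ_db(w, b, x, y):
    return 2 * (w * x + b - y)       # chain rule, w held constant

x, y = 2.0, 5.0   # one training example
w, b = 0.5, 0.0   # initial parameters
lr = 0.05         # learning rate (step size)

for step in range(3):
    gw, gb = dJ_dw(w, b, x, y), dJ_db(w, b, x, y)
    w, b = w - lr * gw, b - lr * gb          # move opposite the partial derivatives
    print(step, round(cost(w, b, x, y), 4))  # cost shrinks: 4.0, 1.0, 0.25
```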
Understanding partial derivatives is the first step towards grasping the concept of the gradient, which combines all partial derivatives into a single vector that points in the direction of the steepest ascent of the function. We'll explore the gradient in the next section.