Okay, let's put theory into practice. You've learned that partial derivatives let us find the rate of change of a multi-variable function with respect to one variable, while holding the others constant. You also know that the gradient vector bundles these partial derivatives together, pointing in the direction of the steepest ascent of the function. Now, it's time to get comfortable with calculating these yourself.
Remember, the process relies heavily on the derivative rules you learned earlier (like the power rule and sum rule), with one extra step: treating certain variables as constants during differentiation.
Example 1: A Simple Polynomial Function
Let's start with a function of two variables, x and y:
f(x, y) = 2x³ + 4xy² − y⁵ + 7
Our goal is to find the partial derivatives, ∂f/∂x and ∂f/∂y, and then form the gradient vector, ∇f(x, y).
Calculating ∂f/∂x (Partial Derivative with respect to x)
To find ∂f/∂x, we treat y as if it were a constant number (like 5, or -2, or π).
- Term 1 (2x³): The derivative of 2x³ with respect to x is 2⋅(3x³⁻¹) = 6x². (Standard power rule).
- Term 2 (4xy²): Here, we treat 4 and y² as constants multiplied by x. The derivative of (constant × x) with respect to x is just the constant. So, the derivative of 4xy² with respect to x is 4y².
- Term 3 (−y⁵): Since we treat y as a constant, y⁵ is also a constant. The derivative of any constant with respect to x is 0.
- Term 4 (7): This is a constant, so its derivative with respect to x is 0.
Putting it together:
∂f/∂x = 6x² + 4y² + 0 + 0 = 6x² + 4y²
Calculating ∂f/∂y (Partial Derivative with respect to y)
Now, we switch perspectives. To find ∂f/∂y, we treat x as if it were a constant.
- Term 1 (2x³): Since x is treated as a constant, 2x³ is also constant. Its derivative with respect to y is 0.
- Term 2 (4xy²): Treat 4x as the constant coefficient of y². The derivative of (constant × y²) with respect to y is (constant × 2y). So, the derivative is 4x⋅(2y) = 8xy.
- Term 3 (−y⁵): The derivative of −y⁵ with respect to y is −5y⁵⁻¹ = −5y⁴. (Standard power rule).
- Term 4 (7): This is a constant, so its derivative with respect to y is 0.
Putting it together:
∂f/∂y = 0 + 8xy − 5y⁴ + 0 = 8xy − 5y⁴
Forming the Gradient Vector ∇f(x,y)
The gradient vector is simply a vector containing the partial derivatives. By convention, we list them in the order of the variables (x, then y).
∇f(x, y) = [∂f/∂x, ∂f/∂y] = [6x² + 4y², 8xy − 5y⁴]
Evaluating the Gradient at a Point
The gradient itself is a function; it gives us the direction of steepest ascent at any point (x,y). Let's find the gradient at the specific point (x,y)=(1,2). We substitute x=1 and y=2 into our expressions for the partial derivatives:
- ∂f/∂x at (1, 2): 6(1)² + 4(2)² = 6(1) + 4(4) = 6 + 16 = 22
- ∂f/∂y at (1, 2): 8(1)(2) − 5(2)⁴ = 16 − 5(16) = 16 − 80 = −64
So, the gradient vector at the point (1,2) is:
∇f(1, 2) = [22, −64]
This vector tells us that, starting from the point (1, 2), the direction of steepest increase of f has a positive x-component (22) and a much larger negative y-component (−64); in other words, the steepest ascent points mostly toward decreasing y. In optimization (like gradient descent), we would typically move in the opposite direction, −∇f(1, 2), to decrease the function's value.
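If you want to double-check results like these, a computer algebra system can do it for you. Below is a minimal sketch using SymPy (assuming it is installed, e.g. via pip install sympy): it recomputes the partial derivatives of Example 1 symbolically and evaluates them at (1, 2).

```python
# Sketch: verifying Example 1 with SymPy (assumes sympy is installed).
import sympy as sp

x, y = sp.symbols('x y')
f = 2*x**3 + 4*x*y**2 - y**5 + 7

df_dx = sp.diff(f, x)   # expect 6*x**2 + 4*y**2 (possibly in a different term order)
df_dy = sp.diff(f, y)   # expect 8*x*y - 5*y**4
print(df_dx, "|", df_dy)

# Evaluate the gradient at the point (1, 2); expect 22 and -64.
print(df_dx.subs({x: 1, y: 2}), df_dy.subs({x: 1, y: 2}))
```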
Example 2: Function with Interaction Term
Let's try another one. Consider a function g(w₁, w₂) which might represent a simple cost function depending on two weights, w₁ and w₂.
g(w₁, w₂) = 3w₁² − 5w₁w₂ + 2w₂²
Calculating ∂g/∂w₁
Treat w₂ as a constant.
- Term 1 (3w₁²): Derivative w.r.t. w₁ is 3(2w₁) = 6w₁.
- Term 2 (−5w₁w₂): Treat −5w₂ as the constant coefficient of w₁. Derivative w.r.t. w₁ is −5w₂.
- Term 3 (2w₂²): Treat w₂ as constant, so 2w₂² is constant. Derivative w.r.t. w₁ is 0.
Result:
∂g/∂w₁ = 6w₁ − 5w₂
Calculating ∂g/∂w₂
Treat w₁ as a constant.
- Term 1 (3w₁²): Treat w₁ as constant, so 3w₁² is constant. Derivative w.r.t. w₂ is 0.
- Term 2 (−5w₁w₂): Treat −5w₁ as the constant coefficient of w₂. Derivative w.r.t. w₂ is −5w₁.
- Term 3 (2w₂²): Derivative w.r.t. w₂ is 2(2w₂) = 4w₂.
Result:
∂g/∂w₂ = −5w₁ + 4w₂
Forming the Gradient Vector ∇g(w₁, w₂)
∇g(w₁, w₂) = [∂g/∂w₁, ∂g/∂w₂] = [6w₁ − 5w₂, −5w₁ + 4w₂]
Evaluating the Gradient at a Point
Let's evaluate the gradient at (w₁, w₂) = (2, −1).
- ∂g/∂w₁ at (2, −1): 6(2) − 5(−1) = 12 + 5 = 17
- ∂g/∂w₂ at (2, −1): −5(2) + 4(−1) = −10 − 4 = −14
So, the gradient vector at the point (2,−1) is:
∇g(2, −1) = [17, −14]
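Another way to sanity-check a hand-computed gradient, with no extra libraries, is a central finite-difference approximation: nudge each weight by a tiny amount h and measure how g changes. The sketch below (the helper name numeric_grad and the step size h are just illustrative choices) should reproduce 17 and −14 up to rounding error.

```python
# Sketch: checking Example 2 numerically with central finite differences.
def g(w1, w2):
    return 3*w1**2 - 5*w1*w2 + 2*w2**2

def numeric_grad(fn, w1, w2, h=1e-5):
    # Approximate each partial derivative by a symmetric difference quotient.
    dg_dw1 = (fn(w1 + h, w2) - fn(w1 - h, w2)) / (2*h)
    dg_dw2 = (fn(w1, w2 + h) - fn(w1, w2 - h)) / (2*h)
    return dg_dw1, dg_dw2

print(numeric_grad(g, 2, -1))   # expect approximately (17.0, -14.0)
```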
Why Practice This?
Calculating partial derivatives and gradients is a fundamental mechanical skill needed for understanding and implementing optimization algorithms in machine learning. When we train models, we often have a cost function that depends on many parameters (weights and biases). Gradient descent uses the gradient of this cost function to iteratively update the parameters in the direction that minimizes the cost. You'll see this process in action in the next chapter.
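As a small preview (the full treatment comes in the next chapter), here is a hedged sketch of what that update looks like in code, reusing the hand-derived gradient of g from Example 2. The learning rate of 0.1 and the five steps are arbitrary illustrative choices, and g is only a toy function, so the point is simply to see the rule "new weights = old weights − learning rate × gradient" in action.

```python
# Sketch: a few gradient descent updates using the gradient of g derived above.
def grad_g(w1, w2):
    return 6*w1 - 5*w2, -5*w1 + 4*w2     # the partial derivatives we computed by hand

w1, w2 = 2.0, -1.0                       # start at the point we evaluated, (2, -1)
lr = 0.1                                 # learning rate (step size), chosen arbitrarily
for step in range(5):
    d1, d2 = grad_g(w1, w2)
    w1, w2 = w1 - lr * d1, w2 - lr * d2  # move against the gradient to decrease g
    g_val = 3*w1**2 - 5*w1*w2 + 2*w2**2  # value of g drops over these few steps
    print(step, round(w1, 4), round(w2, 4), round(g_val, 4))
```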
Spend some time working through these examples and perhaps try creating and differentiating a few simple functions of your own. The more comfortable you are with these calculations, the clearer the optimization process will become.