Okay, let's connect the dots. We have our simple linear model, $y = mx + b$, and we've defined a cost function, typically the Mean Squared Error (MSE), to measure how well our line fits the data. Our goal is to find the values of $m$ (slope) and $b$ (y-intercept) that make this cost as small as possible.
Gradient descent needs directions. It needs to know: "If I change $m$ slightly, how does the cost change?" and "If I change $b$ slightly, how does the cost change?" This is precisely what partial derivatives tell us. We need to calculate the gradient of the cost function, $\nabla C$, which consists of the partial derivatives with respect to each parameter: $m$ and $b$.
The Cost Function: Mean Squared Error (MSE)
Let's assume we have $N$ data points, $(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)$. For any given $m$ and $b$, our model predicts $y_{\text{pred},i} = m x_i + b$ for each input $x_i$. The actual value is $y_i$.
The Mean Squared Error cost function $C(m, b)$ is defined as the average of the squared differences between the predicted and actual values:
$$C(m, b) = \frac{1}{N} \sum_{i=1}^{N} \left(y_{\text{pred},i} - y_{\text{actual},i}\right)^2$$
Substituting our model's prediction:
$$C(m, b) = \frac{1}{N} \sum_{i=1}^{N} \left((m x_i + b) - y_i\right)^2$$
This function $C(m, b)$ takes $m$ and $b$ as inputs and outputs a single number representing the average squared error. Our task now is to find $\frac{\partial C}{\partial m}$ and $\frac{\partial C}{\partial b}$.
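Before differentiating, it can help to see this cost function in code. Here is a minimal sketch in Python (the NumPy arrays, the function name `mse_cost`, and the toy data points are my own illustrative choices, not something defined earlier):

```python
import numpy as np

def mse_cost(m, b, x, y):
    """Mean Squared Error of the line y = m*x + b over N data points."""
    y_pred = m * x + b           # y_pred,i = m*x_i + b for every i at once
    errors = y_pred - y          # error term for each data point
    return np.mean(errors ** 2)  # (1/N) * sum of squared errors

# Toy data that lies roughly along y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])

print(mse_cost(2.0, 1.0, x, y))  # small cost: this line fits well
print(mse_cost(0.0, 0.0, x, y))  # much larger cost: this line fits poorly
```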
Calculating the Partial Derivative with Respect to m ($\frac{\partial C}{\partial m}$)
To find $\frac{\partial C}{\partial m}$, we differentiate the cost function $C(m, b)$ with respect to $m$, treating $b$ (and all $x_i$, $y_i$, and $N$) as constants.
Let's look at the expression term by term:
- Constant Factor: The $\frac{1}{N}$ is a constant multiplier, so it stays put.
- Sum Rule: The derivative of a sum is the sum of the derivatives. We can move the derivative inside the summation:
$$\frac{\partial C}{\partial m} = \frac{1}{N} \sum_{i=1}^{N} \frac{\partial}{\partial m} \left(m x_i + b - y_i\right)^2$$
- Chain Rule: We need to differentiate the term $(m x_i + b - y_i)^2$ with respect to $m$. Let $u = m x_i + b - y_i$. We are differentiating $u^2$ with respect to $m$. The chain rule states $\frac{\partial}{\partial m}(u^2) = 2u \cdot \frac{\partial u}{\partial m}$.
- First part: $2u = 2(m x_i + b - y_i)$.
- Second part: We need $\frac{\partial u}{\partial m} = \frac{\partial}{\partial m}(m x_i + b - y_i)$. Since $b$ and $y_i$ are treated as constants, their derivatives are zero. The derivative of $m x_i$ with respect to $m$ is just $x_i$ (because $x_i$ is treated as a constant coefficient of $m$). So, $\frac{\partial u}{\partial m} = x_i$.
- Putting it Together: Substituting back into the chain rule formula: $\frac{\partial}{\partial m}(m x_i + b - y_i)^2 = 2(m x_i + b - y_i) \cdot x_i$.
- Final Result for $\frac{\partial C}{\partial m}$: Now substitute this back into the summation:
$$\frac{\partial C}{\partial m} = \frac{1}{N} \sum_{i=1}^{N} 2(m x_i + b - y_i)\, x_i$$
We can pull the constant 2 out of the sum:
$$\frac{\partial C}{\partial m} = \frac{2}{N} \sum_{i=1}^{N} (m x_i + b - y_i)\, x_i$$
This expression tells us how the cost changes as we slightly change the slope $m$. Notice it depends on the error term $(m x_i + b - y_i)$ and the input value $x_i$ for each data point.
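Translated into code, this formula is essentially a one-liner. The sketch below reuses the NumPy setup from the earlier snippet; the name `dC_dm` is just an illustrative choice:

```python
import numpy as np

def dC_dm(m, b, x, y):
    """Partial derivative of the MSE cost with respect to the slope m:
    (2/N) * sum_i (m*x_i + b - y_i) * x_i"""
    errors = (m * x + b) - y                    # error term for each data point
    return (2.0 / len(x)) * np.sum(errors * x)  # weighted by x_i, averaged, times 2
```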
Calculating the Partial Derivative with Respect to b ($\frac{\partial C}{\partial b}$)
Now we repeat the process, but this time we differentiate $C(m, b)$ with respect to $b$, treating $m$ as a constant.
- Constant Factor and Sum Rule: Same as before, we get:
$$\frac{\partial C}{\partial b} = \frac{1}{N} \sum_{i=1}^{N} \frac{\partial}{\partial b} \left(m x_i + b - y_i\right)^2$$
- Chain Rule: Again, let $u = m x_i + b - y_i$. We need $\frac{\partial}{\partial b}(u^2) = 2u \cdot \frac{\partial u}{\partial b}$.
- First part: $2u = 2(m x_i + b - y_i)$.
- Second part: We need $\frac{\partial u}{\partial b} = \frac{\partial}{\partial b}(m x_i + b - y_i)$. This time, $m$, $x_i$, and $y_i$ are treated as constants. The derivative of $m x_i$ with respect to $b$ is 0. The derivative of $b$ with respect to $b$ is 1. The derivative of $y_i$ with respect to $b$ is 0. So, $\frac{\partial u}{\partial b} = 1$.
- Putting it Together: $\frac{\partial}{\partial b}(m x_i + b - y_i)^2 = 2(m x_i + b - y_i) \cdot 1$.
- Final Result for $\frac{\partial C}{\partial b}$: Substitute back into the summation:
$$\frac{\partial C}{\partial b} = \frac{1}{N} \sum_{i=1}^{N} 2(m x_i + b - y_i)(1)$$
Pulling the constant 2 out:
$$\frac{\partial C}{\partial b} = \frac{2}{N} \sum_{i=1}^{N} (m x_i + b - y_i)$$
This expression tells us how the cost changes as we slightly change the y-intercept $b$. It depends only on the error term $(m x_i + b - y_i)$ for each data point.
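The corresponding sketch for the intercept (again with an illustrative name, `dC_db`) differs from `dC_dm` only in that the errors are no longer weighted by $x_i$:

```python
import numpy as np

def dC_db(m, b, x, y):
    """Partial derivative of the MSE cost with respect to the intercept b:
    (2/N) * sum_i (m*x_i + b - y_i)"""
    errors = (m * x + b) - y                # same error term as before
    return (2.0 / len(x)) * np.sum(errors)  # no x_i factor this time
```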
The Gradient Vector ∇C
We've successfully calculated the partial derivatives! We can now assemble them into the gradient vector for our cost function $C(m, b)$:
$$\nabla C(m, b) = \left[\frac{\partial C}{\partial m},\; \frac{\partial C}{\partial b}\right]$$
Substituting the expressions we found:
$$\nabla C(m, b) = \left[\frac{2}{N} \sum_{i=1}^{N} (m x_i + b - y_i)\, x_i,\; \frac{2}{N} \sum_{i=1}^{N} (m x_i + b - y_i)\right]$$
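In code, assembling the gradient amounts to computing both partial derivatives from the same error terms. This is a sketch under the same assumptions as the earlier snippets (NumPy arrays, illustrative function names):

```python
import numpy as np

def gradient(m, b, x, y):
    """Gradient of the MSE cost, returned as [dC/dm, dC/db]."""
    errors = (m * x + b) - y                    # shared error term
    dC_dm = (2.0 / len(x)) * np.sum(errors * x)
    dC_db = (2.0 / len(x)) * np.sum(errors)
    return np.array([dC_dm, dC_db])
```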
This vector, $\nabla C$, is fundamental. For any given values of $m$ and $b$:
- The first component ($\frac{\partial C}{\partial m}$) tells us the rate of change of the cost function if we move purely in the $m$ direction. Its sign tells us if increasing $m$ increases or decreases the cost.
- The second component ($\frac{\partial C}{\partial b}$) tells us the rate of change of the cost function if we move purely in the $b$ direction. Its sign tells us if increasing $b$ increases or decreases the cost.
Crucially, the gradient vector $\nabla C$ points in the direction of the steepest increase of the cost function $C(m, b)$ at the current point $(m, b)$. Since we want to minimize the cost, gradient descent will involve taking steps in the direction opposite to the gradient.
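One way to sanity-check the derivation, sketched below under the same assumptions as the snippets above (and reusing the hypothetical `mse_cost` and `gradient` helpers), is to compare the analytical gradient against a finite-difference approximation and to confirm that a small step against the gradient really does lower the cost:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])
m, b = 0.5, 0.0   # an arbitrary starting point

# Finite-difference approximation of each partial derivative
eps = 1e-6
num_dm = (mse_cost(m + eps, b, x, y) - mse_cost(m - eps, b, x, y)) / (2 * eps)
num_db = (mse_cost(m, b + eps, x, y) - mse_cost(m, b - eps, x, y)) / (2 * eps)
print(gradient(m, b, x, y), [num_dm, num_db])   # the two should closely agree

# A small step *against* the gradient should reduce the cost
step = 0.01
dm, db = gradient(m, b, x, y)
print(mse_cost(m - step * dm, b - step * db, x, y) < mse_cost(m, b, x, y))  # True
```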
With these calculated gradients, we now have the precise information needed to iteratively update m and b to find the line that best fits our data. The next step is to see how these gradients are used within the gradient descent algorithm itself.