Okay, let's put the theory into practice. We've discussed cost functions, gradients, and the basic idea of gradient descent. Now, we'll manually calculate the gradient for a simple linear regression model using a tiny dataset. This exercise will solidify your understanding of how derivatives drive the optimization process.
Setting the Scene: A Simple Problem
Imagine we have a very small dataset with just three points (x,y): (1,2), (2,3), and (3,5). Our goal is to find the best-fitting line of the form y=mx+b for these points.
Figure: Our tiny dataset containing three points, along with an initial guess for our line: y = 0x + 0.
We need a way to measure how "good" our line is. We'll use the Mean Squared Error (MSE) as our cost function, which we discussed earlier. For our three data points $(x_1, y_1), (x_2, y_2), (x_3, y_3)$, the MSE is:
$$J(m,b) = \frac{1}{3}\sum_{i=1}^{3}\left(y_i - \hat{y}_i\right)^2 = \frac{1}{3}\sum_{i=1}^{3}\left(y_i - (mx_i + b)\right)^2$$
Let's start with an initial guess for our parameters: m=0 and b=0. Our line is initially y=0.
Calculating the Initial Cost
First, let's calculate the cost for our initial guess (m=0,b=0).
The predicted values ($\hat{y}_i = mx_i + b$) are:
- $\hat{y}_1 = 0(1) + 0 = 0$
- $\hat{y}_2 = 0(2) + 0 = 0$
- $\hat{y}_3 = 0(3) + 0 = 0$
The squared errors are:
- $(y_1 - \hat{y}_1)^2 = (2 - 0)^2 = 4$
- $(y_2 - \hat{y}_2)^2 = (3 - 0)^2 = 9$
- $(y_3 - \hat{y}_3)^2 = (5 - 0)^2 = 25$
The Mean Squared Error is:
$$J(0,0) = \frac{1}{3}(4 + 9 + 25) = \frac{38}{3} \approx 12.67$$
This is our starting cost. Our goal is to reduce this value by adjusting m and b.
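If you want to verify these numbers yourself, here is a minimal Python sketch; the names `data`, `predict`, and `mse` are just illustrative choices, not part of any particular library:

```python
# Our three data points (x, y)
data = [(1, 2), (2, 3), (3, 5)]

def predict(m, b, x):
    """Prediction of the line y = m*x + b at a single x."""
    return m * x + b

def mse(m, b, points):
    """Mean Squared Error J(m, b) over the dataset."""
    return sum((y - predict(m, b, x)) ** 2 for x, y in points) / len(points)

print(mse(0, 0, data))  # 12.666..., i.e. 38/3
```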
Calculating the Gradients
Now for the core calculus part: finding the gradient of the cost function J(m,b). The gradient is a vector containing the partial derivatives with respect to each parameter: $\nabla J = \left[\frac{\partial J}{\partial m}, \frac{\partial J}{\partial b}\right]$. These derivatives tell us how the cost changes as we slightly change m or b.
Let's find $\frac{\partial J}{\partial m}$ and $\frac{\partial J}{\partial b}$. Remember, when taking a partial derivative with respect to one variable (like m), we treat the other variables (like b) as constants.
Partial Derivative with respect to m ($\frac{\partial J}{\partial m}$)
We start with the cost function:
$$J(m,b) = \frac{1}{3}\left[(y_1 - (mx_1 + b))^2 + (y_2 - (mx_2 + b))^2 + (y_3 - (mx_3 + b))^2\right]$$
We differentiate term by term with respect to m. Let's focus on one term: $(y_i - (mx_i + b))^2$. We use the chain rule. Let $u = y_i - mx_i - b$. Then the term is $u^2$.
The derivative of $u^2$ with respect to m is $2u \cdot \frac{\partial u}{\partial m}$.
Now, we find $\frac{\partial u}{\partial m} = \frac{\partial}{\partial m}(y_i - mx_i - b)$. Since $y_i$, $x_i$, and b are treated as constants when differentiating with respect to m, this simplifies to $\frac{\partial}{\partial m}(-mx_i) = -x_i$.
So, the derivative of $(y_i - (mx_i + b))^2$ with respect to m is $2(y_i - mx_i - b)(-x_i)$.
Applying this to our cost function J(m,b):
$$\frac{\partial J}{\partial m} = \frac{1}{3}\sum_{i=1}^{3} 2\left(y_i - (mx_i + b)\right)(-x_i)$$
$$\frac{\partial J}{\partial m} = -\frac{2}{3}\sum_{i=1}^{3} x_i\left(y_i - (mx_i + b)\right)$$
Partial Derivative with respect to b ($\frac{\partial J}{\partial b}$)
Similarly, we differentiate J(m,b) with respect to b. Again, consider one term $(y_i - (mx_i + b))^2$, and let $u = y_i - mx_i - b$.
The derivative with respect to b is $2u \cdot \frac{\partial u}{\partial b}$.
Now, $\frac{\partial u}{\partial b} = \frac{\partial}{\partial b}(y_i - mx_i - b)$. Treating $y_i$, m, and $x_i$ as constants, this is $\frac{\partial}{\partial b}(-b) = -1$.
So, the derivative of $(y_i - (mx_i + b))^2$ with respect to b is $2(y_i - mx_i - b)(-1)$.
Applying this to the cost function:
$$\frac{\partial J}{\partial b} = \frac{1}{3}\sum_{i=1}^{3} 2\left(y_i - (mx_i + b)\right)(-1)$$
$$\frac{\partial J}{\partial b} = -\frac{2}{3}\sum_{i=1}^{3}\left(y_i - (mx_i + b)\right)$$
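These two sums translate almost directly into code. Here is a small sketch of the analytic gradients; `grad_m` and `grad_b` are hypothetical helper names, and the data list is the same toy dataset as above:

```python
data = [(1, 2), (2, 3), (3, 5)]

def grad_m(m, b, points):
    """dJ/dm = -(2/n) * sum of x_i * (y_i - (m*x_i + b))."""
    n = len(points)
    return -2 / n * sum(x * (y - (m * x + b)) for x, y in points)

def grad_b(m, b, points):
    """dJ/db = -(2/n) * sum of (y_i - (m*x_i + b))."""
    n = len(points)
    return -2 / n * sum(y - (m * x + b) for x, y in points)
```

We'll evaluate these at m = 0, b = 0 in the next step.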
Numerical Gradient Calculation
Now we plug our data points (1,2), (2,3), (3,5) and our current parameters m=0, b=0 into these formulas. Remember, our predictions were $\hat{y}_1 = \hat{y}_2 = \hat{y}_3 = 0$.
Calculating $\frac{\partial J}{\partial m}$:
$$\frac{\partial J}{\partial m} = -\frac{2}{3}\left[x_1(y_1 - \hat{y}_1) + x_2(y_2 - \hat{y}_2) + x_3(y_3 - \hat{y}_3)\right]$$
$$\frac{\partial J}{\partial m} = -\frac{2}{3}\left[1(2 - 0) + 2(3 - 0) + 3(5 - 0)\right]$$
$$\frac{\partial J}{\partial m} = -\frac{2}{3}\left[1(2) + 2(3) + 3(5)\right]$$
$$\frac{\partial J}{\partial m} = -\frac{2}{3}\left[2 + 6 + 15\right] = -\frac{2}{3}(23) = -\frac{46}{3} \approx -15.33$$
Calculating $\frac{\partial J}{\partial b}$:
$$\frac{\partial J}{\partial b} = -\frac{2}{3}\left[(y_1 - \hat{y}_1) + (y_2 - \hat{y}_2) + (y_3 - \hat{y}_3)\right]$$
$$\frac{\partial J}{\partial b} = -\frac{2}{3}\left[(2 - 0) + (3 - 0) + (5 - 0)\right]$$
$$\frac{\partial J}{\partial b} = -\frac{2}{3}\left[2 + 3 + 5\right] = -\frac{2}{3}(10) = -\frac{20}{3} \approx -6.67$$
So, at m=0, b=0, the gradient is $\nabla J \approx [-15.33, -6.67]$.
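A quick way to sanity-check the hand calculation is a finite-difference approximation: nudge each parameter by a tiny amount and measure how the cost changes. Here is a minimal sketch, assuming the `mse` function from earlier and an arbitrary step size `eps`:

```python
data = [(1, 2), (2, 3), (3, 5)]

def mse(m, b, points):
    return sum((y - (m * x + b)) ** 2 for x, y in points) / len(points)

def numerical_gradient(m, b, points, eps=1e-6):
    """Central-difference approximation of [dJ/dm, dJ/db]."""
    dJ_dm = (mse(m + eps, b, points) - mse(m - eps, b, points)) / (2 * eps)
    dJ_db = (mse(m, b + eps, points) - mse(m, b - eps, points)) / (2 * eps)
    return dJ_dm, dJ_db

print(numerical_gradient(0, 0, data))  # approximately (-15.33, -6.67)
```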
Interpreting the Gradient
What do these numbers tell us?
- $\frac{\partial J}{\partial m} \approx -15.33$: This is negative, meaning if we increase m slightly, the cost J will decrease. The magnitude (15.33) indicates the cost is quite sensitive to changes in m.
- $\frac{\partial J}{\partial b} \approx -6.67$: This is also negative. Increasing b slightly will also decrease the cost J. The cost is less sensitive to changes in b compared to m at this point.
The gradient [−15.33,−6.67] points in the direction of the steepest increase in cost. To decrease the cost (which is our goal in optimization), we need to move in the opposite direction of the gradient. This means we should increase both m and b.
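You can check this direction claim numerically: take a tiny step along the gradient and a tiny step against it, and compare the resulting costs. A quick sketch (the step size `t = 0.001` is an arbitrary illustrative choice):

```python
data = [(1, 2), (2, 3), (3, 5)]

def mse(m, b, points):
    return sum((y - (m * x + b)) ** 2 for x, y in points) / len(points)

g_m, g_b = -46 / 3, -20 / 3   # the gradient we just computed at (0, 0)
t = 0.001                     # tiny step size

print(mse(0 + t * g_m, 0 + t * g_b, data))  # step WITH the gradient: ~12.95, cost goes up
print(mse(0 - t * g_m, 0 - t * g_b, data))  # step AGAINST the gradient: ~12.39, cost goes down
```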
Taking a Small Step (Gradient Descent)
Let's perform one step of gradient descent. We need a learning rate, α. Let's choose a small value, say α=0.01.
The update rules are:
- $m_{\text{new}} = m_{\text{old}} - \alpha \frac{\partial J}{\partial m}$
- $b_{\text{new}} = b_{\text{old}} - \alpha \frac{\partial J}{\partial b}$
Plugging in our values ($m_{\text{old}} = 0$, $b_{\text{old}} = 0$, $\alpha = 0.01$, $\frac{\partial J}{\partial m} = -\frac{46}{3}$, $\frac{\partial J}{\partial b} = -\frac{20}{3}$):
- $m_{\text{new}} = 0 - (0.01)\left(-\frac{46}{3}\right) = 0.01 \times \frac{46}{3} = \frac{0.46}{3} \approx 0.153$
- $b_{\text{new}} = 0 - (0.01)\left(-\frac{20}{3}\right) = 0.01 \times \frac{20}{3} = \frac{0.20}{3} \approx 0.067$
Our new parameters are approximately m≈0.153 and b≈0.067.
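Here is the same single update in code, folded into one self-contained snippet; the gradient helpers follow the formulas derived above:

```python
data = [(1, 2), (2, 3), (3, 5)]

def grad_m(m, b, points):
    return -2 / len(points) * sum(x * (y - (m * x + b)) for x, y in points)

def grad_b(m, b, points):
    return -2 / len(points) * sum(y - (m * x + b) for x, y in points)

m, b = 0.0, 0.0   # initial guess
alpha = 0.01      # learning rate

# Evaluate both gradients at the current (m, b) before updating either parameter.
dm, db = grad_m(m, b, data), grad_b(m, b, data)
m, b = m - alpha * dm, b - alpha * db

print(round(m, 3), round(b, 3))  # 0.153 0.067
```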
Let's visualize the new line y≈0.153x+0.067:
Figure: Our data points along with the initial line (dashed red) and the line after one gradient descent step (solid green). The new line is slightly closer to the data points.
If we recalculate the cost with the new parameters, we find J(0.153, 0.067) ≈ 10.03, noticeably lower than our initial cost of 12.67. By repeating this process of calculating gradients and updating parameters, gradient descent iteratively finds better values for m and b, minimizing the cost function.
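To see that in action, here is a sketch of the full loop; the 1,000 iterations and the learning rate of 0.01 are arbitrary choices for this toy problem, not tuned values:

```python
data = [(1, 2), (2, 3), (3, 5)]

def mse(m, b, points):
    return sum((y - (m * x + b)) ** 2 for x, y in points) / len(points)

def gradients(m, b, points):
    n = len(points)
    dm = -2 / n * sum(x * (y - (m * x + b)) for x, y in points)
    db = -2 / n * sum(y - (m * x + b) for x, y in points)
    return dm, db

m, b, alpha = 0.0, 0.0, 0.01
for step in range(1000):
    dm, db = gradients(m, b, data)
    m, b = m - alpha * dm, b - alpha * db
    if step % 200 == 0:
        # The cost should fall steadily from roughly 12.67 toward its minimum.
        print(f"step {step:4d}: J = {mse(m, b, data):.4f}")

print(f"final line: y = {m:.3f}x + {b:.3f}")
```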
This manual calculation shows the mechanism: derivatives (gradients) tell us the direction to adjust parameters (m and b) to improve our model by reducing the cost function (MSE). While libraries automate this, understanding the underlying calculation is fundamental to grasping how machine learning models learn.