Okay, let's put the theory into practice. We've seen how derivatives identify the slope of a function and help locate points where that slope is zero, which often correspond to minimum or maximum values. This is directly applicable to a fundamental task in machine learning: minimizing a cost function.
In machine learning, a cost function (also called a loss function or objective function) measures how well your model is performing. Specifically, it quantifies the difference between the model's predictions and the actual target values. A lower cost signifies better performance; the goal of training a model is usually to find the model parameters that result in the minimum possible cost.
For now, let's consider a very simplified scenario in which our model has only one parameter, which we'll call w. Imagine this parameter controls the output of our model. Our cost function, J(w), tells us the "cost" associated with a particular value of w. Our objective is to find the value of w that minimizes J(w).
A common and mathematically convenient form for cost functions is a quadratic function. Let's define a simple cost function as:
$$J(w) = (w - 3)^2 + 2$$
Here, J(w) represents the cost for a given parameter value w. We want to find the value of w where the cost J(w) is smallest. Visually, this function is a parabola opening upwards.
The cost function $J(w) = (w - 3)^2 + 2$. The minimum cost occurs at $w = 3$.
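To make this concrete, here is a minimal Python sketch (the function and variable names are our own choices for illustration) that evaluates J(w) over a grid of candidate values and reports where the cost is smallest:

```python
import numpy as np

def cost(w):
    """Cost function J(w) = (w - 3)^2 + 2."""
    return (w - 3) ** 2 + 2

# Evaluate the cost on a grid of candidate parameter values.
w_values = np.linspace(-2.0, 8.0, 1001)
costs = cost(w_values)

# The grid point with the smallest cost approximates the minimum.
best_index = np.argmin(costs)
print(f"Minimum cost of about {costs[best_index]:.4f} at w = {w_values[best_index]:.4f}")
# Expected: minimum cost of about 2.0000 at w = 3.0000
```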
How do we find this minimum mathematically using calculus?
We calculate the first derivative of J(w) with respect to w. Using the power rule and chain rule (or simply expanding the square first), we get:
$$J'(w) = \frac{dJ}{dw} = \frac{d}{dw}\left[(w - 3)^2 + 2\right]$$
$$J'(w) = 2(w - 3)^1 \cdot \frac{d}{dw}(w - 3) + 0$$
$$J'(w) = 2(w - 3) \cdot 1$$
$$J'(w) = 2w - 6$$
This derivative, J′(w)=2w−6, tells us the slope of the cost function at any given value of w.
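As a quick sanity check, we can compare this analytic derivative against a numerical (finite-difference) approximation of the slope. This is a small illustrative sketch in plain Python; the helper names are our own:

```python
def cost(w):
    return (w - 3) ** 2 + 2

def analytic_derivative(w):
    # J'(w) = 2w - 6, as derived above.
    return 2 * w - 6

def numerical_derivative(f, w, h=1e-6):
    # Central-difference approximation of the slope of f at w.
    return (f(w + h) - f(w - h)) / (2 * h)

for w in [0.0, 1.5, 3.0, 5.0]:
    print(w, analytic_derivative(w), round(numerical_derivative(cost, w), 6))
# The two slope columns agree: e.g. both give -6.0 at w = 0.0 and 0.0 at w = 3.0.
```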
To find potential minima or maxima, we look for points where the slope is zero. We set the first derivative equal to zero and solve for w:
$$J'(w) = 0$$
$$2w - 6 = 0$$
$$2w = 6$$
$$w = 3$$
This tells us that w=3 is a critical point where the tangent line to the function is horizontal.
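If you prefer to let software do the algebra, a symbolic library such as SymPy (an optional tool, not required by the text) can reproduce this result. A minimal sketch:

```python
import sympy as sp

w = sp.symbols('w')
J = (w - 3) ** 2 + 2

# Differentiate, then solve J'(w) = 0 for the critical point.
J_prime = sp.diff(J, w)                      # 2*w - 6
critical_points = sp.solve(sp.Eq(J_prime, 0), w)
print(J_prime, critical_points)              # prints: 2*w - 6 [3]
```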
Is w=3 a minimum, maximum, or something else? We can use the second derivative test. Let's find the second derivative, J′′(w):
$$J''(w) = \frac{d^2 J}{dw^2} = \frac{d}{dw}\bigl(J'(w)\bigr) = \frac{d}{dw}(2w - 6)$$
$$J''(w) = 2$$
Now, we evaluate the second derivative at our critical point w=3:
$$J''(3) = 2$$
Since J′′(3)=2>0, the function is concave up at w=3. This confirms that w=3 corresponds to a local minimum. Because our cost function is a simple parabola opening upwards, this local minimum is also the global minimum.
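Continuing the SymPy sketch from above, the second derivative test can also be checked symbolically; again, this is only an illustrative aid:

```python
import sympy as sp

w = sp.symbols('w')
J = (w - 3) ** 2 + 2

# The second derivative is a positive constant, so J is concave up everywhere
# and the critical point at w = 3 is a global minimum.
J_double_prime = sp.diff(J, w, 2)
print(J_double_prime)   # prints: 2
```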
This simple example illustrates the core idea behind optimizing many machine learning models: define a cost function that measures prediction error, use derivatives to find the parameter values where the slope is zero, and confirm that those values correspond to a minimum.
In real machine learning problems, cost functions often depend on millions of parameters, not just one. Furthermore, finding the minimum by directly setting the derivative (or gradient, in the multivariable case) to zero can be computationally infeasible or impossible if the function is very complex.
This is why we often rely on iterative algorithms like gradient descent (which we'll explore in detail in Chapter 4). Rather than solving for the minimum directly, gradient descent uses the derivative to determine the direction in which to adjust the parameters, moving step by step towards the minimum cost. However, the fundamental concept remains the same: derivatives guide the optimization process.
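As a preview of that idea, here is a minimal gradient-descent sketch on our one-parameter cost function. The learning rate of 0.1, the starting point, and the number of steps are arbitrary choices for illustration, not prescribed values:

```python
def cost(w):
    return (w - 3) ** 2 + 2

def derivative(w):
    return 2 * w - 6

w = 0.0               # arbitrary starting guess
learning_rate = 0.1   # step size; an illustrative choice

# Repeatedly step opposite to the slope, which decreases the cost.
for step in range(50):
    w = w - learning_rate * derivative(w)

print(f"w after 50 steps: {w:.4f}, cost: {cost(w):.4f}")
# w converges towards 3, where the cost approaches its minimum value of 2.
```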
Next, we will extend these ideas to functions with multiple input variables, which is essential for handling the complexity of real-world machine learning models.