Okay, we've established that the derivative, f′(x), tells us the slope of a function f(x) at any given point x. Think about what it means if the slope is zero. Visually, on a graph, a zero slope corresponds to a flat tangent line. Where do these usually occur? They often happen at the very bottom of a valley (a minimum) or the very peak of a hill (a maximum) on the function's graph. This insight is fundamental to optimization.
Points where the derivative f′(x) is equal to zero, or where f′(x) is undefined, are called critical points. These points are the primary candidates for local minima and maxima. A local minimum is a point lower than all its immediate neighbors, while a local maximum is a point higher than its immediate neighbors.
Why focus on f′(x)=0? Because if a function is smooth and differentiable, the only way it can transition from increasing (positive slope) to decreasing (negative slope), forming a peak, or from decreasing to increasing, forming a valley, is by passing through a point where the slope is momentarily zero.
Consider the function f(x)=x². Its derivative is f′(x)=2x. Setting f′(x)=0 gives 2x=0, so x=0 is the only critical point. We know intuitively that x=0 is the minimum of the parabola y=x².
The function f(x)=x² has a minimum at x=0. At this point, the tangent line is horizontal, indicating its slope, f′(0), is 0.
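If you'd like to check this symbolically, here is a minimal sketch using SymPy (assuming the library is installed; it is not part of the text above): differentiate x², then solve for where the derivative equals zero.

```python
# Minimal SymPy sketch (assumes sympy is installed): find the critical
# point of f(x) = x^2 by solving f'(x) = 0 symbolically.
import sympy as sp

x = sp.symbols('x')
f = x**2

f_prime = sp.diff(f, x)                  # derivative: 2*x
critical_points = sp.solve(f_prime, x)   # solve 2*x = 0

print(f_prime)           # prints: 2*x
print(critical_points)   # prints: [0]
```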
Knowing a point is critical isn't enough. Is it a minimum, a maximum, or something else (like a flat spot in an otherwise increasing function, e.g., f(x)=x³ at x=0)? The First Derivative Test helps us classify these points by looking at how the slope changes around the critical point.
The logic is straightforward:
- If f′(x) changes sign from positive to negative as x passes through a critical point, the function switches from increasing to decreasing, so that point is a local maximum.
- If f′(x) changes sign from negative to positive, the function switches from decreasing to increasing, so that point is a local minimum.
- If f′(x) does not change sign, the point is neither a maximum nor a minimum (as with f(x)=x³ at x=0).
Example: Let's analyze f(x)=x³−3x. The derivative is f′(x)=3x²−3. Set f′(x)=0:

3x²−3=0
3(x²−1)=0
3(x−1)(x+1)=0

The critical points are x=1 and x=−1.
Now, let's test the sign of f′(x)=3(x−1)(x+1) in the intervals defined by these points: (−∞,−1), (−1,1), and (1,∞).
Analysis based on sign changes at the critical points:
- On (−∞,−1), pick x=−2: f′(−2)=3(−3)(−1)=9>0, so f is increasing.
- On (−1,1), pick x=0: f′(0)=3(−1)(1)=−3<0, so f is decreasing.
- On (1,∞), pick x=2: f′(2)=3(1)(3)=9>0, so f is increasing.
At x=−1, f′ switches from positive to negative, so f has a local maximum there, with f(−1)=(−1)³−3(−1)=2. At x=1, f′ switches from negative to positive, so f has a local minimum there, with f(1)=1−3=−2.
The sign of the first derivative f′(x) tells us if the function is increasing or decreasing. Changes in sign at critical points (x=−1,x=1) reveal local maxima and minima.
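The same sign check can be scripted. Here is a rough illustration (again assuming SymPy; the test values −2, 0, and 2 are arbitrary picks from each interval, not part of the text):

```python
# Sketch of the First Derivative Test for f(x) = x^3 - 3x:
# find the critical points, then sample the sign of f' in each interval.
import sympy as sp

x = sp.symbols('x')
f = x**3 - 3*x
f_prime = sp.diff(f, x)                                   # 3*x**2 - 3

print("critical points:", sorted(sp.solve(f_prime, x)))   # [-1, 1]

# One test value per interval: (-inf, -1), (-1, 1), (1, inf)
for test_value in [-2, 0, 2]:
    slope = f_prime.subs(x, test_value)
    trend = "increasing" if slope > 0 else "decreasing"
    print(f"f'({test_value}) = {slope}  ->  f is {trend}")
```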
Computing one more derivative can sometimes be easier than testing the sign of f′(x) on intervals around every critical point. The Second Derivative Test provides an alternative way to classify critical points, leveraging the function's concavity at those points.
Recall that the second derivative, f′′(x), is the derivative of the first derivative f′(x). It tells us the rate of change of the slope.
How does this relate to minima and maxima? Consider a critical point c where the tangent line is horizontal, meaning f′(c)=0.
- If f′′(c)>0, the function is concave up (shaped like a cup) around c, so c is a local minimum.
- If f′′(c)<0, the function is concave down (shaped like a cap) around c, so c is a local maximum.
- If f′′(c)=0, the test is inconclusive, and we fall back on the First Derivative Test.
Example: Let's revisit f(x)=x³−3x. We previously found f′(x)=3x²−3 and critical points x=1,x=−1. Now, find the second derivative by differentiating f′(x): f′′(x)=d/dx(3x²−3)=6x.
Let's evaluate f′′(x) at the critical points:
- At x=−1: f′′(−1)=6(−1)=−6<0. The function is concave down here, so x=−1 is a local maximum.
- At x=1: f′′(1)=6(1)=6>0. The function is concave up here, so x=1 is a local minimum.
These conclusions match those from the First Derivative Test, and in this case, the second derivative was very simple to compute and evaluate.
The sign of the second derivative f′′(x) indicates concavity. At critical points where f′(x)=0, positive concavity (f′′>0) implies a local minimum, while negative concavity (f′′<0) implies a local maximum. Where f′′(x)=0 (at x=0 here), the concavity changes, defining an inflection point.
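A short sketch (again assuming SymPy) automates the same check: compute f′′, evaluate it at each critical point, and read off the concavity.

```python
# Sketch of the Second Derivative Test for f(x) = x^3 - 3x.
import sympy as sp

x = sp.symbols('x')
f = x**3 - 3*x
f_prime = sp.diff(f, x)             # 3*x**2 - 3
f_double_prime = sp.diff(f, x, 2)   # 6*x

for c in sorted(sp.solve(f_prime, x)):      # critical points: -1 and 1
    curvature = f_double_prime.subs(x, c)
    if curvature > 0:
        kind = "local minimum (concave up)"
    elif curvature < 0:
        kind = "local maximum (concave down)"
    else:
        kind = "inconclusive (f'' = 0)"
    print(f"x = {c}: f''({c}) = {curvature}  ->  {kind}")
```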
Why have we spent this time finding minima and maxima? Because this is the core mathematical idea behind training many machine learning models. Models learn by trying to minimize a cost function (also called a loss function or objective function). This function quantifies how well the model's predictions match the actual target values in the training data. A high cost means poor performance. A low cost means the model is doing well.
The process of training involves adjusting the model's internal parameters (like the slope and intercept in linear regression, or the weights and biases in neural networks) to find the set of parameter values that results in the minimum possible value of the cost function.
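To make this concrete, here is a toy sketch (the data, the one-parameter model y = w·x, and the choice of mean squared error are illustrative assumptions, not taken from the text): the cost is a function of the parameter w, and training means finding the w that makes it as small as possible.

```python
# Toy example: a one-parameter model y_hat = w * x scored by mean squared
# error. The cost depends on the parameter w; a better w gives a lower cost.
xs = [1.0, 2.0, 3.0]   # toy inputs
ys = [2.0, 4.0, 6.0]   # toy targets, generated by y = 2x

def cost(w):
    """Mean squared error of the predictions w * x against the targets."""
    squared_errors = [(w * x - y) ** 2 for x, y in zip(xs, ys)]
    return sum(squared_errors) / len(squared_errors)

print(cost(0.5))   # poor fit  -> high cost (10.5)
print(cost(2.0))   # exact fit -> cost of 0.0
```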
In simple cases, like some forms of linear regression, we might be able to find this minimum analytically by setting the derivative (or gradient, in higher dimensions) of the cost function to zero and solving, just like we found critical points here. However, for most complex models, especially in deep learning, the cost function landscape is incredibly complicated. We can't just solve f′(x)=0. Instead, we use iterative algorithms like gradient descent (which we will cover in detail in Chapter 4). These algorithms use the derivative (gradient) at the current position to figure out which direction is "downhill" towards a minimum and take a small step in that direction, repeating the process many times.
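As a small preview of Chapter 4, here is a minimal gradient descent sketch on the same f(x)=x³−3x analyzed above (the starting point, learning rate, and number of steps are arbitrary choices for illustration): each update moves a small step in the downhill direction, opposite to the sign of f′(x).

```python
# Minimal gradient descent sketch: repeatedly step opposite to the slope
# f'(x) = 3x^2 - 3 of f(x) = x^3 - 3x, starting near the local minimum.
def f_prime(x):
    return 3 * x**2 - 3

x = 2.0               # starting guess (illustrative)
learning_rate = 0.05  # step size (illustrative)

for _ in range(50):
    x -= learning_rate * f_prime(x)   # move "downhill"

print(x)   # approaches 1.0, the local minimum found earlier
```

Note that starting well to the left of x=−1 would send this update rule off toward −∞, since this particular cubic has no global minimum; the cost functions used in practice are designed to be bounded below.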
Understanding how the first and second derivatives identify minima and maxima for single-variable functions is the essential building block for understanding these powerful optimization techniques used throughout machine learning. You now have the calculus tools to analyze the shape of simple functions and pinpoint their low points, directly mirroring the goal of minimizing a cost function. Next, we'll put this into practice by optimizing a basic cost function and using Python tools to help with the calculations.