Okay, let's revisit our simple linear model from the previous section: y=mx+b. We have some data points, such as pairs of (study hours, exam score), and we want to find the line that best represents the relationship between these two quantities. The question is, what does "best" actually mean in mathematical terms?
Imagine you have a handful of data points plotted on a graph. You can draw many different lines through these points. Some lines will pass very close to the points, while others will be quite far off. We need a way to measure how well a specific line, defined by a particular slope m and intercept b, fits our data. This measure is what we call a cost function (or sometimes a loss function or objective function).
The core idea is to quantify the "error" or "mistake" our line makes for each data point. Let's say we have n data points: $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$. For any given input $x_i$, our current line y=mx+b predicts an output value. Let's call this prediction $\hat{y}_i$ (pronounced "y-hat"):

$$\hat{y}_i = m x_i + b$$

The actual output value for this data point is $y_i$. The difference between the actual value and the predicted value, $y_i - \hat{y}_i$, tells us how far off our prediction was for that specific point. This difference is often called the residual.
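To make this concrete, here is a minimal sketch in Python (using NumPy) that computes the predictions and residuals for one candidate line. The data values and the choice of m and b are invented purely for illustration.

```python
import numpy as np

# Invented sample data: study hours (x) and the corresponding exam scores (y).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([52.0, 61.0, 70.0, 83.0])

# One arbitrary candidate line y = m*x + b.
m, b = 10.0, 40.0

y_hat = m * x + b       # predictions for every x_i
residuals = y - y_hat   # residuals y_i - y_hat_i

print(y_hat)      # [50. 60. 70. 80.]
print(residuals)  # [2. 1. 0. 3.]
```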
If the actual value $y_i$ is larger than the prediction $\hat{y}_i$, the residual is positive. If the prediction is larger than the actual value, the residual is negative. If we just add up these residuals across all data points, the positive and negative errors might cancel each other out, giving us a misleadingly small total error even if the line is a poor fit.
To avoid this cancellation and ensure that larger errors contribute more to our total measure of "badness", we typically square each residual: $(y_i - \hat{y}_i)^2$. Squaring makes all errors positive and penalizes larger deviations more significantly than smaller ones (e.g., an error of 2 becomes 4, while an error of 3 becomes 9).
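As a quick illustration of why squaring matters, consider the following sketch, where the residual values are invented purely to make the cancellation obvious.

```python
import numpy as np

# Residuals from a hypothetical, poorly fitting line: the signs happen to cancel.
residuals = np.array([-3.0, 3.0, -2.0, 2.0])

print(residuals.sum())         # 0.0  -> summing raw residuals hides the errors
print((residuals ** 2).sum())  # 26.0 -> squaring makes every error count
```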
Now that we have a measure of error for each individual point, we need a single number to represent the overall error of our line across all n data points. A very common approach is to calculate the average of these squared errors. This gives us the Mean Squared Error (MSE) cost function.
Mathematically, we define the Mean Squared Error cost function, which we'll call J, as a function of our model parameters m and b:
$$J(m, b) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Let's break this down:

- J(m,b) is the total cost, written as a function of the slope m and intercept b we are evaluating.
- n is the number of data points, and the summation $\sum_{i=1}^{n}$ adds up one term for each point.
- $(y_i - \hat{y}_i)^2$ is the squared residual for the i-th point, as defined above.
- Dividing the sum by n turns the total into an average: the mean of the squared errors.
We can substitute the definition of $\hat{y}_i$ directly into the cost function formula:

$$J(m, b) = \frac{1}{n} \sum_{i=1}^{n} \big(y_i - (m x_i + b)\big)^2$$

This formula gives us a single number, J, that tells us how badly our line y=mx+b fits the data on average. A smaller value of J means the line's predictions are, on average, closer to the actual data points, indicating a better fit. A larger value means a poorer fit.
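As a sketch, the whole cost function fits in a few lines of Python; the data and parameter values below reuse the invented examples from earlier, so the exact numbers are only illustrative.

```python
import numpy as np

def mse_cost(m, b, x, y):
    """Mean Squared Error of the line y = m*x + b on the data (x, y)."""
    y_hat = m * x + b                 # predictions for every x_i
    return np.mean((y - y_hat) ** 2)  # average of the squared residuals

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([52.0, 61.0, 70.0, 83.0])

print(mse_cost(10.0, 40.0, x, y))  # 3.5   -> predictions close to the data
print(mse_cost(5.0, 30.0, x, y))   # 611.0 -> a much poorer fit
```

The lower cost for the first pair of parameters reflects exactly what the formula encodes: that line's predictions sit closer to the actual values on average.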
This plot shows sample data points (blue dots), a possible linear model (red line), and the vertical errors (residuals, dotted gray lines) between the actual data points and the model's predictions. The MSE cost function calculates the average of the squares of the lengths of these dotted lines.
Our optimization goal is now concrete: find the specific values of m and b that result in the lowest possible value for J(m,b). Minimizing this cost function means finding the line that minimizes the average squared vertical distance to our data points.
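One crude way to build intuition for this goal, though not the method we will actually use, is a brute-force search: evaluate J over a grid of candidate (m, b) pairs and keep the cheapest one. The data, grid ranges, and step size below are arbitrary choices for illustration.

```python
import numpy as np

def mse_cost(m, b, x, y):
    return np.mean((y - (m * x + b)) ** 2)

# Same invented data as in the earlier sketches.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([52.0, 61.0, 70.0, 83.0])

# Scan a coarse grid of slopes and intercepts, keeping the pair with the lowest cost.
best_m, best_b = min(
    ((m, b) for m in np.arange(0.0, 20.0, 0.5) for b in np.arange(0.0, 60.0, 0.5)),
    key=lambda params: mse_cost(params[0], params[1], x, y),
)
print(best_m, best_b, mse_cost(best_m, best_b, x, y))  # 10.0 41.5 1.25 for this data and grid
```

This works for two parameters on a toy dataset, but it scales poorly. Gradient descent, which we set up next, uses the slope of J itself to move toward the minimum directly instead of testing thousands of candidates.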
Why choose Mean Squared Error? Besides penalizing larger errors more, it turns out that MSE has favorable mathematical properties. Specifically, it's a smooth, continuous function, and importantly, we can easily calculate its derivatives with respect to m and b. As we saw in previous chapters, these derivatives (the gradient) are exactly what we need to guide the gradient descent algorithm towards the minimum cost.
In the next section, we'll take this cost function J(m,b) and apply the calculus we've learned to calculate its gradient. This will be the engine that drives our optimization process.