Okay, we know that linear regression tries to find the best straight line through our data points. But what exactly makes a line "best"? How do we quantify how well a particular line fits the data? We need a way to measure the error, or the "cost," associated with a given line. This measurement helps us compare different possible lines and tells the learning algorithm how to adjust the line to improve its fit.
Think about it this way: for any line we draw through our data, some points will be close to the line, and others might be further away. The distance between an actual data point and the point predicted by our line represents an error for that specific prediction.
Let's consider a single data point in our training set, represented by its feature(s) $x_i$ and its actual target value $y_i$. If our current linear regression model predicts a value $\hat{y}_i$ (pronounced "y-hat") for this input $x_i$, the error for this single prediction is simply the difference between the actual value and the predicted value:
$$\text{Error} = \text{Actual Value} - \text{Predicted Value} = y_i - \hat{y}_i$$
This difference is often called the residual. A positive residual means the prediction was too low, and a negative residual means the prediction was too high. A residual of zero means the prediction was perfect for that data point.
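To make this concrete, here is a small sketch that computes the residual for each point in a tiny hypothetical dataset, using an arbitrarily chosen line (the values of `x`, `y`, and the parameters are made up for illustration):

```python
# Residuals for a hypothetical dataset and a hypothetical line y_hat = 0.5 + 2.0 * x
x = [1.0, 2.0, 3.0]
y = [3.0, 4.0, 7.0]          # actual target values

theta0, theta1 = 0.5, 2.0    # hypothetical intercept and slope

for xi, yi in zip(x, y):
    y_hat = theta0 + theta1 * xi   # value predicted by the line
    residual = yi - y_hat          # actual minus predicted
    print(f"x={xi}: actual={yi}, predicted={y_hat}, residual={residual}")
```

Notice that the residuals come out positive where the line underestimates and negative where it overestimates, exactly as described above.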
*Figure: The vertical dashed line shows the error (residual) for one data point: the difference between the actual value (blue dot) and the value predicted by the line (point on the gray line).*
We need a single number that summarizes the total error across all the data points in our training set. Simply summing the individual errors $(y_i - \hat{y}_i)$ isn't very useful, because positive and negative errors could cancel each other out, giving us a misleadingly small total error even if the line is a poor fit.
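The cancellation problem is easy to demonstrate with two made-up sets of residuals: one from a good fit, one from a terrible fit, both summing to zero:

```python
# Two hypothetical residual lists with the same raw sum but very different fits
residuals_good = [0.1, -0.1, 0.05, -0.05]   # all predictions nearly perfect
residuals_bad = [5.0, -5.0, 2.5, -2.5]      # large errors that happen to cancel

print(sum(residuals_good))   # 0.0: looks like a perfect fit
print(sum(residuals_bad))    # also 0.0, despite huge individual errors

# Squaring each error first removes the cancellation
print(sum(r ** 2 for r in residuals_good))  # small
print(sum(r ** 2 for r in residuals_bad))   # large
```

Squaring before summing is what distinguishes the two cases, which motivates the cost function below.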
A common approach is to:

1. Square each individual error, so every term is non-negative and larger errors are penalized more heavily.
2. Sum the squared errors over all $N$ training examples.
3. Divide by $N$ to take the average.

This gives us the Mean Squared Error (MSE), a very popular cost function for regression problems.
The formula for MSE is:
$$J(\theta_0, \theta_1) = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2$$
Let's break this down:

- $J(\theta_0, \theta_1)$ is the cost, written as a function of the line's parameters: the intercept $\theta_0$ and the slope $\theta_1$.
- $N$ is the number of examples in the training set.
- $\hat{y}_i$ is the model's prediction for the $i$-th example, and $y_i$ is that example's actual target value.
- $(\hat{y}_i - y_i)^2$ is the squared error for a single example; summing these and dividing by $N$ gives the mean.
Sometimes, particularly in statistical contexts or other courses, you might see the formula written with $\frac{1}{2N}$ instead of $\frac{1}{N}$. The extra factor of $\frac{1}{2}$ is included for mathematical convenience: it cancels against the 2 that appears when taking derivatives later (specifically for Gradient Descent), and it doesn't change the location of the minimum error. For understanding the concept, $\frac{1}{N}$, representing the mean, is often clearer.
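The formula translates almost line-for-line into code. This sketch reuses the same hypothetical dataset and line parameters as earlier (all values are made up for illustration):

```python
# A minimal MSE implementation following the formula above
def mse(theta0, theta1, x, y):
    n = len(x)
    total = 0.0
    for xi, yi in zip(x, y):
        y_hat = theta0 + theta1 * xi   # model prediction for this example
        total += (y_hat - yi) ** 2     # squared error for this example
    return total / n                   # mean over all N examples

x = [1.0, 2.0, 3.0]
y = [3.0, 4.0, 7.0]
print(mse(0.5, 2.0, x, y))  # 0.25 for this particular line
```

Because every term is squared, the result is always non-negative, and it only reaches 0 when every prediction is exact.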
The Mean Squared Error gives us a single, positive value representing how well our line, defined by specific parameters θ0 and θ1, fits the overall data. A perfect fit would have an MSE of 0 (though this rarely happens with real-world data). A line that fits poorly will have a large MSE.
Therefore, the goal of our learning algorithm is to find the values of θ0 and θ1 that result in the lowest possible MSE.
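As a crude illustration of what "find the parameters with the lowest MSE" means, here is a brute-force search over a handful of candidate slopes with the intercept held fixed (the dataset and candidate values are hypothetical; Gradient Descent, covered next, does this far more efficiently):

```python
# Brute-force illustration: pick the candidate slope with the lowest MSE
def mse(theta0, theta1, x, y):
    return sum((theta0 + theta1 * xi - yi) ** 2 for xi, yi in zip(x, y)) / len(x)

x = [1.0, 2.0, 3.0]
y = [3.0, 4.0, 7.0]

candidate_slopes = [1.0, 1.5, 2.0, 2.5, 3.0]
best = min(candidate_slopes, key=lambda t1: mse(0.5, t1, x, y))
print(best)  # 2.0: the candidate slope with the lowest MSE
```

A real search would vary both $\theta_0$ and $\theta_1$ over continuous values, which is exactly the problem Gradient Descent solves without enumerating candidates.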
Minimizing this cost function means finding the line that, on average, makes the smallest squared errors when predicting the target values in our training data. In the next section on Gradient Descent, we'll see how the algorithm systematically adjusts θ0 and θ1 to reduce the value of this cost function, effectively finding the best-fitting line.
© 2025 ApX Machine Learning