Linear regression aims to find the straight line that best represents the relationship within a dataset. But what criterion defines a "best" line? How can we objectively quantify the fit of a particular line to the data points? A method is required to measure the error, often referred to as "cost," associated with any given line. This measurement enables comparison between different potential lines and guides the learning algorithm in adjusting the line for a better fit.
Think about it this way: for any line we draw through our data, some points will be close to the line, and others might be further away. The distance between an actual data point and the point predicted by our line represents an error for that specific prediction.
Let's consider a single data point in our training set, represented by its feature(s) $x_i$ and its actual target value $y_i$. If our current linear regression model predicts a value $\hat{y}_i$ (pronounced "y-hat") for this input $x_i$, the error for this single prediction is simply the difference between the actual value and the predicted value:
$$\text{Error} = \text{Actual Value} - \text{Predicted Value} = y_i - \hat{y}_i$$
This difference is often called the residual. A positive residual means the prediction was too low, and a negative residual means the prediction was too high. A residual of zero means the prediction was perfect for that data point.
[Figure: the vertical dashed line shows the error (residual) for one data point, namely the difference between the actual value (blue dot) and the value predicted by the line (point on the gray line).]
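To make this concrete, here is a minimal Python sketch computing the residual for a single data point. The data point and the line's parameters (intercept and slope) are hypothetical values chosen for illustration:

```python
# One hypothetical training example.
x_i = 3.0    # feature value
y_i = 10.0   # actual target value

# Assumed line parameters: intercept and slope (illustrative only).
theta_0, theta_1 = 1.0, 2.5
y_hat_i = theta_0 + theta_1 * x_i   # prediction: 8.5

residual = y_i - y_hat_i
print(residual)   # 1.5 -> positive, so the prediction was too low
```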
We need a single number that summarizes the total error across all the data points in our training set. Simply summing the individual errors $(y_i - \hat{y}_i)$ isn't very useful, because positive and negative errors can cancel each other out, giving us a misleadingly small total error even if the line is a poor fit.
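A quick numerical sketch shows the problem. Here two predictions are each off by 2, yet their raw errors sum to zero; squaring the errors (as the cost function below does) exposes the poor fit. The values are hypothetical:

```python
actual    = [4.0, 8.0]
predicted = [6.0, 6.0]   # one prediction 2 too high, one 2 too low

residuals = [y - y_hat for y, y_hat in zip(actual, predicted)]
print(sum(residuals))                 # 0.0 -- looks perfect, but is not
print(sum(r**2 for r in residuals))   # 8.0 -- squaring reveals the error
```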
A common approach is to:

1. Square each individual error, $(y_i - \hat{y}_i)^2$, so that positive and negative errors can no longer cancel.
2. Sum these squared errors across all $N$ examples in the training set.
3. Divide by $N$ to take the average.

This gives us the Mean Squared Error (MSE), a very popular cost function for regression problems.
The formula for MSE is:
$$J(\theta_0, \theta_1) = \frac{1}{N} \sum_{i=1}^{N} \left(\hat{y}_i - y_i\right)^2$$
Let's break this down:

- $J(\theta_0, \theta_1)$: the cost, written as a function of the line's parameters $\theta_0$ (the intercept) and $\theta_1$ (the slope). Different parameter values give different costs.
- $N$: the number of examples in the training set.
- $\hat{y}_i$: the value the line predicts for the $i$-th example.
- $y_i$: the actual target value for the $i$-th example.
- $(\hat{y}_i - y_i)^2$: the squared error for the $i$-th example.
- $\frac{1}{N}\sum_{i=1}^{N}$: sum the squared errors over all examples and take their mean.
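Putting the pieces together, here is a short sketch of the MSE formula in Python with NumPy. The training data and parameter values are made up for illustration:

```python
import numpy as np

def mse(theta_0, theta_1, x, y):
    """Mean Squared Error J(theta_0, theta_1) for the line y_hat = theta_0 + theta_1 * x."""
    y_hat = theta_0 + theta_1 * x       # predictions for every training example
    return np.mean((y_hat - y) ** 2)    # average of the squared errors

# Hypothetical training data: roughly y = 2x + 1 plus noise.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 8.8])

print(mse(1.0, 2.0, x, y))   # ~0.025: this line fits well
print(mse(0.0, 0.0, x, y))   # ~40.7: predicting 0 everywhere fits poorly
```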
Sometimes, particularly in statistical contexts or other courses, you might see the formula with $\frac{1}{2N}$ instead of $\frac{1}{N}$. The factor of 2 is added for mathematical convenience when calculating derivatives later (specifically for Gradient Descent), but it doesn't change the location of the minimum error. For understanding the concept, $\frac{1}{N}$, representing the mean, is often clearer.
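The following sketch illustrates why the extra factor is harmless. Scanning candidate slopes over toy data (with the intercept fixed at 0 for simplicity), both versions of the cost bottom out at exactly the same slope:

```python
import numpy as np

# Toy data lying exactly on the line y = 2x.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x

slopes = np.linspace(0.0, 4.0, 9)   # candidate slopes: 0.0, 0.5, ..., 4.0
cost_n  = np.array([np.mean((s * x - y) ** 2) for s in slopes])   # 1/N version
cost_2n = cost_n / 2.0                                            # 1/(2N) version

# Both cost curves reach their minimum at the same slope, theta_1 = 2.0.
print(slopes[np.argmin(cost_n)], slopes[np.argmin(cost_2n)])      # 2.0 2.0
```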
"The Mean Squared Error gives us a single, positive value representing how well our line, defined by specific parameters θ0 and θ1, fits the overall data. A perfect fit would have an MSE of 0 (though this rarely happens with data). A line that fits poorly will have a large MSE."
Therefore, the goal of our learning algorithm is to find the values of θ0 and θ1 that result in the lowest possible MSE.
Minimizing this cost function means finding the line that, on average, makes the smallest squared errors when predicting the target values in our training data. In the next section on Gradient Descent, we'll see how the algorithm systematically adjusts $\theta_0$ and $\theta_1$ to reduce the value of this cost function, effectively finding the best-fitting line.
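Before turning to Gradient Descent, a crude but instructive way to see this goal in action is a brute-force grid search: evaluate the MSE for many candidate $(\theta_0, \theta_1)$ pairs and keep the pair with the lowest cost. The data and grid ranges here are hypothetical:

```python
import numpy as np

# Hypothetical training data: roughly y = 2x + 1 plus noise.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 8.8])

best = (None, None, float("inf"))   # (theta_0, theta_1, cost)
for t0 in np.linspace(-2.0, 4.0, 61):       # candidate intercepts
    for t1 in np.linspace(-1.0, 5.0, 61):   # candidate slopes
        cost = np.mean((t0 + t1 * x - y) ** 2)
        if cost < best[2]:
            best = (t0, t1, cost)

print(best)   # parameters near theta_0 = 1, theta_1 = 2 give the lowest MSE
```

Gradient Descent reaches the same answer far more efficiently, without exhaustively trying every combination.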