Okay, let's refine our understanding of what happens when a regression model makes a prediction. As mentioned in the introduction to this chapter, regression tasks involve predicting a continuous numerical value. Think about predicting things like:

- the selling price of a specific house,
- the temperature on a particular day,
- the monthly rent of an apartment.
In each case, the model takes some input information and produces a single number as its output, its best guess for the quantity we're interested in.
For any given input instance (like a specific house or a particular day's weather pattern), there are two values we care about:

- the actual value $y$: the true quantity observed in the real world, and
- the predicted value $\hat{y}$: the model's output for that instance.
Ideally, the predicted value $\hat{y}$ would be exactly equal to the actual value $y$. However, in practice, models are rarely perfect. There will almost always be some difference between the prediction and the reality.
This difference between the actual value and the predicted value is called the error or sometimes the residual. We calculate it simply as:
$$\text{Error} = \text{Actual Value} - \text{Predicted Value}$$

$$e = y - \hat{y}$$

A positive error means the model's prediction was too low. A negative error means the prediction was too high. An error of zero means the prediction was perfect for that specific instance.
Let's consider a simple example: predicting apartment rent based on square footage.
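As a minimal sketch of this idea (the rent figures below are invented for illustration), computing the error for a single apartment looks like this:

```python
# Hypothetical numbers: suppose a model predicts monthly rent
# (in dollars) from an apartment's square footage.
actual_rent = 1450     # y: the true rent of a 700 sq ft apartment
predicted_rent = 1380  # y_hat: the model's prediction for it

# e = y - y_hat
error = actual_rent - predicted_rent
print(error)  # 70 -> positive, so the prediction was too low
```

Because the error is positive, we know the model undershot the true rent by $70 for this particular apartment.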
Models make errors because real-world data is complex and often contains inherent randomness or "noise". Also, models are usually simplifications of reality. They learn patterns from the training data, but these patterns might not perfectly capture every nuance needed for exact predictions on new, unseen data.
The goal of evaluating a regression model isn't necessarily to achieve zero error on every single prediction (which is often impossible), but rather to understand the typical magnitude and characteristics of these errors across a whole dataset.
The plot below illustrates this concept. The blue dots represent actual data points (e.g., actual rent vs. square footage). The orange line represents the predictions made by a simple linear regression model. The dotted vertical gray lines show the error (or residual) for each point – the distance between the actual value (dot) and the predicted value (point on the line).
A simple linear regression model attempting to predict a target value based on a single feature. The blue dots are the actual data points, the orange line shows the model's predictions, and the dotted gray lines represent the prediction errors (residuals) for each point.
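The setup behind such a plot can be sketched in a few lines of NumPy. The data points below are made up for illustration; `np.polyfit` fits the least-squares line, and the residuals are computed exactly as defined above:

```python
import numpy as np

# Made-up data: square footage (feature) and monthly rent (target).
sqft = np.array([500, 650, 700, 850, 900, 1100], dtype=float)
rent = np.array([1100, 1300, 1450, 1600, 1620, 1950], dtype=float)

# Fit a simple linear regression (least-squares line) with NumPy.
slope, intercept = np.polyfit(sqft, rent, deg=1)

# Points on the line are the model's predictions; the residual
# for each instance is e = y - y_hat.
predicted = slope * sqft + intercept
residuals = rent - predicted

print(residuals)        # one error per data point, mixed signs
print(residuals.sum())  # least-squares residuals sum to ~0
```

Note that for a least-squares fit with an intercept, the residuals always sum to (numerically) zero, even though individual errors can be large in either direction.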
When we evaluate a model using a test set (data the model hasn't seen during training), we calculate the error for each prediction. Since we'll have many errors (one for each instance in the test set), we need ways to summarize them into meaningful performance metrics. This is where metrics like Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared ($R^2$) come in. They provide different ways to quantify the overall accuracy and effectiveness of our regression model, which we will explore in the following sections.
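As a preview of those sections, here is a sketch of how per-instance errors get summarized using the standard definitions of these metrics. The test-set values are hypothetical:

```python
import math

# Hypothetical test-set values; the numbers are invented for illustration.
actual    = [1450, 1200, 1800, 1600]
predicted = [1380, 1250, 1750, 1700]

# One error per instance: e = y - y_hat.
errors = [y - y_hat for y, y_hat in zip(actual, predicted)]

# Standard ways to summarize the errors across the whole test set:
mae = sum(abs(e) for e in errors) / len(errors)   # Mean Absolute Error
mse = sum(e * e for e in errors) / len(errors)    # Mean Squared Error
rmse = math.sqrt(mse)                             # Root Mean Squared Error

mean_y = sum(actual) / len(actual)
ss_res = sum(e * e for e in errors)
ss_tot = sum((y - mean_y) ** 2 for y in actual)
r2 = 1 - ss_res / ss_tot                          # R-squared

print(mae, rmse, r2)
```

Notice that each metric collapses the whole list of signed errors into one number, which is exactly what lets us compare models or track improvement over time.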
© 2025 ApX Machine Learning