Okay, you've trained your first linear regression model using Scikit-learn. It ran without errors, and you even looked at the coefficients using model.coef_ and model.intercept_. But how good is the model, really? Simply fitting a model isn't enough; we need to objectively measure its performance. This is where evaluation metrics come in.
For regression tasks, where we predict continuous values, evaluation metrics quantify how close our model's predictions ($\hat{y}$) are to the actual target values ($y$). The difference between an actual value and a predicted value, $y_i - \hat{y}_i$, is called the residual or error for that data point. Regression metrics are typically based on aggregating these residuals across the entire dataset (or a specific subset like a test set).
Let's look at some of the most common metrics used to evaluate regression models.
Mean Absolute Error (MAE)
The Mean Absolute Error (MAE) is one of the simplest and most intuitive metrics. It measures the average absolute difference between the predicted values and the actual values.
The formula for MAE is:
$$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert$$
where:
$n$ is the number of data points.
$y_i$ is the actual value for the $i$-th data point.
$\hat{y}_i$ is the predicted value for the $i$-th data point.
$\lvert \cdot \rvert$ denotes the absolute value.
Interpretation:
Units: MAE is in the same units as the target variable (e.g., if you're predicting house prices in dollars, the MAE will also be in dollars).
Scale: MAE ranges from 0 to ∞. A value closer to 0 indicates better model performance, meaning the average prediction error is small.
Sensitivity to Outliers: Because it uses the absolute difference, MAE is less sensitive to large errors (outliers) compared to metrics that square the errors. It treats all errors linearly based on their magnitude.
MAE gives a straightforward average error magnitude. If your model has an MAE of 10, it means, on average, its predictions are off by 10 units from the actual values.
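To make the formula concrete, here is a minimal NumPy sketch of the MAE calculation; the y_true and y_pred arrays are made-up illustrative values, not output from any real model.

```python
import numpy as np

# Hypothetical actual and predicted values (illustrative only)
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 195.0, 270.0])

# MAE: average of the absolute residuals |y_i - y_hat_i|
mae = np.mean(np.abs(y_true - y_pred))
print(mae)  # (10 + 10 + 5 + 20) / 4 = 11.25
```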
Mean Squared Error (MSE)
The Mean Squared Error (MSE) is another widely used metric. Instead of taking the absolute difference, it calculates the average of the squared differences between predictions and actual values.
The formula for MSE is:
$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
Interpretation:
Units: MSE is in the square of the units of the target variable (e.g., dollars squared for house prices). This makes it harder to interpret directly in the context of the original problem.
Scale: MSE ranges from 0 to ∞, with values closer to 0 being better.
Sensitivity to Outliers: Squaring the errors gives much more weight to larger errors. A prediction that is off by 10 units contributes 100 to the sum, while one off by 2 units contributes only 4. This means MSE heavily penalizes models that make large mistakes.
Mathematical Properties: Unlike the absolute value in MAE, which is not differentiable at zero, the squared term makes MSE smoothly differentiable everywhere. This is advantageous for the gradient-based optimization techniques used in training many models.
While less intuitive due to its squared units, MSE's sensitivity to large errors can be desirable if those errors are particularly costly or important to avoid.
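Reusing the same hypothetical arrays as in the MAE sketch, the example below shows how squaring inflates the contribution of the largest error:

```python
import numpy as np

# Same illustrative arrays as in the MAE sketch
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 195.0, 270.0])

# Squared residuals: the 20-unit miss contributes 400,
# dwarfing the 5-unit miss's 25
squared_errors = (y_true - y_pred) ** 2
print(squared_errors)  # [100. 100.  25. 400.]

mse = np.mean(squared_errors)
print(mse)  # 625 / 4 = 156.25
```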
Root Mean Squared Error (RMSE)
The Root Mean Squared Error (RMSE) is simply the square root of the MSE. It addresses the primary interpretation issue of MSE by bringing the metric back into the original units of the target variable.
The formula for RMSE is:
$$\text{RMSE} = \sqrt{\text{MSE}} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$
Interpretation:
Units: RMSE is in the same units as the target variable, similar to MAE.
Scale: RMSE ranges from 0 to ∞, with lower values indicating a better fit.
Sensitivity to Outliers: Like MSE, RMSE penalizes large errors more heavily than MAE due to the squaring step before the square root. However, its interpretation is often easier than MSE because the units match the target.
RMSE is arguably the most common regression metric. It measures the typical magnitude of the errors, expressed in the original units, while remaining sensitive to larger deviations. If your RMSE is 15, your model's predictions are typically around 15 units away from the actual values, with larger errors contributing disproportionately to the final figure compared to MAE.
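Sketching the calculation with the same hypothetical arrays shows how the square root brings the number back to the target's scale:

```python
import numpy as np

# Same illustrative arrays as in the earlier sketches
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 195.0, 270.0])

# RMSE: square root of the mean squared residual,
# expressed in the target's original units
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
print(rmse)  # sqrt(156.25) = 12.5
```

On this data the RMSE (12.5) exceeds the MAE (11.25), reflecting the extra weight the squaring step gives the single 20-unit miss.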
R-squared ($R^2$) Score (Coefficient of Determination)
Unlike the previous metrics, which measure error magnitude, the R-squared ($R^2$) score, also known as the coefficient of determination, measures the proportion of the variance in the target variable that is explained by the model's features.
It compares the model's error (the sum of squared errors, SSE, which equals $n \times \text{MSE}$) to the error of a baseline model that simply predicts the mean of the target variable ($\bar{y}$) for every input; this baseline error is the total sum of squares (SST).
The formula for R2 is:
$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = 1 - \frac{\text{SSE}}{\text{SST}}$$
where $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$ is the mean of the actual values.
Interpretation:
Units: $R^2$ is a unitless metric, representing a proportion.
Scale:
An $R^2$ of 1 indicates that the model perfectly predicts the target variable (explains 100% of the variance).
An $R^2$ of 0 indicates that the model performs no better than the baseline model that always predicts the mean of the target values.
An $R^2$ score can be negative. This happens when the model performs worse than the baseline mean model, meaning the model's predictions are actively harmful compared to just guessing the average.
Context: $R^2$ provides a relative measure of fit. A "good" $R^2$ value is context-dependent. In some fields (like physics), high $R^2$ values (e.g., > 0.95) are expected. In others (like social sciences), an $R^2$ of 0.3 might be considered informative. It doesn't tell you the average error magnitude, only how much better your model is than a naive mean prediction, relative to the total variance.
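Here is a sketch of the same calculation in code, reusing the hypothetical arrays from the earlier examples:

```python
import numpy as np

y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 195.0, 270.0])

# SSE: the model's sum of squared errors
sse = np.sum((y_true - y_pred) ** 2)
# SST: total sum of squares, i.e. the error of a baseline
# that always predicts the mean of y_true
sst = np.sum((y_true - np.mean(y_true)) ** 2)

r2 = 1 - sse / sst
print(r2)  # 1 - 625 / 12500 = 0.95
```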
Choosing the Right Metric
The choice of evaluation metric depends on your specific goals and the characteristics of your data:
Use MAE if you want an easily interpretable metric in the original units and if outliers are not a major concern or should not be disproportionately penalized.
Use RMSE if you want a metric in the original units but need to penalize larger errors more significantly. It's often a good default choice.
Use MSE primarily when its mathematical properties are advantageous (less common for final evaluation due to unit interpretation difficulty).
Use $R^2$ when you need to understand the proportion of variance explained by the model, providing a relative measure of goodness-of-fit. It's useful for comparing models' explanatory power but doesn't directly indicate the typical prediction error size.
Often, it's beneficial to look at multiple metrics to get a more complete picture of your model's performance. For instance, a high $R^2$ indicates your model captures the trend well, while a low MAE or RMSE confirms the typical prediction error is small.
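As a rough illustration of that practice, here is a small helper that reports all four metrics at once; regression_report is a hypothetical name used only for this sketch, and Scikit-learn's built-in equivalents are covered next:

```python
import numpy as np

def regression_report(y_true, y_pred):
    """Compute MAE, MSE, RMSE, and R^2 for one set of predictions.

    Hypothetical helper for illustration; Scikit-learn provides
    equivalent functions, covered in the next section.
    """
    residuals = y_true - y_pred
    mae = np.mean(np.abs(residuals))
    mse = np.mean(residuals ** 2)
    sst = np.sum((y_true - np.mean(y_true)) ** 2)
    r2 = 1 - np.sum(residuals ** 2) / sst
    return {"MAE": float(mae), "MSE": float(mse),
            "RMSE": float(np.sqrt(mse)), "R2": float(r2)}

# Same illustrative arrays as in the earlier sketches
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 195.0, 270.0])
print(regression_report(y_true, y_pred))
# {'MAE': 11.25, 'MSE': 156.25, 'RMSE': 12.5, 'R2': 0.95}
```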
In the next section, we'll see how to easily compute these metrics using functions provided by Scikit-learn.