Once you've used the method of least squares to fit a regression line, $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$, to your data, the immediate question is: how good is this fit? Simply finding the line that minimizes the sum of squared errors doesn't tell us the magnitude of those errors or how much predictive power the model actually has. We need quantitative measures to assess the model's performance. Several metrics are commonly used, with Mean Squared Error (MSE) and R-squared ($R^2$) being fundamental.
Mean Squared Error (MSE)
The Mean Squared Error is a direct way to measure the average discrepancy between the observed actual values ($y_i$) and the values predicted by the model ($\hat{y}_i$). It represents the average of the squared differences between these values, often called squared residuals or squared errors.
Mathematically, for n data points, the MSE is calculated as:
$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$
Where:
$y_i$ is the actual value of the dependent variable for the $i$-th observation.
$\hat{y}_i$ is the predicted value of the dependent variable for the $i$-th observation using the regression line.
$n$ is the total number of observations.
Interpretation:
Magnitude: A smaller MSE indicates that the model's predictions are, on average, closer to the actual values, signifying a better fit. An MSE of 0 represents a perfect fit, where all predicted values exactly match the actual values (rare in practice).
Units: The MSE is measured in the squared units of the dependent variable ($y$). For example, if you are predicting house prices in dollars, the MSE will be in dollars squared. This can make direct interpretation difficult.
Sensitivity: Because the errors are squared, larger errors have a disproportionately large effect on the MSE. This means the metric is sensitive to outliers; a few points far from the regression line can significantly inflate the MSE, as illustrated in the code sketch below.
The least squares method itself is designed to minimize the Sum of Squared Errors (SSE), which is $\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$. The MSE is simply this sum divided by the number of data points, providing an average measure.
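To make the calculation concrete, here is a minimal NumPy sketch of the MSE computation, including the outlier effect described above. The arrays `y_actual` and `y_predicted` are made-up illustrative values, not data from this text.

```python
import numpy as np

# Made-up observed values and model predictions (illustrative only)
y_actual = np.array([3.1, 4.8, 7.2, 8.9, 11.3])
y_predicted = np.array([3.0, 5.1, 7.0, 9.2, 10.9])

# Residuals are the vertical distances y_i - y_hat_i
residuals = y_actual - y_predicted

# MSE: average of the squared residuals
mse = np.mean(residuals ** 2)
print(f"MSE: {mse:.4f}")

# One badly missed prediction dominates the average because errors are squared
y_bad = y_predicted.copy()
y_bad[-1] = 5.0  # this prediction is now off by 6.3
print(f"MSE with one outlier: {np.mean((y_actual - y_bad) ** 2):.4f}")
```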
Root Mean Squared Error (RMSE)
To address the interpretability issue of MSE's squared units, the Root Mean Squared Error (RMSE) is often used. It's simply the square root of the MSE:
$\text{RMSE} = \sqrt{\text{MSE}} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$
Interpretation:
Magnitude: Like MSE, a lower RMSE indicates a better fit.
Units: The primary advantage of RMSE is that its units are the same as the dependent variable ($y$). If predicting house prices in dollars, the RMSE is also in dollars. This makes it easier to understand the typical magnitude of the prediction error in the context of the problem. For instance, an RMSE of \$5,000 means the model's predictions are typically off by about \$5,000.
Sensitivity: RMSE remains sensitive to outliers because the errors are still squared before averaging, although taking the square root dampens the effect on the final value compared with MSE.
The plot shows data points, a fitted regression line, and highlights one residual $(y_i - \hat{y}_i)$ at $x = 4$. MSE and RMSE are calculated from the squared values of all such vertical distances.
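RMSE is a one-line extension of the MSE sketch above, again with made-up arrays:

```python
import numpy as np

# Made-up values reused from the MSE sketch (illustrative only)
y_actual = np.array([3.1, 4.8, 7.2, 8.9, 11.3])
y_predicted = np.array([3.0, 5.1, 7.0, 9.2, 10.9])

mse = np.mean((y_actual - y_predicted) ** 2)
rmse = np.sqrt(mse)  # back in the original units of y
print(f"RMSE: {rmse:.4f}")
```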
R-squared ($R^2$): Coefficient of Determination
While MSE and RMSE provide measures of the average prediction error in absolute terms, they don't tell us the proportion of the dependent variable's variability that the model successfully captures. This is where R-squared ($R^2$), also known as the coefficient of determination, comes in.
$R^2$ measures how much of the total variance in the dependent variable ($y$) can be explained by the independent variable(s) ($x$) included in the model. It compares the variance of the model's errors to the total variance of the dependent variable.
The formula for $R^2$ is:
$R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}}$
Where:
$SS_{\text{res}}$ is the Sum of Squared Residuals (or Sum of Squared Errors, SSE): $SS_{\text{res}} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$. This represents the variation left unexplained by the model.
$SS_{\text{tot}}$ is the Total Sum of Squares: $SS_{\text{tot}} = \sum_{i=1}^{n}(y_i - \bar{y})^2$, where $\bar{y}$ is the mean of the actual $y$ values. This represents the total variation in the dependent variable $y$ (see the sketch below).
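A minimal sketch of this computation, using the same made-up arrays as the earlier examples:

```python
import numpy as np

# Made-up observed values and model predictions (illustrative only)
y_actual = np.array([3.1, 4.8, 7.2, 8.9, 11.3])
y_predicted = np.array([3.0, 5.1, 7.0, 9.2, 10.9])

ss_res = np.sum((y_actual - y_predicted) ** 2)      # unexplained variation
ss_tot = np.sum((y_actual - y_actual.mean()) ** 2)  # total variation around the mean
r_squared = 1 - ss_res / ss_tot
print(f"R^2: {r_squared:.4f}")
```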
Interpretation:
Scale: $R^2$ typically ranges from 0 to 1.
Meaning:
An $R^2$ of 1 indicates that the regression line perfectly fits the data, explaining 100% of the variance in $y$.
An $R^2$ of 0 indicates that the model explains none of the variance in $y$. The model's predictions are no better than simply using the mean value $\bar{y}$ as the prediction for all observations.
An $R^2$ of 0.65 means that 65% of the total variability in the dependent variable $y$ can be explained by its linear relationship with the independent variable(s) $x$ in the model. The remaining 35% is unexplained by the model.
Negative Values: While less common, $R^2$ can be negative if the chosen model fits the data worse than a horizontal line at the mean $\bar{y}$ (for example, when the model is evaluated on held-out data or fitted without an intercept). This usually indicates a very poor model choice.
Limitations:
$R^2$ does not indicate whether the regression model is adequate. A high $R^2$ doesn't guarantee the model meets the underlying assumptions of linear regression. Always check residual plots.
$R^2$ will almost always increase when more predictor variables are added to the model, even if those variables are not actually useful. This can be misleading when comparing models with different numbers of predictors. (Adjusted R-squared is a related metric often used in multiple regression to penalize the addition of unnecessary variables; its formula is given below.)
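For reference, adjusted R-squared is commonly defined as

$\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}$

where $p$ is the number of predictors. The penalty grows as predictors are added without a compensating improvement in fit, so $\bar{R}^2$ can decrease when a useless variable is included.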
Choosing and Using Metrics
MSE, RMSE, and $R^2$ provide different perspectives on model performance.
Use MSE or RMSE when you need to understand the typical magnitude of prediction errors in the original units of the dependent variable (RMSE is usually preferred for this due to interpretability). They are good for comparing different models on the same dataset predicting the same outcome.
Use R-squared when you want to understand the proportion of variance explained by the model, providing a relative measure of goodness-of-fit, often expressed as a percentage.
It's generally recommended to look at multiple metrics, along with visualizations like residual plots, to get a comprehensive understanding of your regression model's performance and limitations before drawing conclusions or making decisions based on it.
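As a practical note, these metrics don't need to be hand-rolled. Here is a sketch using scikit-learn's metrics module (assuming scikit-learn is installed; the arrays are again made-up illustrative values):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up observed values and model predictions (illustrative only)
y_actual = np.array([3.1, 4.8, 7.2, 8.9, 11.3])
y_predicted = np.array([3.0, 5.1, 7.0, 9.2, 10.9])

mse = mean_squared_error(y_actual, y_predicted)
rmse = np.sqrt(mse)  # same units as y
r2 = r2_score(y_actual, y_predicted)
print(f"MSE: {mse:.4f}  RMSE: {rmse:.4f}  R^2: {r2:.4f}")
```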