While the Coefficient of Determination, or R-squared (R2), gives us a handy percentage representing the proportion of variance explained by our model, it's important to understand its limitations. Relying solely on R2 can sometimes paint an incomplete or even misleading picture of your regression model's performance.
One significant issue is that R2 never decreases when you add more independent variables (predictors or features) to your model; at worst it stays the same, and in practice it almost always increases. This happens even if the variables you add have no real relationship with the target variable you're trying to predict.
Think about it: adding more variables gives the model more flexibility to fit the training data, potentially capturing noise or random fluctuations rather than genuine patterns. A model with many irrelevant variables might show a high R2 on the data it was trained on, but it likely won't perform well when making predictions on new, unseen data. This encourages building overly complex models that don't generalize well.
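You can see this effect with a small simulation. The sketch below (synthetic data, with an invented seed and feature counts chosen purely for illustration) fits a linear model twice using scikit-learn: once with a single genuinely informative feature, and again after stacking on twenty columns of pure noise. The training R2 does not drop, even though the extra columns carry no signal.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# One genuinely informative feature, plus noise in the target
n = 100
x_real = rng.normal(size=(n, 1))
y = 3.0 * x_real[:, 0] + rng.normal(scale=2.0, size=n)

# Twenty columns of pure noise with no relationship to y
x_noise = rng.normal(size=(n, 20))

r2_real = LinearRegression().fit(x_real, y).score(x_real, y)

X_all = np.hstack([x_real, x_noise])
r2_all = LinearRegression().fit(X_all, y).score(X_all, y)

print(f"R2 with 1 real feature:         {r2_real:.3f}")
print(f"R2 after adding 20 noise cols:  {r2_all:.3f}")  # never lower
```

The second score is higher on the training data, yet the noise-laden model would generalize worse, which is exactly the trap described above.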
Note: More advanced metrics like Adjusted R-squared exist, which try to penalize the score for adding variables that don't improve the model significantly. However, for this introductory course, the main takeaway is to be wary of chasing a high R2 simply by adding more inputs.
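Although Adjusted R-squared is beyond the scope of this course, its formula is simple enough to sketch: it discounts R2 based on the number of predictors p relative to the number of samples n. The R2 value and sample sizes below are illustrative numbers, not results from a real model.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R-squared: 1 - (1 - R2) * (n - 1) / (n - p - 1).

    Penalizes the plain R2 for every predictor added, so irrelevant
    features tend to lower the score instead of inflating it.
    """
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Same raw R2 of 0.90, but very different model sizes:
print(adjusted_r2(0.90, n_samples=100, n_features=2))   # ~0.898, small penalty
print(adjusted_r2(0.90, n_samples=100, n_features=50))  # ~0.798, large penalty
```

Two models with identical raw R2 scores can thus receive quite different adjusted scores once model complexity is taken into account.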
A high R2 value doesn't automatically mean your model is "good" or appropriate for your task. Here's why:
R-squared measures the strength of correlation captured by the model, not whether the relationship makes sense or if one variable causes another. You might find a high R2 between two variables that are coincidentally related or both influenced by a third, unobserved factor. It quantifies fit, not the theoretical soundness or causal validity of the model.
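The "third, unobserved factor" situation is easy to simulate. In this hedged sketch (all variables and scales are invented), a hidden confounder drives both x and y; x has no causal effect on y at all, yet regressing y on x still yields a high R2.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)

# A hidden confounder drives both x and y; x does NOT cause y.
confounder = rng.normal(size=(300, 1))
x = confounder + rng.normal(scale=0.3, size=(300, 1))
y = 2.0 * confounder[:, 0] + rng.normal(scale=0.3, size=300)

r2 = LinearRegression().fit(x, y).score(x, y)
print(f"R2 of y regressed on x: {r2:.2f}")  # high, despite no causal link
```

The model fits well numerically, but interpreting that fit causally would be a mistake: R2 quantifies association, nothing more.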
What counts as a "good" R2 score depends heavily on the context of the problem. In settings with precise, well-controlled measurements, such as a physics experiment, an R2 below 0.9 might indicate a problem with the model, while in noisy domains like social science or finance, an R2 of 0.3 can still represent a genuinely useful result.
Furthermore, R2 doesn't tell you if the prediction errors (measured by MAE or RMSE) are acceptably small for your specific application. A model could explain 90% of the variance (R2=0.9) but still have an average error (MAE) that is too large for practical use.
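Here is a sketch of that exact situation, using invented house-price-style numbers purely for illustration: the "model" explains roughly 90% of the variance, yet its typical error is still on the order of tens of thousands of dollars, which might be unacceptable for the application.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error

rng = np.random.default_rng(1)

# Hypothetical prices in dollars: a wide spread means a model can explain
# most of the variance while still being far off on each prediction.
y_true = rng.normal(loc=300_000, scale=100_000, size=500)
y_pred = y_true + rng.normal(scale=30_000, size=500)  # the model's errors

r2 = r2_score(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)

print(f"R2:  {r2:.2f}")          # around 0.9
print(f"MAE: {mae:,.0f} dollars")  # still tens of thousands off, on average
```

Whether that MAE is tolerable is a business question the R2 score alone cannot answer.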
R-squared is a valuable metric for understanding the proportion of variance explained by your regression model. However, it shouldn't be the only metric you consider. Always evaluate it alongside error metrics like MAE, MSE, and RMSE, and use visualizations (like scatter plots of predicted vs. actual values, or residual plots) to get a more complete understanding of your model's strengths and weaknesses. Think of R2 as one important indicator among several needed to judge how well your model performs.
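Putting that advice together, a minimal evaluation sketch (synthetic data with invented coefficients) might report R2 alongside the error metrics and a quick residual check, rather than relying on any single number:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

rng = np.random.default_rng(2)

# Synthetic regression problem with known coefficients
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)
residuals = y - y_pred

r2 = r2_score(y, y_pred)
mae = mean_absolute_error(y, y_pred)
rmse = np.sqrt(mean_squared_error(y, y_pred))

print(f"R2:   {r2:.3f}")
print(f"MAE:  {mae:.3f}")
print(f"RMSE: {rmse:.3f}")

# Residuals should look like unstructured noise centered on zero; visible
# patterns (curvature, funnel shapes) in a residual plot signal problems
# that no summary score will reveal on its own.
print(f"Mean residual: {residuals.mean():.4f}")
```

Reviewing these numbers together, and plotting predicted vs. actual values and residuals, gives the fuller picture this section argues for.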
© 2025 ApX Machine Learning