When we train a machine learning model, our ultimate goal isn't just to perform well on the data we used for training. We want the model to generalize, meaning it should make accurate predictions on new, unseen data that it didn't encounter during training. However, achieving good generalization can be tricky. Two common pitfalls stand in the way: underfitting and overfitting. Understanding these issues is fundamental to evaluating and selecting effective models.
Imagine trying to draw a straight line through a set of data points that clearly follow a curve. The straight line is too simple to capture the underlying pattern. This is the essence of underfitting.
An underfit model fails to capture the important relationships between the input features and the target variable, even in the training data. It's often characterized by high bias. Bias refers to the error introduced by approximating a real-world problem, which may be complex, by a much simpler model.
Characteristics of Underfitting:

- High error on the training data itself.
- Similarly high error on new, unseen data.
- The model is too simple to represent the underlying relationship (high bias).

Underfitting typically occurs when the model chosen is not complex enough for the data (e.g., using linear regression for highly non-linear data) or when the model hasn't been trained long enough.
An underfit linear model fails to capture the quadratic relationship in the data, resulting in high error on both training and potential test points.
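We can reproduce this situation in a few lines. The sketch below (assuming scikit-learn and NumPy are available; the quadratic dataset and noise level are illustrative choices) fits a straight line to data whose true relationship is quadratic, and reports the error on both the training points and held-out points:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# True relationship is quadratic: y = x^2 plus a little noise.
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(0, 0.5, size=200)

X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

# A straight line cannot represent y = x^2, so error stays high everywhere.
model = LinearRegression().fit(X_train, y_train)
train_mse = mean_squared_error(y_train, model.predict(X_train))
test_mse = mean_squared_error(y_test, model.predict(X_test))
print(f"train MSE: {train_mse:.2f}")
print(f"test MSE:  {test_mse:.2f}")
```

The telltale sign of underfitting shows up in the output: the training error is already large, and the test error is about the same. Making the model more flexible, not gathering more data, is what would help here.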
Now, consider the opposite scenario. Imagine drawing a line that perfectly snakes through every single training data point, including any random noise or outliers. This model looks perfect on the training data, but it has essentially memorized it rather than learning the general underlying trend. This is overfitting.
An overfit model learns the training data too well. It captures not only the underlying patterns but also the noise and random fluctuations specific to the training set. These models are often characterized by high variance. Variance refers to the amount by which the model's learned function would change if we trained it on a different training dataset. A model with high variance is overly sensitive to the specific training data it saw.
Characteristics of Overfitting:

- Very low error on the training data.
- Significantly higher error on new, unseen data.
- The model is overly sensitive to the specifics of the training set (high variance).

Overfitting often happens when the model is too complex relative to the amount and noisiness of the training data (e.g., using a very high-degree polynomial for a relatively simple relationship) or when trained for too long on noisy data.
An overfit model follows the noisy training data points too closely. While it has low error on these specific points, it likely deviates significantly from the true underlying trend (which might be closer to the quadratic curve shown previously) and will perform poorly on new data.
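The same quadratic data makes the opposite failure easy to demonstrate. In this sketch (again assuming scikit-learn; degree 15 and the small sample of 30 points are deliberately extreme choices to provoke overfitting), a high-degree polynomial is fit to a handful of noisy points:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)

# Only 30 noisy training points drawn from the quadratic relationship.
X_train = rng.uniform(-3, 3, size=(30, 1))
y_train = X_train[:, 0] ** 2 + rng.normal(0, 1.0, size=30)

# A larger held-out sample from the same distribution.
X_test = rng.uniform(-3, 3, size=(100, 1))
y_test = X_test[:, 0] ** 2 + rng.normal(0, 1.0, size=100)

# A degree-15 polynomial has enough flexibility to chase the noise.
overfit = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
overfit.fit(X_train, y_train)

train_mse = mean_squared_error(y_train, overfit.predict(X_train))
test_mse = mean_squared_error(y_test, overfit.predict(X_test))
print(f"train MSE: {train_mse:.2f}")
print(f"test MSE:  {test_mse:.2f}")
```

Here the signature is reversed: training error is very small because the curve threads through the noise, but the test error is noticeably worse, exposing the model's high variance.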
Underfitting (high bias) and overfitting (high variance) represent two extremes in model complexity, and there is a fundamental tension between them: making a model more flexible reduces bias but tends to increase variance, while simplifying it reduces variance but tends to increase bias.
Our goal is typically to find a sweet spot, a model complexity that achieves a good balance between bias and variance, leading to the best possible performance on unseen data. This concept is known as the Bias-Variance Tradeoff.
As model complexity increases, bias typically decreases while variance increases. The total error (a combination of bias, variance, and irreducible error) often follows a U-shape, indicating an optimal complexity level that minimizes generalization error.
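The U-shape is straightforward to see empirically. This sketch (assumptions as before: scikit-learn, an illustrative quadratic dataset, and an arbitrary set of polynomial degrees) sweeps model complexity and records training and test error at each degree:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)

def make_data(n):
    """Samples from the same quadratic-plus-noise relationship."""
    X = rng.uniform(-3, 3, size=(n, 1))
    y = X[:, 0] ** 2 + rng.normal(0, 1.0, size=n)
    return X, y

X_train, y_train = make_data(40)
X_test, y_test = make_data(200)

train_errs, test_errs = {}, {}
for degree in [1, 2, 3, 5, 10, 15]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_errs[degree] = mean_squared_error(y_train, model.predict(X_train))
    test_errs[degree] = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree {degree:2d}: "
          f"train {train_errs[degree]:.2f}, test {test_errs[degree]:.2f}")
```

Training error keeps falling as the degree grows, since a more flexible model can always fit the training set at least as well. Test error, by contrast, drops sharply once the model can express the quadratic shape and then degrades again at high degrees, tracing out the U-shape described above.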
Recognizing the potential for overfitting and underfitting is the first step toward building models that generalize well. Evaluating a model solely on its training performance gives a misleading picture, especially if the model is prone to overfitting. The techniques discussed next, such as train-test splits and cross-validation, are designed specifically to estimate a model's performance on unseen data and help us navigate the bias-variance tradeoff effectively.
© 2025 ApX Machine Learning