When we train a machine learning model, our goal isn't just for it to perform well on the data it has already seen (the training data). We want the model to generalize well, meaning it should also make accurate predictions on new, unseen data. Think of it like studying for an exam: you don't want to just memorize the answers to the practice questions; you want to understand the concepts so you can answer different questions on the actual exam.
Two common problems prevent models from generalizing well: underfitting and overfitting. They represent two extremes in how a model learns from data.
Imagine trying to draw a straight line through data points that clearly follow a curve. The straight line is too simple to capture the underlying pattern. This is underfitting.
An underfit model is not complex enough to learn the significant trends in the training data. It performs poorly not only on the training data but also on new, unseen data (like the test set). It fails to capture the relationships between features and labels.
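To make this concrete, here is a small sketch of underfitting. It fits a plain straight line to points that actually follow a curve; the data is synthetic and the use of NumPy and scikit-learn is an illustrative assumption, not something the example above prescribes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic, curved data: a hump-shaped pattern plus a little noise
X = rng.uniform(0, 1, size=(40, 1))
y = np.sin(np.pi * X).ravel() + rng.normal(0, 0.1, size=40)

# Hold out some points to stand in for new, unseen data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A straight line is too simple for this hump-shaped pattern
line = LinearRegression().fit(X_train, y_train)

print("Training R^2:", round(line.score(X_train, y_train), 3))
print("Test R^2:    ", round(line.score(X_test, y_test), 3))
# Both scores typically come out close to zero: the model underfits,
# missing the curve on its own training data and on new data alike.
```

An R² near zero means the line does little better than always predicting the average label, which is exactly what "failing to capture the relationship" looks like in numbers.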
Now imagine drawing a wild, squiggly line that passes perfectly through every single data point in your training set. Although it looks impressive on the training data, this line has likely learned not just the underlying pattern but also the random noise and specific quirks of that particular dataset. This is overfitting.
An overfit model is too complex. It essentially memorizes the training data, including its noise, rather than learning the general pattern. When presented with new data, which won't have the exact same noise and quirks, the model performs poorly.
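A similar hypothetical sketch shows the opposite extreme: a very flexible degree-15 polynomial trained on only a handful of synthetic points. Again, the data and the scikit-learn setup are illustrative assumptions chosen to provoke overfitting.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)

# A small, noisy synthetic sample of the same hump-shaped pattern
X = rng.uniform(0, 1, size=(20, 1))
y = np.sin(np.pi * X).ravel() + rng.normal(0, 0.2, size=20)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# A degree-15 polynomial has far more flexibility than 15 training points need
wiggly = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
wiggly.fit(X_train, y_train)

print("Training R^2:", round(wiggly.score(X_train, y_train), 3))
print("Test R^2:    ", round(wiggly.score(X_test, y_test), 3))
# The training score is typically essentially perfect, while the test score
# is much lower and can even go negative: the squiggly curve has memorized
# the noise in the training points rather than the general pattern.
```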
The ideal model lies between these extremes. It's complex enough to capture the underlying trend in the data but simple enough to avoid memorizing the noise. This model achieves a good fit and generalizes well to new data.
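Continuing the same hypothetical data recipe, a moderately flexible model (a cubic here, as an illustrative choice) lands in that middle ground: it follows the hump without chasing the noise, so the training and test scores come out close together.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# The same style of synthetic, hump-shaped data as before
X = rng.uniform(0, 1, size=(40, 1))
y = np.sin(np.pi * X).ravel() + rng.normal(0, 0.1, size=40)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A cubic is flexible enough for the hump, but not enough to memorize noise
good_fit = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
good_fit.fit(X_train, y_train)

print("Training R^2:", round(good_fit.score(X_train, y_train), 3))
print("Test R^2:    ", round(good_fit.score(X_test, y_test), 3))
# Both scores are typically high and similar: the model generalizes well.
```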
The following chart illustrates these three scenarios:
The scatter points represent the training data. The blue dashed line (Underfitting) is too simple. The green solid line (Good Fit) captures the general trend. The pink dotted line (Overfitting) follows the training points too closely, including noise.
Recognizing and avoiding overfitting and underfitting is a central challenge in machine learning. The techniques we use, such as splitting data into training and test sets, choosing appropriate model complexity (which relates back to parameters and hyperparameters), and using performance metrics (covered next), are all designed to help us find that "sweet spot" and build models that are genuinely useful on new data.
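One simple, hypothetical version of that search is sketched below: hold out a test set once, sweep the model's complexity, and watch where the test error bottoms out. The synthetic data, the choice of polynomial degree as the complexity knob, and the scikit-learn calls are all illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Synthetic hump-shaped data again, purely for illustration
X = rng.uniform(0, 1, size=(60, 1))
y = np.sin(np.pi * X).ravel() + rng.normal(0, 0.1, size=60)

# Split once so every candidate model is judged on the same unseen data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Sweep model complexity (polynomial degree) and track both errors
for degree in [1, 2, 3, 5, 9, 15]:
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree {degree:>2}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")

# Training error keeps shrinking as the degree grows, but test error usually
# falls and then rises again; the degree with the lowest test error marks the
# "sweet spot" between underfitting and overfitting.
```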