You've trained a neural network, and it achieves impressive accuracy, maybe even 99%, on the data you used for training. That's great, but it's only half the story. The ultimate measure of a model's success isn't how well it remembers the training data; it's how well it performs on new data it has never encountered. This ability to perform well on new inputs is known as generalization.
Think about building a spam filter. If it only learns to identify the exact spam emails used during its training, it will be useless against the new variations spammers constantly create. Similarly, a model predicting house prices needs to work for houses not in its original dataset, and a medical diagnosis system must be accurate for new patients. The core objective of most machine learning applications, particularly in deep learning, is to build models that generalize effectively.
We can think of the data we care about (e.g., all possible images of cats and dogs, all potential spam emails) as coming from some underlying, unknown data distribution. Our training dataset is just a finite sample drawn from this distribution. Our goal is to train a model that learns the true underlying patterns of this distribution, not just the specific quirks or noise present in our limited training sample.
During training, we typically monitor the model's performance on the training data itself. This is often measured using a loss function (like cross-entropy or mean squared error), which quantifies how far the model's predictions are from the true labels in the training set. Let's call the average loss over the training set the training error, $E_{\text{train}}$.
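As a minimal sketch of this idea (the mean squared error loss and the generic `predict` function below are assumptions for illustration, not something the text prescribes), the training error is simply the loss averaged over the training examples:

```python
import numpy as np

def mse_loss(y_pred, y_true):
    """Mean squared error for a batch of predictions."""
    return np.mean((y_pred - y_true) ** 2)

def training_error(predict, X_train, y_train):
    """Average loss of the model over the training set (E_train)."""
    return mse_loss(predict(X_train), y_train)

# Hypothetical usage with a linear model y = Xw:
# predict = lambda X: X @ w          # w obtained from training
# e_train = training_error(predict, X_train, y_train)
```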
However, minimizing $E_{\text{train}}$ is not our primary goal. What we actually want to minimize is the generalization error (also called test error), $E_{\text{gen}}$, which is the expected error of the model on new, unseen data points drawn from the same underlying data distribution.
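Written out, and assuming a loss function $L$, a model $f_\theta$, a training set of $N$ examples, and an underlying data distribution $p_{\text{data}}$ (this notation is ours, introduced for clarity), the two quantities are:

$$
E_{\text{train}} = \frac{1}{N} \sum_{i=1}^{N} L\big(f_\theta(x_i),\, y_i\big),
\qquad
E_{\text{gen}} = \mathbb{E}_{(x,\, y) \sim p_{\text{data}}}\big[\, L\big(f_\theta(x),\, y\big) \,\big].
$$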
In practice, we can't directly measure the true generalization error because we don't have access to the entire data distribution. Instead, we estimate it by evaluating the trained model on a separate dataset called the test set (or sometimes a validation set). This dataset contains examples the model has never seen during training. The model's performance on this hold-out set gives us an approximation of how well it generalizes.
Ideally, a model's performance on the test set should be close to its performance on the training set. However, especially with complex deep learning models that have many parameters, it's common to see a significant difference. The model might achieve very low training error but perform much worse on the test set. This difference between training error and test error is sometimes referred to as the generalization gap.
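As a hedged illustration of this workflow (the synthetic data, the Ridge model, and all hyperparameters below are placeholders, not anything the text specifies), a common pattern is to hold out a test split, evaluate on it, and compare against the training error:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# Synthetic data purely for illustration; in practice X and y come from your problem.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = X @ rng.normal(size=20) + 0.1 * rng.normal(size=1000)

# Hold out a test set that the model never sees during training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = Ridge(alpha=1.0).fit(X_train, y_train)

e_train = mean_squared_error(y_train, model.predict(X_train))  # training error
e_test = mean_squared_error(y_test, model.predict(X_test))     # estimate of generalization error
print(f"E_train = {e_train:.4f}, E_test = {e_test:.4f}, gap = {e_test - e_train:.4f}")
```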
A common pattern during training: the training loss decreases consistently, while the validation loss decreases at first and then begins to rise, indicating that the model is starting to overfit the training data and its generalization performance is degrading.
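A minimal PyTorch sketch of how this pattern is typically observed: track the training loss and a held-out validation loss every epoch. The synthetic data, the deliberately over-sized network, and the training length are assumptions made for illustration; whether the validation curve actually turns upward depends on those choices.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Synthetic regression data for illustration; real projects use their own train/val split.
X = torch.randn(1200, 10)
y = X @ torch.randn(10, 1) + 0.1 * torch.randn(1200, 1)
X_train, y_train = X[:1000], y[:1000]
X_val, y_val = X[1000:], y[1000:]

# Over-parameterized model so it can eventually fit noise in the training set.
model = nn.Sequential(nn.Linear(10, 256), nn.ReLU(), nn.Linear(256, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(200):
    # Full-batch training step (mini-batching omitted to keep the sketch short).
    model.train()
    optimizer.zero_grad()
    train_loss = loss_fn(model(X_train), y_train)
    train_loss.backward()
    optimizer.step()

    # Validation loss on data the optimizer never touches.
    model.eval()
    with torch.no_grad():
        val_loss = loss_fn(model(X_val), y_val)

    if epoch % 20 == 0:
        print(f"epoch {epoch:3d}  train {train_loss.item():.4f}  val {val_loss.item():.4f}")
```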
Understanding why this gap occurs and how to minimize it is fundamental to building effective deep learning models. When a model fails to generalize well, it typically falls into one of two categories: underfitting or overfitting. These concepts, along with the tools to diagnose and address them, form the core focus of this course. The techniques we will cover, namely regularization and optimization methods, are designed specifically to improve a model's ability to generalize from the training data to unseen examples.