Imagine you have a dataset, like a collection of flashcards you want to use to study for an exam. To properly prepare, you wouldn't just memorize the flashcards and then test yourself on the exact same cards. You'd study one set and then test yourself on a different set you haven't seen before to see if you truly learned the material.
The training set is like that first set of flashcards you study from. It's the specific portion of your overall dataset that you feed directly into your machine learning algorithm. Its primary purpose is to allow the model to learn.
What does "learning" mean here? During the training phase, the model examines the training data, looking for patterns, relationships, and structures.
Essentially, the model adjusts its internal parameters based on the examples provided in the training set. It tries to find a mathematical representation or a set of rules that connects the input features to the known output labels or values present in this training data. Think of it as the model building its understanding of the problem based only on this specific subset of data.
This set typically contains both:

- Input features (X): the attributes or measurements that describe each example.
- Output labels or values (y): the known answers the model should learn to predict.
The model uses both X and y in the training set to figure out the mapping f such that f(X) approximates y.
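To make this concrete, here is a minimal sketch of that fitting step. It assumes scikit-learn purely for illustration (the section doesn't prescribe a library); the call to `fit` is where the model adjusts its internal parameters so that f(X) approximates y.

```python
# A minimal sketch of training, assuming scikit-learn (an illustrative choice;
# the same idea applies to any supervised learning library).
from sklearn.linear_model import LinearRegression

# Toy training set: X holds the input features, y the known target values.
X_train = [[1.0], [2.0], [3.0], [4.0]]
y_train = [2.1, 4.0, 6.2, 7.9]

model = LinearRegression()
model.fit(X_train, y_train)  # the model adjusts its parameters so f(X) approximates y

print(model.predict([[5.0]]))  # apply the learned mapping f to a new input
```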
Because the model explicitly uses the training set to adjust itself and learn these patterns, its performance measured on this same data can be misleadingly high. The model might simply memorize the training examples, including their noise and specific quirks, rather than learning the underlying general pattern. This is why we absolutely need a separate set of data, the test set (which we'll discuss next), to get an honest assessment of how well the model will perform on new, unseen data.
Generally, the training set constitutes the larger fraction of your total data (common splits are 70% or 80% for training), as providing the model with more examples usually helps it learn more robust patterns.
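Putting the last two ideas together, the sketch below (again assuming scikit-learn, with a synthetic dataset generated only for illustration) performs a common 80/20 split and shows how an unconstrained model can score almost perfectly on its own training set, because it has memorized it, while the held-out test set gives the honest estimate.

```python
# A sketch of an 80/20 split and of why training-set accuracy can mislead.
# scikit-learn and the synthetic dataset are assumptions made for this example.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data with some label noise (flip_y) to mimic real quirks.
X, y = make_classification(n_samples=500, n_features=20, flip_y=0.1, random_state=0)

# Common split: 80% of the data for training, 20% held out for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# An unconstrained decision tree can memorize the training examples, noise included.
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)

print("Training accuracy:", tree.score(X_train, y_train))  # typically ~1.0 (memorization)
print("Test accuracy:    ", tree.score(X_test, y_test))    # noticeably lower, the honest estimate
```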