To evaluate machine learning models effectively, data is typically split into a training set and a test set. The training set allows the model to learn, while the test set is reserved to assess how well the model performs on data it hasn't encountered before. This process helps estimate how the model will generalize to new, real-world situations.
But how much data should go into the training set versus the test set? No single ratio works for every situation, but common conventions and guidelines provide a good starting point. The choice involves a trade-off: more training data generally helps the model learn better, while more test data gives a more reliable estimate of its performance.
Let's look at the most frequently used split ratios:
The 80/20 split is perhaps the most common starting point: 80% of the data is used for training the model, and the remaining 20% is reserved for testing.
Another widely used ratio allocates 70% of the data for training and 30% for testing.
With very large datasets (think millions or billions of examples), even 10% of the data can be more than enough for a reliable test set, which leaves 90% for training.
Common data split proportions visualized. The blue portion represents data used for training the model, and the orange portion represents data held out for testing.
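To make these ratios concrete, here is a minimal sketch that converts a test fraction into sample counts. The helper `split_sizes` and the dataset size of 1,000 are hypothetical, chosen only for illustration.

```python
def split_sizes(n_samples: int, test_fraction: float) -> tuple[int, int]:
    """Return (n_train, n_test) for a given test fraction."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    # Round to the nearest whole sample; the rest goes to training.
    n_test = round(n_samples * test_fraction)
    n_train = n_samples - n_test
    return n_train, n_test


# Hypothetical dataset size, used only for illustration.
n_samples = 1_000
for name, test_fraction in [("80/20", 0.2), ("70/30", 0.3), ("90/10", 0.1)]:
    n_train, n_test = split_sizes(n_samples, test_fraction)
    print(f"{name}: {n_train} training samples, {n_test} test samples")
```

If you use scikit-learn, its `train_test_split` function takes a `test_size` argument that plays the same role as `test_fraction` here.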
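The total number of samples in your dataset heavily influences the appropriate split ratio. With a small dataset, the test set needs a larger share of the data to contain enough examples for a trustworthy estimate. With a very large dataset, a small percentage already yields many test examples.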
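One rough way to see why dataset size matters is to look at how noisy an accuracy estimate is for a given test set size. The sketch below uses the binomial standard error, sqrt(p(1 - p) / n), and assumes that the metric is accuracy, that test examples are independent, and that the true accuracy is 0.8. These are illustrative assumptions, not values from this section.

```python
import math


def accuracy_standard_error(accuracy: float, n_test: int) -> float:
    """Binomial standard error of an accuracy estimate from n_test examples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_test)


# Assumed true accuracy, chosen only for illustration.
ASSUMED_ACCURACY = 0.8

# (total samples, test fraction): a small dataset with an 80/20 split
# and a large dataset with a 90/10 split.
for n_total, test_fraction in [(1_000, 0.2), (1_000_000, 0.1)]:
    n_test = round(n_total * test_fraction)
    se = accuracy_standard_error(ASSUMED_ACCURACY, n_test)
    print(f"{n_total:>9,} samples, {test_fraction:.0%} test "
          f"-> {n_test:,} test examples, standard error ~ {se:.4f}")
```

Under these assumptions, the 200-example test set gives an estimate that is uncertain by a few percentage points, while the 100,000-example test set narrows that to roughly a tenth of a percentage point, even though it is a smaller fraction of its dataset.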
Regardless of the ratio, the primary goal is for both the training set and the test set to be representative of the overall data distribution. You want the patterns, variations, and potential challenges present in your full dataset to be reflected in both subsets. This is why simply taking the first 80% of your data is usually a bad idea, especially if the data has some inherent order (like time). Random shuffling before splitting, which we discuss next, is typically essential.
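The effect of ordering can be sketched with a toy example. Here a hypothetical dataset of 1,000 labels is sorted by class, and an 80/20 split is taken with and without shuffling. The helper names `split` and `positive_rate` are illustrative, and only the Python standard library is used.

```python
import random


def split(items, test_fraction, shuffle, seed=0):
    """Split items into (train, test), optionally shuffling a copy first."""
    items = list(items)  # copy so the caller's data is left untouched
    if shuffle:
        random.Random(seed).shuffle(items)
    n_test = round(len(items) * test_fraction)
    cut = len(items) - n_test
    return items[:cut], items[cut:]


def positive_rate(labels):
    """Fraction of labels equal to 1."""
    return sum(labels) / len(labels)


# Hypothetical ordered dataset: 500 examples of class 0, then 500 of class 1.
labels = [0] * 500 + [1] * 500

train_ordered, test_ordered = split(labels, 0.2, shuffle=False)
train_shuffled, test_shuffled = split(labels, 0.2, shuffle=True, seed=42)

print("overall positive rate:", positive_rate(labels))
print("ordered split, test positive rate:", positive_rate(test_ordered))
print("shuffled split, test positive rate:", positive_rate(test_shuffled))
```

Without shuffling, the test set contains only class 1 and the training set under-represents it, so neither subset reflects the overall 50/50 balance. After shuffling, both subsets should land close to that balance.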
Choosing a split ratio is a practical decision. Starting with 80/20 or 70/30 is often reasonable. Consider your dataset size and how confident you need to be in your test results when making the final choice.