Once you've structured your data numerically, scaled features, and handled categorical variables, there's another essential step before you can start training your neural network: dividing your dataset appropriately. Simply training your model on all available data and then evaluating it on that same data is misleading. The model might perform exceptionally well, but this performance often stems from memorizing the training examples rather than learning general patterns. This phenomenon is known as overfitting, and it leads to poor performance when the model encounters new, unseen data.
To build models that generalize well to new data, we need a way to assess performance on examples the model hasn't seen during training. This is achieved by splitting the dataset into distinct subsets, typically three: a training set, a validation set, and a test set.
The training set is the largest portion of your data and the only data the model learns from directly. During the training process, the model iterates over the training set, calculating predictions, measuring the error (loss), and adjusting its internal parameters (weights and biases) via backpropagation and gradient descent to minimize this error. The goal is for the model to learn the underlying mapping from inputs to outputs present in this data.
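As a rough sketch of that loop, the snippet below shows one way it can look in code. It assumes the PyTorch library, and the tiny model and random stand-in data are purely illustrative rather than part of this chapter's example.
# A minimal sketch of the training loop described above, assuming PyTorch;
# the model architecture and random data are placeholders.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 1))  # hypothetical network
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

X_train_demo = torch.randn(100, 10)  # stand-in for your real training features
y_train_demo = torch.randn(100, 1)   # stand-in for your real training targets

for epoch in range(20):
    predictions = model(X_train_demo)          # forward pass: calculate predictions
    loss = loss_fn(predictions, y_train_demo)  # measure the error (loss)
    optimizer.zero_grad()                      # clear gradients from the previous step
    loss.backward()                            # backpropagation computes gradients
    optimizer.step()                           # gradient descent adjusts weights and biases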
After training the model on the training set, how do you tune its settings or decide which model architecture works best? Evaluating directly on the training set isn't helpful due to overfitting. Evaluating on the final test set during the tuning process is also problematic, as you might inadvertently tune your model to perform well specifically on that test set, leading to an inflated sense of its real-world performance.
This is where the validation set (sometimes called the development set or dev set) comes in. It's a separate portion of the data that the model does not train on. Its primary uses are tuning hyperparameters (the model's settings) and deciding which model architecture works best.
The validation set acts as a proxy for unseen data during the model development cycle.
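For instance, a common pattern is to train several candidate models on the training set and compare them on the validation set. The sketch below assumes scikit-learn's MLPClassifier and splits named X_train, y_train, X_val, y_val like those produced by the splitting code later in this section; the candidate layer sizes are hypothetical.
# A minimal sketch of model selection using the validation set.
from sklearn.neural_network import MLPClassifier

candidate_configs = [(32,), (64,), (64, 32)]  # hypothetical hidden-layer sizes to compare
best_config, best_score = None, -1.0

for hidden_layers in candidate_configs:
    model = MLPClassifier(hidden_layer_sizes=hidden_layers, max_iter=500, random_state=42)
    model.fit(X_train, y_train)        # learn only from the training set
    score = model.score(X_val, y_val)  # judge each candidate on the validation set
    if score > best_score:
        best_config, best_score = hidden_layers, score

print(f"Best hidden layers: {best_config} (validation accuracy {best_score:.3f})")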
Once you have used the training set to train your model and the validation set to tune hyperparameters and make modeling decisions, you need a final, unbiased evaluation of how well your chosen model is likely to perform on completely new data. This is the role of the test set.
The test set is held back and used only once, after all training and tuning are complete. It should represent the real-world data the model will encounter. Evaluating on the test set provides the final performance metrics (such as accuracy, precision, recall, or mean squared error) that you would report for your model. It is essential not to use the test set for any tuning or model selection decisions; doing so invalidates its purpose as an unbiased estimate of generalization performance. Think of it as the final exam for your model.
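In code, that final check can be as simple as the sketch below. It assumes final_model is your chosen, fully trained model and that X_test and y_test are the held-out split created by the code later in this section; the metric choices are illustrative.
# A minimal sketch of the single, final evaluation on the test set.
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_pred = final_model.predict(X_test)  # the test set is touched only once, here

print(f"Test accuracy:  {accuracy_score(y_test, y_pred):.3f}")
print(f"Test precision: {precision_score(y_test, y_pred, average='macro'):.3f}")
print(f"Test recall:    {recall_score(y_test, y_pred, average='macro'):.3f}")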
How much data should you allocate to each set? There are no rigid rules, and the optimal split depends on the total size of your dataset and the specific problem. Common starting points are a 70/15/15 or an 80/10/10 split across training, validation, and test sets.
For very large datasets (millions of examples), the validation and test sets can sometimes be much smaller percentages (e.g., 98% train, 1% validation, 1% test), because even 1% represents a substantial number of examples for reliable evaluation.
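To make that concrete, the short sketch below prints the absolute number of examples each split receives under the two allocations mentioned above; the dataset sizes are purely illustrative.
# Absolute split sizes for two common allocations (illustrative dataset sizes).
for n_examples, (train_frac, val_frac, test_frac) in [
    (10_000, (0.70, 0.15, 0.15)),
    (5_000_000, (0.98, 0.01, 0.01)),
]:
    print(
        f"{n_examples:>9,} examples -> "
        f"train {round(n_examples * train_frac):,}, "
        f"val {round(n_examples * val_frac):,}, "
        f"test {round(n_examples * test_frac):,}"
    )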
Figure: the original dataset split into training, validation, and test sets, with the primary role of each.
In practice, you rarely perform these splits manually. Libraries like Scikit-learn in Python provide convenient functions. A common approach is to first split the data into a larger training set and a smaller test set, and then split the larger training set again into a final training set and a validation set.
# Example using Scikit-learn
from sklearn.model_selection import train_test_split
# Assume X contains your features and y contains your labels/targets
# First split into training+validation (85%) and test (15%)
X_train_val, X_test, y_train_val, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42, stratify=y  # Use stratify for classification
)
# Calculate the validation size relative to the train_val set.
# If the test set is 15% and the validation set should also be 15% of the
# original data, the fraction relative to the remaining 85% is 0.15 / 0.85 = ~0.176
val_size_relative = 0.15 / (1.0 - 0.15)
# Split train_val into final training and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=val_size_relative,
    random_state=42, stratify=y_train_val  # Use stratify again
)
print(f"Original dataset size: {len(X)}")
print(f"Training set size: {len(X_train)}")
print(f"Validation set size: {len(X_val)}")
print(f"Test set size: {len(X_test)}")
Notice the stratify parameter in the train_test_split example. When dealing with classification problems, especially if certain classes are much less frequent than others (imbalanced data), it's important to ensure that the proportion of each class is roughly the same in the training, validation, and test sets as it was in the original dataset. Randomly splitting might lead to sets where some classes are over- or underrepresented, potentially biasing the training or evaluation. Stratified splitting preserves the class proportions across the splits, leading to more reliable model development and evaluation. You typically stratify based on the target variable y.
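If you want to confirm that stratification worked, a quick check like the sketch below compares the class proportions in each split; it assumes y and the splits created above are array-like label vectors.
# A minimal sketch of verifying that class proportions are preserved.
import numpy as np

for name, labels in [("full", y), ("train", y_train), ("val", y_val), ("test", y_test)]:
    values, counts = np.unique(labels, return_counts=True)
    proportions = np.round(counts / counts.sum(), 3)
    print(f"{name:>5}: {dict(zip(values, proportions))}")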
By carefully partitioning your data into these three sets, you establish a sound methodology for training your neural network, tuning its configuration, and obtaining a trustworthy measure of its ability to perform well on new, unseen data. This structured approach is fundamental to developing effective machine learning models.