Splitting data into separate sets for training and testing is a fundamental practice in machine learning. This process involves dividing a dataset into two primary portions: a training set and a testing set. A machine learning model is developed and trained using the training set. Subsequently, its performance is evaluated on the testing set, which comprises data the model has not encountered during its training phase. This approach is essential for understanding how well the model generalizes to new, unseen examples, rather than just memorizing the patterns present in the training data.
Imagine studying for an exam. You use practice problems (training data) to learn the material. The final exam (testing data) contains questions you haven't seen before. Your score on the final exam tells you how well you truly understood the concepts, not just how well you memorized the practice questions. Similarly, the test set provides an unbiased estimate of your model's performance on future data.
Performing this split is a standard step after initial data cleaning (like handling missing values) but before you start training your model.
While you could manually slice your data, it's much more common and reliable to use functions provided by machine learning libraries. In Python, the scikit-learn library offers a convenient function called train_test_split for this exact purpose.
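To see why the library function is preferred, here is a hedged sketch of what a manual split might look like with NumPy arrays (the data here is synthetic and purely illustrative). Notice how much bookkeeping is needed just to shuffle, slice, and keep features and labels aligned:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data: 10 samples, 2 features, binary labels
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Shuffle indices first, so the split is not biased by row order
indices = rng.permutation(len(X))
split_point = int(len(X) * 0.8)  # 80% train, 20% test
train_idx, test_idx = indices[:split_point], indices[split_point:]

# Index features and labels with the SAME indices to keep them aligned
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```

train_test_split handles all of this (shuffling, slicing, and keeping X and y aligned) in a single call, which is why it is the standard choice.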
Let's assume you have your features stored in a variable X (perhaps a NumPy array or Pandas DataFrame) and your corresponding labels or target values in a variable y. Here’s how you'd typically use train_test_split:
# First, make sure you have scikit-learn installed
# pip install scikit-learn
# Import the function
from sklearn.model_selection import train_test_split
# Assume X contains your features and y contains your labels
# For example:
# X = dataframe[['feature1', 'feature2']]
# y = dataframe['target_label']
# Split the data: 80% for training, 20% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Now you have:
# X_train: Features for the training set
# X_test: Features for the testing set
# y_train: Labels for the training set
# y_test: Labels for the testing set
Let's break down the important parameters:
- X, y: These are your input features and target labels. The function splits both together, ensuring the correspondence between features and labels is maintained in both the training and testing sets.
- test_size: This parameter determines the proportion of the dataset allocated to the test set. A value of 0.2 means 20% of the data will be used for testing, and the remaining 80% for training. You could alternatively specify train_size.
- random_state: This is significant for reproducibility. Machine learning often involves randomness (like shuffling data before splitting). Setting random_state to a specific integer (like 42, a popular arbitrary choice) guarantees that you get the exact same split every time you run the code. This is helpful for debugging, comparing models, and sharing results. If you omit random_state, the split will be different each time.

[Figure: a view of splitting the original dataset (features X and labels y) into corresponding training and testing sets using a function like train_test_split.]
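You can confirm both the 80/20 proportions and the reproducibility guarantee directly. The sketch below uses synthetic data (the arrays and variable names are illustrative, not from a real dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 2 features
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 80% of 100 samples go to training, 20% to testing
print(X_train.shape, X_test.shape)  # (80, 2) (20, 2)

# Re-running with the same random_state yields the identical split
X_train2, X_test2, _, _ = train_test_split(
    X, y, test_size=0.2, random_state=42
)
assert (X_train == X_train2).all()
```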
Consider a classification problem where you're predicting whether an email is spam or not spam. If only 5% of your emails are spam (an imbalanced dataset), a purely random split might accidentally put almost all spam emails in the training set and very few in the test set, or vice versa. This would give a misleading picture of how well the model performs on the rare class.
To handle this, we use stratified splitting. Stratification ensures that the proportion of each class in the original dataset is preserved in both the training and testing sets. In scikit-learn, you achieve this by adding the stratify parameter to train_test_split and setting it to your labels y:
# For classification, especially with imbalanced classes:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
By adding stratify=y, the function will ensure that if 5% of your original y labels were 'spam', then approximately 5% of y_train and 5% of y_test will also be 'spam'. This leads to more reliable evaluation, especially when dealing with datasets where class frequencies differ significantly.
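The spam example above can be verified numerically. This sketch builds a synthetic label array with 5% positives (standing in for "spam") and checks that stratification preserves that proportion in both splits:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 5 "spam" (1) out of 100 emails
y = np.array([1] * 5 + [0] * 95)
X = np.arange(100).reshape(100, 1)  # placeholder features

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The 5% spam rate is preserved in both subsets
print(y_train.mean(), y_test.mean())  # 0.05 0.05
```

Without stratify=y, a small test set like this could easily end up with zero spam examples, making metrics on the rare class meaningless.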
While the train-test split is fundamental, sometimes you need another split called the validation set. This set is used during the model development process, specifically for tuning model settings (hyperparameters) before using the final test set. We won't implement this here, but it's useful to know that more complex workflows might involve splitting data into three parts: train, validate, and test. We will revisit this idea when we discuss model tuning in later chapters.
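As a preview of that three-way workflow (a sketch only; we won't rely on it until later chapters), one common approach is to call train_test_split twice: first to carve off the final test set, then to split the remainder into training and validation sets. The proportions below are illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 2 features
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# Step 1: hold out the final test set (20% of all data)
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 2: split the remaining 80% into train and validation.
# 0.25 of the remainder = 0.25 * 0.8 = 20% of the original data.
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```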
For now, mastering the train-test split is a necessary step. It ensures that when you evaluate your model after training, you are getting a fair assessment of its ability to handle new data, which is the ultimate goal of building predictive models.