In Chapter 2, we discussed the conceptual importance of splitting your data into separate sets for training and testing. Now, as we focus on the practical steps of preparing data, let's revisit how to actually perform this essential split. Remember, the goal is to train our machine learning model on one portion of the data (the training set) and then evaluate its performance on a completely separate portion that it hasn't seen before (the testing set). This helps us understand how well the model generalizes to new, unseen examples, rather than just memorizing the training data.
Imagine studying for an exam. You use practice problems (training data) to learn the material. The final exam (testing data) contains questions you haven't seen before. Your score on the final exam tells you how well you truly understood the concepts, not just how well you memorized the practice questions. Similarly, the test set provides an unbiased estimate of your model's performance on future, real-world data.
Performing this split is a standard step after initial data cleaning (like handling missing values) but before you start training your model.
While you could manually slice your data, it's much more common and reliable to use functions provided by machine learning libraries. In Python, the scikit-learn library offers a convenient function called train_test_split for this exact purpose.
Let's assume you have your features stored in a variable X (perhaps a NumPy array or a Pandas DataFrame) and your corresponding labels or target values in a variable y. Here's how you'd typically use train_test_split:
# First, make sure you have scikit-learn installed
# pip install scikit-learn
# Import the function
from sklearn.model_selection import train_test_split
# Assume X contains your features and y contains your labels
# For example:
# X = dataframe[['feature1', 'feature2']]
# y = dataframe['target_label']
# Split the data: 80% for training, 20% for testing
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Now you have:
# X_train: Features for the training set
# X_test: Features for the testing set
# y_train: Labels for the training set
# y_test: Labels for the testing set
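If you want a quick sanity check, you can verify the sizes of the resulting pieces. Below is a minimal, self-contained sketch; the synthetic X and y arrays are illustrative assumptions, not part of any real dataset:

import numpy as np
from sklearn.model_selection import train_test_split

# A toy dataset: 100 samples with 2 features each, plus binary labels
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# An 80/20 split of 100 samples yields 80 training and 20 test samples
print(X_train.shape, X_test.shape)   # (80, 2) (20, 2)
print(y_train.shape, y_test.shape)   # (80,) (20,)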
Let's break down the important parameters:

- X, y: These are your input features and target labels. The function splits both, ensuring the correspondence between features and labels is maintained in both the training and testing sets.
- test_size: This parameter determines the proportion of the dataset allocated to the test set. A value of 0.2 means 20% of the data will be used for testing, and the remaining 80% for training. You could alternatively specify train_size.
- random_state: This is significant for reproducibility. Machine learning often involves randomness (like shuffling data before splitting). Setting random_state to a specific integer (like 42, a popular arbitrary choice) guarantees that you get the exact same split every time you run the code. This is helpful for debugging, comparing models, and sharing results. If you omit random_state, the split will be different each time.

A conceptual view of splitting the original dataset (features X and labels y) into corresponding training and testing sets using a function like train_test_split.
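To see reproducibility in action, the short sketch below (reusing toy arrays like the ones above, purely for illustration) runs the split twice with the same seed and confirms the two results match:

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Two splits with the same random_state produce identical results
X_tr1, X_te1, y_tr1, y_te1 = train_test_split(X, y, test_size=0.2, random_state=42)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X, y, test_size=0.2, random_state=42)
print(np.array_equal(X_te1, X_te2))  # True

If you dropped random_state from these calls, each run would shuffle the data differently, and np.array_equal would usually return False.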
Consider a classification problem where you're predicting whether an email is spam or not spam. If only 5% of your emails are spam (an imbalanced dataset), a purely random split might accidentally put almost all spam emails in the training set and very few in the test set, or vice versa. This would give a misleading picture of how well the model performs on the rare class.
To handle this, we use stratified splitting. Stratification ensures that the proportion of each class in the original dataset is preserved in both the training and testing sets. In scikit-learn, you achieve this by adding the stratify parameter to train_test_split and setting it to your labels y:
# For classification, especially with imbalanced classes:
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
By adding stratify=y, the function will ensure that if 5% of your original y labels were 'spam', then approximately 5% of y_train and 5% of y_test will also be 'spam'. This leads to more reliable evaluation, especially when dealing with datasets where class frequencies differ significantly.
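You can confirm this behavior directly. In the sketch below, the labels are synthetic (an assumption for illustration): 1000 samples where only 5% belong to the rare class. Comparing the class proportions after the split shows stratification at work:

import numpy as np
from sklearn.model_selection import train_test_split

# 1000 samples, 5% of which carry the rare label 1
X = np.arange(2000).reshape(1000, 2)
y = np.array([0] * 950 + [1] * 50)

# Stratified split: both pieces keep roughly the original 5% rate
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(y_tr.mean(), y_te.mean())  # approximately 0.05 and 0.05

Since the labels are 0s and 1s, the mean of each label array is simply the fraction of rare-class examples it contains.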
While the train-test split is fundamental, sometimes you need another split called the validation set. This set is used during the model development process, specifically for tuning model settings (hyperparameters) before using the final test set. We won't implement this here, but it's useful to know that more complex workflows might involve splitting data into three parts: train, validate, and test. We will revisit this idea when we discuss model tuning in later chapters.
For now, mastering the train-test split is a necessary step. It ensures that when you evaluate your model after training, you are getting a fair assessment of its ability to handle new data, which is the ultimate goal of building predictive models.