As highlighted in the chapter introduction, evaluating a model solely on the data it was trained on provides an overly optimistic and often misleading picture of its capabilities. A model might achieve high accuracy on the training set simply by memorizing the data points, including noise and specific quirks, rather than learning the underlying general patterns. This phenomenon, known as overfitting, results in poor performance when the model encounters new, unseen data.
To get a more realistic estimate of how our model will perform in a real-world scenario, we need to evaluate it on data it hasn't seen during training. The most straightforward way to achieve this is by partitioning our dataset into two separate subsets: a training set, which the model learns from, and a test set, which is held back and used only for evaluation.
By training on one subset and testing on the other, we simulate deploying the model and assess its generalization ability: how well it performs on data not used during its development.
Scikit-learn provides a convenient utility function, train_test_split, in the model_selection module to perform this partitioning. It handles shuffling the data (important if the data has some inherent order) and splitting it into the desired proportions.
Let's see how to use it. Assume you have your features stored in a NumPy array or Pandas DataFrame called X and your corresponding target variable (labels or values) in a NumPy array or Pandas Series called y.
# Import the function
from sklearn.model_selection import train_test_split
import numpy as np

# Example data standing in for your own X and y
X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
y = np.arange(10)                 # 10 target values

# Perform the split, allocating 25% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Check the shapes of the resulting arrays
print(f"X_train shape: {X_train.shape}")  # (7, 2)
print(f"X_test shape: {X_test.shape}")    # (3, 2)
print(f"y_train shape: {y_train.shape}")  # (7,)
print(f"y_test shape: {y_test.shape}")    # (3,)
The most important arguments to train_test_split are:

- X, y: These are the arrays or DataFrames containing your features and target variable, respectively. You can also pass multiple arrays if needed (e.g., if you have different feature sets).
- test_size: This parameter determines the proportion of the dataset to include in the test split. It's typically a float between 0.0 and 1.0. For example, test_size=0.25 means 25% of the data will be used for testing and the remaining 75% for training. Alternatively, you can use train_size to specify the proportion for the training set; if only one is specified, the other is inferred. You can also pass an integer representing the absolute number of test samples.
- random_state: This is an important parameter for reproducibility. The function shuffles the data before splitting by default, and setting random_state to an integer (e.g., 42, 0, 123) ensures that the same random split is generated every time you run the code. This is essential for debugging, comparing different models fairly, and ensuring others can reproduce your results. If you omit random_state or set it to None, you'll get a different split each time.
- shuffle: A boolean parameter (defaulting to True) indicating whether to shuffle the data before splitting. Shuffling is generally recommended to ensure that the training and test sets are representative samples, especially if the original dataset is ordered in some way (e.g., sorted by time or class).
- stratify: This parameter is particularly useful for classification tasks, especially when dealing with imbalanced datasets (where some classes are much less frequent than others). By setting stratify=y, the function ensures that the proportion of values in the target variable y is preserved in both the training and testing sets. For example, if your target variable has 80% Class A and 20% Class B, stratification ensures both y_train and y_test maintain this approximate 80/20 split. Without stratification, a random split might, by chance, put almost all samples of a rare class into either the training or the testing set, leading to biased training or evaluation. A short example of stratification in action follows this list.
Once you have split your data, the typical machine learning workflow proceeds as follows:
1. Instantiate a model: choose an appropriate estimator class from Scikit-learn (e.g., LinearRegression, LogisticRegression, KNeighborsClassifier) and create an instance of it.
# Choose and instantiate the model
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
2. Train the model: call its fit method using only the training data (X_train and y_train). The model should never see the test data during this phase.
# Fit the model ONLY on the training data
model.fit(X_train, y_train)
3. Make predictions: use the trained model to generate predictions for the held-out test features (X_test).
# Make predictions ONLY on the test data
y_pred = model.predict(X_test)
4. Evaluate performance: compare the model's predictions (y_pred) with the actual target values from the test set (y_test) using appropriate evaluation metrics (e.g., accuracy, precision, and recall for classification; MAE, MSE, and R² for regression).
from sklearn.metrics import accuracy_score # Example for classification
# Evaluate the model's performance on the test set
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy on Test Set: {accuracy:.4f}")
This evaluation on the test set gives a much more reliable indication of how the model is likely to perform on new, unseen data than an evaluation on the training data does.
While the train-test split is a fundamental technique for model evaluation, it has a limitation: the performance estimate depends heavily on which specific data points happen to end up in the training versus the testing set. If you get a "lucky" or "unlucky" split, your evaluation metric might be overly optimistic or pessimistic. Furthermore, by holding out a portion of the data for testing, you are reducing the amount of data available for training the model.
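To see this variability concretely, the short sketch below repeats the split with several random_state values on scikit-learn's built-in Iris dataset and prints the resulting test accuracies. The dataset and model here are illustrative choices, not part of the workflow above; the point is simply that the score changes from split to split.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Repeat the split/train/evaluate cycle with different random seeds
for seed in [0, 1, 2, 3, 4]:
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y
    )
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    acc = accuracy_score(y_te, model.predict(X_te))
    print(f"random_state={seed}: test accuracy = {acc:.3f}")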
To obtain a more robust and reliable performance estimate and make better use of your available data, we often turn to cross-validation techniques, which we will explore in the next section. Cross-validation involves multiple train-test splits, providing a more averaged and stable measure of model generalization.