In the previous section, we saw how K-Fold cross-validation helps provide a more robust estimate of model performance than a single train-test split. However, standard K-Fold distributes data into folds randomly. While this works well for many scenarios, it can sometimes lead to problematic splits, especially in classification tasks with imbalanced datasets.
Imagine a binary classification problem where 90% of your samples belong to Class A and only 10% belong to Class B. If you use standard K-Fold with, say, 10 folds, it's possible purely by chance that some folds might contain very few, or even zero, instances of the minority class (Class B). Training a model on a fold lacking Class B examples means the model won't learn to identify that class. Similarly, evaluating on a test fold that happens to have a disproportionately high or low number of Class B samples can give an unreliable performance score. This variance in class distribution across folds can lead to unstable and misleading evaluation results.
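To make this concrete, here is a minimal sketch (assuming scikit-learn is installed; the labels y_demo are hypothetical, with 90 Class A and 10 Class B samples) that prints the per-fold class counts produced by plain K-Fold. Depending on the seed, some test folds can end up with very few minority samples:
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical imbalanced labels: 90 samples of class 0, 10 of class 1
y_demo = np.array([0] * 90 + [1] * 10)
X_demo = np.zeros((100, 1))  # placeholder features; the split ignores their values

kf = KFold(n_splits=10, shuffle=True, random_state=0)
for fold, (_, test_index) in enumerate(kf.split(X_demo, y_demo), start=1):
    # minlength=2 keeps a zero visible when a fold contains no class-1 samples
    print(f"Fold {fold} test class counts: {np.bincount(y_demo[test_index], minlength=2)}")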
To address this, we use Stratified K-Fold. Stratification ensures that each fold's class distribution is approximately the same as that of the entire dataset. If your dataset has 90% Class A and 10% Class B, Stratified K-Fold will ensure that each of the k folds created maintains this 90/10 split (or as close as possible given the number of samples).
Instead of purely random assignment, Stratified K-Fold first groups the data by class label. Then, it samples instances from each class proportionally to create the folds. This guarantees that the class representation is preserved in both the training and validation subsets generated for each iteration of the cross-validation process.
Consider a small, imbalanced dataset of 10 samples (9 Class A, 1 Class B) split into 2 folds. Standard K-Fold might, purely by chance, place the single Class B sample entirely in Split 2, leaving Split 1 with no minority examples at all. Stratified K-Fold ensures each split receives a proportional number of samples from each class (as close as possible); the sketch below illustrates the difference.
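The following hedged sketch makes this contrast runnable; note it scales the example to two minority samples (8 Class A, 2 Class B) so that a 2-fold stratified split can place one minority sample in each fold:
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# 8 Class A (label 0) and 2 Class B (label 1) samples
y_small = np.array([0] * 8 + [1] * 2)
X_small = np.arange(10).reshape(-1, 1)

splitters = [("KFold", KFold(n_splits=2, shuffle=True, random_state=1)),
             ("StratifiedKFold", StratifiedKFold(n_splits=2, shuffle=True, random_state=1))]
for name, cv in splitters:
    print(name)
    for _, test_index in cv.split(X_small, y_small):
        # Plain K-Fold may place both minority samples in one fold;
        # the stratified splitter always places one in each.
        print(f"  test labels: {y_small[test_index]}")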
Scikit-learn makes using Stratified K-Fold straightforward. You can use the StratifiedKFold cross-validator explicitly or rely on functions like cross_val_score, which often apply stratification automatically for classification tasks. Let's see how to use StratifiedKFold explicitly and then integrate it with cross_val_score.
First, we need some example data and a classifier:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
# Create a synthetic imbalanced dataset
X, y = make_classification(n_samples=100, n_features=20, n_informative=2,
                           n_redundant=10, n_clusters_per_class=1,
                           weights=[0.9, 0.1], flip_y=0, random_state=42)
print(f"Dataset shape: {X.shape}")
print(f"Class distribution: {np.bincount(y)}")
# Define the classifier
classifier = LogisticRegression(solver='liblinear', random_state=42)
Now, let's explicitly create a StratifiedKFold object and see how it generates indices:
# Initialize Stratified K-Fold
n_splits = 5
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
# Iterate through the splits and show indices (optional: just for illustration)
print(f"\nGenerating {n_splits} stratified folds:")
fold_num = 1
for train_index, test_index in skf.split(X, y):
    print(f"Fold {fold_num}:")
    # print(f"  Train indices: {train_index[:10]}...")  # Optional: view indices
    # print(f"  Test indices: {test_index}")  # Optional: view indices
    print(f"  Train size: {len(train_index)}, Test size: {len(test_index)}")
    # Verify the class distribution in this fold's test set
    y_test_fold = y[test_index]
    print(f"  Test fold class distribution: {np.bincount(y_test_fold)}")
    fold_num += 1
You'll notice that each test fold maintains approximately the same ratio of classes as the original dataset (around 90% and 10%).
The most common way to use Stratified K-Fold is by passing it (or simply the number of splits, letting Scikit-learn choose stratification) to cross_val_score:
# Method 1: Pass the StratifiedKFold object directly
scores_explicit_skf = cross_val_score(classifier, X, y, cv=skf, scoring='accuracy')
print(f"\nAccuracy scores using explicit StratifiedKFold ({n_splits}-fold): {scores_explicit_skf}")
print(f"Mean Accuracy: {scores_explicit_skf.mean():.4f} (+/- {scores_explicit_skf.std() * 2:.4f})")
# Method 2: Pass an integer - for classifiers, cross_val_score defaults to
# StratifiedKFold when cv is an integer.
# Note: this default splitter does not shuffle, so its folds (and scores) can
# differ slightly from the explicit, shuffled StratifiedKFold object above.
scores_implicit_skf = cross_val_score(classifier, X, y, cv=n_splits, scoring='accuracy')
print(f"\nAccuracy scores using implicit stratification ({n_splits}-fold): {scores_implicit_skf}")
print(f"Mean Accuracy: {scores_implicit_skf.mean():.4f} (+/- {scores_implicit_skf.std() * 2:.4f})")
Both methods achieve the same goal: evaluating the classifier using cross-validation where each fold preserves the overall class balance. (The exact scores can differ slightly because the explicit splitter shuffles the data before splitting, while the integer default does not.)
When to Use Stratified K-Fold:
For regression problems, standard K-Fold is typically sufficient, as the concept of "class balance" doesn't directly apply. However, for classification, using Stratified K-Fold provides a more reliable and stable estimate of your model's performance on unseen data, helping you make better decisions during model selection and hyperparameter tuning.
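As one last hedged sketch, the same stratified splitter can be reused for hyperparameter tuning (here with a hypothetical grid over LogisticRegression's C parameter, reusing classifier, skf, X, and y from above), so every candidate is scored on class-balanced folds:
from sklearn.model_selection import GridSearchCV

# Hypothetical parameter grid, purely for illustration
param_grid = {'C': [0.01, 0.1, 1.0, 10.0]}
grid_search = GridSearchCV(classifier, param_grid, cv=skf, scoring='accuracy')
grid_search.fit(X, y)
print(f"Best C: {grid_search.best_params_['C']}")
print(f"Best stratified CV accuracy: {grid_search.best_score_:.4f}")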