While splitting data into a single training set and a testing set provides a basic check against overfitting, the resulting performance estimate can be sensitive to which data points end up in the training versus the test set. A particularly "lucky" or "unlucky" split might give a misleadingly optimistic or pessimistic view of the model's true generalization ability. Cross-validation offers a more robust approach by systematically creating multiple train-test splits and averaging the results. K-Fold cross-validation is one of the most common methods.
In K-Fold cross-validation, the original dataset is partitioned into K roughly equal-sized, non-overlapping subsets called "folds". The process then iterates K times. In each iteration i (from 1 to K):
- Fold i is held out and used as the validation set.
- The remaining K-1 folds are combined and used as the training set.
- The model is trained on this training set and evaluated on the held-out fold, producing the i-th evaluation score.
After completing all K iterations, we have K evaluation scores. The final cross-validation score is typically reported as the average of these K scores. This average provides a more stable and reliable estimate of the model's performance on unseen data compared to a single train-test split because every data point gets to be in a validation set exactly once and in a training set K−1 times. Common choices for K are 5 or 10.
Here's a conceptual diagram illustrating 5-Fold Cross-Validation:
Flow of 5-Fold Cross-Validation. The data is split into 5 folds. In each iteration, one fold serves as the validation set, and the others form the training set. Scores from each iteration are averaged.
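To make the fold structure concrete, here is a minimal sketch using a toy array of 10 samples (invented purely for illustration) that prints which indices land in the validation and training sets for each fold:
import numpy as np
from sklearn.model_selection import KFold

# Toy "dataset" of 10 samples, only to show how the indices are partitioned
X_toy = np.arange(10).reshape(-1, 1)

kf_demo = KFold(n_splits=5, shuffle=False)
for fold, (train_idx, val_idx) in enumerate(kf_demo.split(X_toy)):
    print(f"Fold {fold + 1}: validation indices {val_idx}, training indices {train_idx}")

# With shuffle=False the folds are consecutive blocks:
# Fold 1: validation indices [0 1], training indices [2 3 4 5 6 7 8 9]
# Fold 2: validation indices [2 3], training indices [0 1 4 5 6 7 8 9]
# ... and so on, so every sample is used for validation exactly once.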
Scikit-learn provides tools to easily implement K-Fold cross-validation. Let's first see how to use the KFold class from sklearn.model_selection to generate the indices for the splits, and then we'll look at a more convenient helper function.
Assume you have your features X (e.g., a NumPy array or Pandas DataFrame) and target variable y (a NumPy array or Pandas Series).
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression # Example model
from sklearn.metrics import accuracy_score # Example metric
from sklearn.datasets import load_iris # Example dataset
# Load sample data
iris = load_iris()
X, y = iris.data, iris.target
# 1. Initialize KFold
# We choose K=5 folds.
# shuffle=True is recommended to randomize the data order before splitting.
# random_state ensures reproducibility of the shuffle.
k = 5
kf = KFold(n_splits=k, shuffle=True, random_state=42)
fold_accuracies = []
model = LogisticRegression(max_iter=1000) # Instantiate the model once
# 2. Loop through the splits generated by KFold
print(f"Running {k}-Fold Cross-Validation...")
for fold, (train_index, val_index) in enumerate(kf.split(X)):
    # 3. Get training and validation sets for this fold
    X_train, X_val = X[train_index], X[val_index]
    y_train, y_val = y[train_index], y[val_index]
    # 4. Train the model on the training data for this fold
    model.fit(X_train, y_train)
    # 5. Evaluate on the validation data for this fold
    y_pred = model.predict(X_val)
    accuracy = accuracy_score(y_val, y_pred)
    fold_accuracies.append(accuracy)
    print(f" Fold {fold+1}: Validation Accuracy = {accuracy:.4f}")
# 6. Calculate average performance and standard deviation
mean_accuracy = np.mean(fold_accuracies)
std_accuracy = np.std(fold_accuracies)
print(f"\nCross-Validation Results ({k}-Fold):")
print(f" Individual Fold Accuracies: {[f'{acc:.4f}' for acc in fold_accuracies]}")
print(f" Average Validation Accuracy: {mean_accuracy:.4f}")
print(f" Standard Deviation of Accuracy: {std_accuracy:.4f}")
# Expected Output:
# Running 5-Fold Cross-Validation...
# Fold 1: Validation Accuracy = 1.0000
# Fold 2: Validation Accuracy = 0.9667
# Fold 3: Validation Accuracy = 0.9333
# Fold 4: Validation Accuracy = 0.9333
# Fold 5: Validation Accuracy = 0.9667
#
# Cross-Validation Results (5-Fold):
# Individual Fold Accuracies: ['1.0000', '0.9667', '0.9333', '0.9333', '0.9667']
# Average Validation Accuracy: 0.9600
# Standard Deviation of Accuracy: 0.0249
In this code:
1. We initialize KFold with n_splits=5, shuffle=True, and random_state=42. Shuffling is generally a good idea unless your data has an inherent sequential order that you need to preserve, and random_state makes the shuffle predictable, which is important for getting reproducible results.
2. The kf.split(X) method generates pairs of index arrays (train_index, val_index), one pair for each fold.
3. We use these index arrays to slice X and y into the appropriate training and validation sets for the current fold.
4. We train the LogisticRegression model on the fold's training data and evaluate its accuracy_score on the validation set.

cross_val_score for Convenience
Manually looping through folds is instructive, but Scikit-learn provides the cross_val_score function, which performs the entire K-Fold cross-validation process (splitting, training, evaluating) much more concisely.
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
# Load sample data
iris = load_iris()
X, y = iris.data, iris.target
# Instantiate the model
model = LogisticRegression(max_iter=1000)
# Define the number of folds (K)
k = 5
# Use cross_val_score
# cv can be an integer (for KFold or StratifiedKFold) or a CV splitter object
# scoring specifies the metric to use (e.g., 'accuracy', 'neg_mean_squared_error')
scores = cross_val_score(model, X, y, cv=k, scoring='accuracy')
print(f"Running {k}-Fold Cross-Validation using cross_val_score...")
print(f" Scores for each fold: {scores}")
print(f" Average Accuracy: {scores.mean():.4f}")
print(f" Standard Deviation: {scores.std():.4f}")
# Expected Output (may differ slightly from manual loop if shuffle isn't explicitly matched):
# Running 5-Fold Cross-Validation using cross_val_score...
# Scores for each fold: [0.96666667 1. 0.93333333 0.96666667 1. ]
# Average Accuracy: 0.9733
# Standard Deviation: 0.0249
(Note: If cv is an integer and the estimator is a classifier, cross_val_score uses StratifiedKFold by default, which we'll discuss next; for regression estimators it uses KFold. Neither shuffles the data by default. The slight difference in average accuracy compared to the manual example comes from this different splitting strategy. To exactly replicate the manual KFold behavior with shuffling, you can pass the kf object directly as the cv argument, as in the short sketch below.)
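For example, a short sketch that reuses the same shuffled splitter as the manual loop might look like this:
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# The same shuffled splitter used in the manual loop above
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Passing the splitter object instead of an integer makes cross_val_score
# use exactly these splits, so the per-fold accuracies should match the manual loop
scores = cross_val_score(model, X, y, cv=kf, scoring='accuracy')
print(f" Scores with cv=kf: {scores}")
print(f" Average Accuracy with cv=kf: {scores.mean():.4f}")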
The cross_val_score function takes the estimator (your model instance), the features X, the target y, the cross-validation strategy cv (here just the integer 5, meaning use 5 folds), and a scoring metric string. It returns an array containing the score for each fold. This is significantly more compact than the manual loop.
Important Note on Scoring: Scikit-learn's scoring functions follow a convention where higher values are better. For metrics like Mean Squared Error (MSE) or Mean Absolute Error (MAE), where lower values are better, Scikit-learn provides negated versions such as 'neg_mean_squared_error' and 'neg_mean_absolute_error'. When using cross_val_score with these, you'll get negative scores, and a "better" result is the one closer to zero (i.e., the negative number with the smaller magnitude).
By using K-Fold cross-validation, either manually or with cross_val_score, you obtain a more trustworthy estimate of how your model is likely to perform on new, unseen data compared to a single train-test split. This is a fundamental technique for reliable model evaluation and comparison.