As we saw with learning curves, simply minimizing the loss on the training data doesn't guarantee good performance on unseen data. We need reliable ways to estimate how well our model will generalize before deploying it. This requires carefully partitioning our data to simulate the scenario of encountering new examples.
The most common strategy in deep learning, especially with large datasets, is to split the available data into three distinct sets:

- Training set: the data the model actually learns from; it is used to fit the model's parameters.
- Validation set: held out from training and used to tune hyperparameters, compare models, and monitor for overfitting during development.
- Test set: touched only once, at the very end, to obtain an unbiased estimate of how the final model will perform on new data.
A typical split might be 70% for training, 15% for validation, and 15% for testing, but these percentages can vary significantly depending on the total dataset size. For very large datasets (millions of examples), the validation and test sets might be much smaller percentages (e.g., 1% each) while still being large enough in absolute terms to provide reliable estimates.
The main advantage of the hold-out method is its simplicity and computational efficiency. However, the performance estimate obtained on the validation set can be sensitive to the specific random split, especially if the dataset isn't massive. A "lucky" or "unlucky" split might give a misleading impression of the model's true capabilities.
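To see this effect concretely, here is a small illustrative sketch (not part of the pipeline built later in this section): a simple scikit-learn classifier is evaluated on several different random hold-out splits of the same synthetic dataset, and the validation accuracy shifts noticeably from split to split. LogisticRegression and make_classification are stand-ins for whatever model and data you are actually working with.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Small synthetic dataset so the split-to-split variation is easy to see
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

scores = []
for seed in range(5):
    # Each seed produces a different random 85/15 split of the same data
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=0.15, random_state=seed, stratify=y
    )
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    scores.append(model.score(X_val, y_val))

print([round(s, 3) for s in scores])                 # validation accuracy per split
print(f"spread: {max(scores) - min(scores):.3f}")    # how much the estimate moves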
When data is scarce, or when you need a more robust estimate of generalization performance, k-Fold Cross-Validation is a valuable technique. Instead of a single validation split, k-fold CV uses multiple splits to reduce the variance associated with the hold-out method.
Here's how it works:

1. Shuffle the training data and divide it into k folds of roughly equal size; k = 5 or k = 10 are common choices.
2. For each of the k iterations, hold out one fold as the validation set and train the model from scratch on the remaining k-1 folds.
3. Evaluate the trained model on the held-out fold and record the metric(s) of interest.
4. Average the k recorded values to obtain the final performance estimate.
A conceptual view of k-Fold Cross-Validation. The training data is split into k folds. In each iteration, one fold serves as the validation set (red) while the others are used for training (blue). Performance metrics from each iteration are averaged.
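The loop below is a minimal sketch of this procedure using scikit-learn's KFold; LogisticRegression on synthetic data is just a placeholder for whatever model and dataset you are working with. In a deep learning setting, the fit/score calls would be replaced by your own training and evaluation routine.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []
for train_idx, val_idx in kf.split(X):
    # Train on k-1 folds, validate on the held-out fold
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[val_idx], y[val_idx]))

# The final estimate is the average (and spread) across the k folds
print(f"Mean accuracy over {kf.get_n_splits()} folds: "
      f"{np.mean(fold_scores):.3f} (+/- {np.std(fold_scores):.3f})")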
Advantages of k-Fold CV:

- More reliable estimates: averaging over k different splits reduces the variance that comes from any single "lucky" or "unlucky" split.
- Better use of limited data: every example is used for validation exactly once and for training in the other k-1 iterations.
Disadvantages of k-Fold CV:

- Computational cost: the model must be trained k times, which is often prohibitive for large deep learning models.
- Added complexity: you end up with k trained models and k sets of metrics to manage rather than a single run.
Stratified k-Fold: When dealing with classification problems, especially with imbalanced classes, it's important that each fold retains approximately the same percentage of samples for each class as the complete set. Stratified k-Fold Cross-Validation ensures this stratification, leading to more reliable estimates.
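A quick way to see what stratification buys you is to check the class ratio inside each validation fold. The sketch below uses an artificial 90/10 class imbalance (an illustrative assumption); with StratifiedKFold, every fold mirrors that overall ratio.

import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([0] * 900 + [1] * 100)   # imbalanced labels: 90% class 0, 10% class 1
X = np.random.rand(len(y), 5)         # dummy features

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for i, (_, val_idx) in enumerate(skf.split(X, y), start=1):
    minority_fraction = y[val_idx].mean()   # fraction of class 1 in this validation fold
    print(f"Fold {i}: class-1 fraction in validation fold = {minority_fraction:.2f}")
# Each fold reports ~0.10, matching the class balance of the full dataset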
While k-Fold CV is less frequently used for final model training in deep learning due to the cost, it's a very useful technique for:

- Getting a trustworthy performance estimate when the dataset is small and a single hold-out split would be noisy.
- Comparing model architectures or hyperparameter settings on a more even footing.
- Smaller-scale experiments where training the model k times is still affordable.
Whether using a simple hold-out split or preparing for cross-validation, careful implementation is needed. Libraries like Scikit-learn provide convenient functions.
import torch
from sklearn.model_selection import train_test_split, StratifiedKFold
import numpy as np
# Assume X_data contains your features and y_data contains your labels (as NumPy arrays or Tensors)
# Example dummy data
X_data = np.random.rand(1000, 20) # 1000 samples, 20 features
y_data = np.random.randint(0, 2, 1000) # 1000 binary labels
# --- Hold-Out Split ---
# First, split into training+validation and test sets
X_train_val, X_test, y_train_val, y_test = train_test_split(
    X_data, y_data, test_size=0.15, random_state=42, stratify=y_data  # Use stratify for classification
)
# Then, split training+validation into actual training and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=0.1765,  # Approximately 15% of original data (0.15 / (1 - 0.15))
    random_state=42, stratify=y_train_val  # Maintain stratification
)
print(f"Hold-Out Sizes: Train={len(X_train)}, Validation={len(X_val)}, Test={len(X_test)}")
# Output: Hold-Out Sizes: Train=700, Validation=150, Test=150 (approx ratios)
# --- k-Fold Cross-Validation Setup (using StratifiedKFold) ---
n_splits = 5
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
print(f"\n{n_splits}-Fold Cross-Validation Indices (using original Training+Validation data):")
# Note: We use X_train_val here, as the test set is always kept separate.
fold_num = 1
for train_index, val_index in skf.split(X_train_val, y_train_val):
    print(f"Fold {fold_num}:")
    print(f"  Train samples: {len(train_index)}, Validation samples: {len(val_index)}")
    # In a real CV loop, you would select the data using these indices:
    # X_train_fold, X_val_fold = X_train_val[train_index], X_train_val[val_index]
    # y_train_fold, y_val_fold = y_train_val[train_index], y_train_val[val_index]
    # ... then train and evaluate the model on these fold-specific datasets
    fold_num += 1
# Convert to PyTorch Tensors if needed for training
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.long) # Assuming classification labels
X_val_tensor = torch.tensor(X_val, dtype=torch.float32)
y_val_tensor = torch.tensor(y_val, dtype=torch.long)
X_test_tensor = torch.tensor(X_test, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test, dtype=torch.long)
Using Scikit-learn to create hold-out splits and generate indices for k-Fold Cross-Validation. random_state ensures reproducibility, and stratify maintains class proportions.
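To actually feed one of these splits into a PyTorch training loop, the tensors created above can be wrapped in a TensorDataset and iterated in mini-batches with a DataLoader. The sketch below assumes the X_train_tensor/y_train_tensor and X_val_tensor/y_val_tensor variables from the code above; the batch size of 64 is an arbitrary choice for illustration.

from torch.utils.data import TensorDataset, DataLoader

# Wrap the feature/label tensors so they can be indexed together
train_ds = TensorDataset(X_train_tensor, y_train_tensor)
val_ds = TensorDataset(X_val_tensor, y_val_tensor)

# Shuffle training batches each epoch; keep validation order fixed
train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)
val_loader = DataLoader(val_ds, batch_size=64, shuffle=False)

for xb, yb in train_loader:
    # xb has shape [batch_size, 20], yb has shape [batch_size]
    pass  # forward pass, loss computation, and backward pass would go here

The same pattern applies inside a k-fold loop: index X_train_val and y_train_val with the fold indices, convert the selected arrays to tensors, and build fresh loaders for each fold.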
Remember: if your data has any inherent ordering (for example, samples sorted by class or collected over time), create the splitter with shuffle=True and a fixed random_state so each fold receives a representative mix.

Choosing the right validation strategy is fundamental. It allows us to reliably estimate generalization performance, diagnose problems like overfitting, and make informed decisions during model development, ultimately leading to models that perform well on the data they will encounter in the real world.