Let's put the concepts of model evaluation and selection into practice. In this section, we'll use Scikit-learn to perform train-test splits, apply cross-validation, and tune hyperparameters using grid search. This hands-on approach will solidify your understanding of how to assess and improve model performance reliably.
We will work with the Iris dataset, a classic dataset for classification tasks. Our goal is to build and evaluate models that can predict the species of an Iris flower based on its sepal and petal measurements.
First, let's import the necessary libraries and load the Iris dataset. We need train_test_split, cross_val_score, and GridSearchCV from Scikit-learn, along with a classifier (like KNeighborsClassifier) and evaluation metrics.
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.pipeline import Pipeline
import plotly.express as px
import plotly.graph_objects as go
# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names
target_names = iris.target_names
# Convert to a Pandas DataFrame for easier inspection (optional)
df = pd.DataFrame(X, columns=feature_names)
df['species'] = pd.Categorical.from_codes(y, target_names)
print("First 5 rows of Iris data:")
print(df.head())
print(f"\nFeatures: {feature_names}")
print(f"Target classes: {target_names}")
Before any model training, we split our data. This ensures we have a separate, unseen dataset (the test set) to evaluate the final model's generalization performance. We'll use train_test_split. Using stratify=y is important for classification tasks to maintain the proportion of each class in both the training and testing sets. We also set random_state for reproducible results.
# Split the data into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(
X, y,
test_size=0.2,
random_state=42, # For reproducibility
stratify=y # Maintain class proportions
)
print(f"\nTraining set shape: X_train={X_train.shape}, y_train={y_train.shape}")
print(f"Testing set shape: X_test={X_test.shape}, y_test={y_test.shape}")
# Check class distribution in train and test sets
print("\nClass distribution in training set:")
print(np.bincount(y_train))
print("Class distribution in testing set:")
print(np.bincount(y_test))
As you can see, the stratify argument helped maintain a balanced representation of each Iris species (setosa, versicolor, virginica) in both splits.
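If you are curious how the counts compare without stratification, a quick, purely illustrative check is to repeat the split without the stratify argument and inspect the class counts again (random_state=42 is reused here only so the example is reproducible):
# Illustrative comparison: the same split without stratification
X_tr_ns, X_te_ns, y_tr_ns, y_te_ns = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print("Class distribution in test set WITHOUT stratify:", np.bincount(y_te_ns))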
Let's train a K-Nearest Neighbors (KNN) classifier on the training data and evaluate it on the test set. It's often good practice to scale features, especially for distance-based algorithms like KNN. We'll create a simple pipeline for this.
# Create a pipeline with scaling and KNN classifier
knn_pipeline_simple = Pipeline([
('scaler', StandardScaler()),
('knn', KNeighborsClassifier(n_neighbors=5)) # Using default k=5 initially
])
# Train the model
knn_pipeline_simple.fit(X_train, y_train)
# Make predictions on the test set
y_pred = knn_pipeline_simple.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"\nInitial KNN Model (k=5) Accuracy on Test Set: {accuracy:.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=target_names))
# Calculate and display the confusion matrix
cm = confusion_matrix(y_test, y_pred)
Confusion matrix for the initial KNN model (k=5) evaluated on the test set. Rows represent true labels, columns represent predicted labels.
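The matrix itself is stored in cm. A minimal sketch of how such a heatmap can be drawn with the plotly import from earlier is shown below; the styling is an assumption, not necessarily the exact figure described above.
# Minimal sketch: render the confusion matrix as a heatmap (styling is illustrative)
fig_cm = px.imshow(
    cm,
    x=target_names, y=target_names,
    labels=dict(x='Predicted label', y='True label', color='Count'),
    color_continuous_scale='Blues'
)
fig_cm.show()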
The initial model achieved perfect accuracy on this particular test split. However, relying on a single train-test split might be optimistic or pessimistic depending on how the data was divided.
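To see this variability in action, you can repeat the split, fit, and score cycle with a few different random seeds; the seeds below are arbitrary choices used purely for illustration.
# Illustrative only: how test accuracy shifts with different train-test splits
for seed in [0, 1, 2, 3, 4]:
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y
    )
    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('knn', KNeighborsClassifier(n_neighbors=5))
    ])
    pipe.fit(X_tr, y_tr)
    print(f"random_state={seed}: test accuracy = {pipe.score(X_te, y_te):.4f}")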
Cross-validation provides a more robust estimate of the model's performance by training and evaluating it on multiple different subsets of the data. We'll use StratifiedKFold because it's a classification problem, ensuring class proportions are maintained in each fold. cross_val_score simplifies this process.
Let's evaluate the same pipeline (Scaler + KNN with k=5) using 5-fold stratified cross-validation on the entire dataset. Note: in practice, you often perform cross-validation only on the training set during the model development phase, reserving the test set for a final, unbiased evaluation. Here, we use the full dataset to demonstrate the cross_val_score function clearly; a training-set-only variant is sketched after the results below.
# Define the cross-validation strategy
cv_strategy = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# Create the pipeline (same as before)
knn_pipeline_cv = Pipeline([
('scaler', StandardScaler()),
('knn', KNeighborsClassifier(n_neighbors=5))
])
# Perform cross-validation
# Note: cross_val_score clones the estimator for each fold, ensuring independence
cv_scores = cross_val_score(knn_pipeline_cv, X, y, cv=cv_strategy, scoring='accuracy')
print(f"\nCross-Validation Scores (k=5): {cv_scores}")
print(f"Mean CV Accuracy: {cv_scores.mean():.4f}")
print(f"Standard Deviation of CV Accuracy: {cv_scores.std():.4f}")
# Visualize the scores per fold as a bar chart, with the mean marked
fold_indices = [f'Fold {i+1}' for i in range(len(cv_scores))]
fig_folds = px.bar(x=fold_indices, y=cv_scores,
                   labels={'x': 'Fold', 'y': 'Accuracy'},
                   title='Cross-Validation Accuracy per Fold (k=5)')
fig_folds.add_hline(y=cv_scores.mean(), line_dash='dash',
                    annotation_text=f'Mean = {cv_scores.mean():.3f}')
fig_folds.show()
Accuracy scores for each fold of the 5-fold stratified cross-validation and the mean accuracy across folds for the KNN model (k=5).
The cross-validation results give us a mean accuracy of approximately 96.7% with some variation between folds (standard deviation ~0.027). This is slightly lower than the perfect score on our single test split, highlighting the value of cross-validation for a more realistic performance estimate.
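As noted earlier, in a typical workflow you would cross-validate on the training split only and keep the test set untouched; a minimal variant of the call above under that convention might look like this.
# Cross-validation restricted to the training data; X_test/y_test stay unseen
cv_scores_train = cross_val_score(
    knn_pipeline_cv, X_train, y_train, cv=cv_strategy, scoring='accuracy'
)
print(f"Mean CV accuracy (training data only): {cv_scores_train.mean():.4f}")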
Our KNN model used n_neighbors=5. Is this the optimal value for k? We can use GridSearchCV to systematically search through a range of hyperparameter values and find the best ones based on cross-validation performance.
GridSearchCV combines hyperparameter tuning with cross-validation. It tries every combination of parameters specified in the grid, evaluates each combination using cross-validation on the training data, and identifies the combination that yields the best average score.
We'll define a parameter grid for the n_neighbors parameter within our pipeline. Notice how we specify the parameter name: step_name__parameter_name (e.g., knn__n_neighbors).
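If you are ever unsure which parameter names a pipeline exposes, get_params() lists them, and the grid can just as easily cover several hyperparameters at once. The snippet below is a sketch: it reuses the earlier knn_pipeline_simple for the name check, and the weights and p values are illustrative additions that are not part of this exercise's search.
# List the tunable KNN parameter names exposed by the earlier pipeline
print([name for name in knn_pipeline_simple.get_params() if name.startswith('knn__')])

# An illustrative multi-parameter grid: every combination would be cross-validated
param_grid_multi = {
    'knn__n_neighbors': np.arange(1, 16),
    'knn__weights': ['uniform', 'distance'],  # uniform vs. distance-weighted votes
    'knn__p': [1, 2],                         # Manhattan vs. Euclidean distance
}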
# Define the pipeline again (important for GridSearch)
pipeline_gs = Pipeline([
('scaler', StandardScaler()),
('knn', KNeighborsClassifier())
])
# Define the parameter grid to search
# We'll search for the best 'k' (n_neighbors) for the KNN step
param_grid = {
'knn__n_neighbors': np.arange(1, 16) # Test k values from 1 to 15
}
# Define the cross-validation strategy for GridSearchCV
cv_strategy_gs = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# Set up GridSearchCV
# It will use the cv_strategy_gs for internal cross-validation
grid_search = GridSearchCV(
estimator=pipeline_gs,
param_grid=param_grid,
cv=cv_strategy_gs,
scoring='accuracy',
n_jobs=-1 # Use all available CPU cores
)
# Fit GridSearchCV on the training data
# Note: Fit GridSearchCV on the TRAINING data (X_train, y_train)
# The test set (X_test, y_test) is reserved for FINAL evaluation
print("\nRunning GridSearchCV...")
grid_search.fit(X_train, y_train)
print("GridSearchCV finished.")
# Get the best parameters and the best score
print(f"\nBest Parameters found by GridSearchCV: {grid_search.best_params_}")
print(f"Best Cross-Validation Accuracy Score: {grid_search.best_score_:.4f}")
# The 'grid_search' object is now a trained model with the best found parameters
# Let's evaluate this best model on the held-out TEST set
best_model = grid_search.best_estimator_
y_pred_best = best_model.predict(X_test)
accuracy_best = accuracy_score(y_test, y_pred_best)
print(f"\nAccuracy of Best Model on Test Set: {accuracy_best:.4f}")
print("\nClassification Report for Best Model:")
print(classification_report(y_test, y_pred_best, target_names=target_names))
# Optionally, visualize CV results from GridSearchCV
results_df = pd.DataFrame(grid_search.cv_results_)
results_df['k'] = results_df['param_knn__n_neighbors'].astype(int)
best_k = grid_search.best_params_['knn__n_neighbors']
fig_k = px.line(results_df, x='k', y='mean_test_score', markers=True,
                labels={'k': 'Number of Neighbors (k)',
                        'mean_test_score': 'Mean CV Accuracy'},
                title='Grid Search Mean CV Accuracy by k')
fig_k.add_vline(x=best_k, line_dash='dash', annotation_text=f'Best k = {best_k}')
fig_k.show()
Mean cross-validation accuracy scores obtained during Grid Search for different values of k (number of neighbors). The best performing value, k=7, is highlighted.
GridSearchCV found that n_neighbors=7 yielded the highest average accuracy during its internal cross-validation process on the training data (around 96.7%). Evaluating this optimized model on our held-out test set resulted in perfect accuracy again in this specific case. While the test accuracy didn't change much here compared to k=5, in many real-world scenarios tuning hyperparameters significantly improves performance on unseen data.
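If you also want a sense of how stable the selected configuration is across folds, its score spread is available in cv_results_; a quick optional check might look like this.
# Standard deviation of the best candidate's accuracy across the CV folds
best_idx = grid_search.best_index_
best_std = grid_search.cv_results_['std_test_score'][best_idx]
print(f"Std of CV accuracy for k={best_k}: {best_std:.4f}")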
In this practical exercise, you learned how to:

- Use train_test_split with stratification for reliable initial evaluation.
- Apply cross_val_score to get a more robust estimate of model performance.
- Tune hyperparameters with GridSearchCV, which combines parameter tuning with internal cross-validation.

These techniques are fundamental for building trustworthy machine learning models. By rigorously evaluating performance and selecting appropriate hyperparameters, you can avoid common pitfalls like overfitting and build models that perform well on new, unseen data.