Now that you understand the concepts behind training, evaluating, and tuning supervised models, let's put theory into practice. This hands-on exercise guides you through building, comparing, and optimizing predictive models using scikit-learn, focusing on the techniques discussed in this chapter. We'll work with a prepared dataset, assuming the data acquisition and feature engineering steps have already yielded suitable training and testing sets.
For this exercise, imagine we have a classification task. We'll use X_train, y_train for training and validation, and X_test, y_test for the final evaluation. Make sure you have pandas, numpy, and scikit-learn installed and imported.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix, roc_auc_score
from sklearn.preprocessing import StandardScaler
# Assume X and y are already loaded (e.g., from a CSV file)
# For demonstration, let's create synthetic data
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, n_classes=2, random_state=42)
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
# Scale numerical features (important for Logistic Regression, good practice generally)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(f"Training set shape: {X_train_scaled.shape}")
print(f"Test set shape: {X_test_scaled.shape}")
Let's start with a straightforward model, Logistic Regression, to establish a performance baseline. We'll use 5-fold cross-validation on the training data to get a reliable estimate of its performance, using accuracy as the primary metric for now; we'll look at other metrics later.
# Initialize the model
log_reg = LogisticRegression(random_state=42, max_iter=1000) # Increased max_iter for convergence
# Perform 5-fold cross-validation
cv_scores_log_reg = cross_val_score(log_reg, X_train_scaled, y_train, cv=5, scoring='accuracy')
print(f"Logistic Regression CV Accuracy Scores: {cv_scores_log_reg}")
print(f"Mean CV Accuracy: {np.mean(cv_scores_log_reg):.4f}")
print(f"Standard Deviation CV Accuracy: {np.std(cv_scores_log_reg):.4f}")
This gives us an average accuracy score and its standard deviation across the folds. This is our initial benchmark. Any subsequent model or tuning should aim to improve upon this mean accuracy, ideally while maintaining a low standard deviation.
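One informal way to use this benchmark (a heuristic added here, not part of the original walkthrough) is to treat the mean plus or minus roughly two standard deviations across folds as a band that a competing model should clearly exceed:
# Rough baseline band: mean +/- 2 standard deviations across the folds (informal heuristic)
baseline_mean = cv_scores_log_reg.mean()
baseline_std = cv_scores_log_reg.std()
print(f"Baseline accuracy band: {baseline_mean - 2 * baseline_std:.4f} to {baseline_mean + 2 * baseline_std:.4f}")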
Now, let's try a more complex, tree-based ensemble model like Random Forest. We'll use default hyperparameters first and evaluate it using the same cross-validation strategy.
# Initialize the model
rf_clf = RandomForestClassifier(random_state=42)
# Perform 5-fold cross-validation
cv_scores_rf = cross_val_score(rf_clf, X_train_scaled, y_train, cv=5, scoring='accuracy')
print(f"Random Forest CV Accuracy Scores: {cv_scores_rf}")
print(f"Mean CV Accuracy: {np.mean(cv_scores_rf):.4f}")
print(f"Standard Deviation CV Accuracy: {np.std(cv_scores_rf):.4f}")
Compare the mean accuracy of the Random Forest with the Logistic Regression baseline. Often, more complex models can achieve higher performance out-of-the-box, but they also have more hyperparameters that can significantly influence their behavior. Let's assume the Random Forest performed better and proceed to tune it.
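To make that comparison concrete, you can print the two mean scores side by side; this is a small convenience addition that simply reuses the score arrays computed above:
# Side-by-side comparison of the cross-validated models
print(f"Logistic Regression mean CV accuracy: {np.mean(cv_scores_log_reg):.4f}")
print(f"Random Forest mean CV accuracy:       {np.mean(cv_scores_rf):.4f}")
print(f"Difference (RF - LogReg):             {np.mean(cv_scores_rf) - np.mean(cv_scores_log_reg):+.4f}")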
Grid Search systematically works through every combination of the specified hyperparameter values, cross-validating each combination to determine which one gives the best performance according to a chosen scoring metric. Let's tune n_estimators (the number of trees), max_depth (the maximum depth of each tree), and min_samples_split for our RandomForestClassifier.
# Define the parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5]
}
# Initialize the Grid Search
# We use the scaled training data
grid_search = GridSearchCV(estimator=RandomForestClassifier(random_state=42),
                           param_grid=param_grid,
                           cv=3,                # Using 3 folds for a faster search in this example
                           scoring='accuracy',
                           n_jobs=-1,           # Use all available CPU cores
                           verbose=1)           # Show progress
# Fit Grid Search to the data
grid_search.fit(X_train_scaled, y_train)
# Print the best parameters and the corresponding score
print(f"Best parameters found by Grid Search: {grid_search.best_params_}")
print(f"Best cross-validation accuracy score: {grid_search.best_score_:.4f}")
# Access the best estimator directly
best_rf_grid = grid_search.best_estimator_
Grid Search explores all combinations defined in param_grid: in this case, 3×3×2=18 combinations. Each combination is evaluated with 3-fold cross-validation, so 18×3=54 model fits are performed in total. Setting n_jobs=-1 parallelizes these fits across the available CPU cores.
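Beyond best_params_, the fitted search object records every evaluated combination in its cv_results_ attribute. Here is a short sketch that loads those results into a pandas DataFrame to inspect the top-ranked combinations (the column names come from scikit-learn's cv_results_ dictionary):
# Inspect all evaluated combinations, best-ranked first
results_df = pd.DataFrame(grid_search.cv_results_)
summary_cols = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
print(results_df[summary_cols].sort_values('rank_test_score').head())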
When the hyperparameter space is large, Grid Search can become computationally expensive. Randomized Search offers a more efficient alternative by sampling a fixed number of parameter combinations from specified distributions or lists.
Let's try tuning the same Random Forest model, but this time exploring more parameters and wider ranges using RandomizedSearchCV.
from scipy.stats import randint
# Define the parameter distributions
param_dist = {
    'n_estimators': randint(50, 300),    # Sample integers from 50 to 299
    'max_depth': [None, 10, 20, 30],     # Discrete list to choose from
    'min_samples_split': randint(2, 11), # Sample integers from 2 to 10
    'min_samples_leaf': randint(1, 11),  # Sample integers from 1 to 10
    'bootstrap': [True, False]           # Whether to bootstrap samples
}
# Initialize Randomized Search
random_search = RandomizedSearchCV(estimator=RandomForestClassifier(random_state=42),
                                   param_distributions=param_dist,
                                   n_iter=20,          # Number of parameter settings sampled
                                   cv=3,               # Using 3 folds
                                   scoring='accuracy',
                                   n_jobs=-1,
                                   random_state=42,    # For reproducibility
                                   verbose=1)
# Fit Randomized Search to the data
random_search.fit(X_train_scaled, y_train)
# Print the best parameters and the corresponding score
print(f"Best parameters found by Randomized Search: {random_search.best_params_}")
print(f"Best cross-validation accuracy score: {random_search.best_score_:.4f}")
# Access the best estimator
best_rf_random = random_search.best_estimator_
RandomizedSearchCV performs n_iter (here, 20) trials. In each trial, it randomly samples a value for each hyperparameter from the specified distribution or list. This often finds very good parameter combinations much faster than an exhaustive Grid Search, especially when some hyperparameters are more influential than others.
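To see how the two strategies compare on this problem, you can put the best cross-validation scores from both fitted searches side by side (a small added check, not part of the original code):
# Compare the best CV scores found by the two search strategies
print(f"Grid Search best CV accuracy:       {grid_search.best_score_:.4f}")
print(f"Randomized Search best CV accuracy: {random_search.best_score_:.4f}")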
After tuning, we select the best model found (let's use the one from Randomized Search in this example, assuming it gave the best CV score). We train this final model configuration on the entire training set (X_train_scaled, y_train) and then evaluate its performance on the unseen test set (X_test_scaled, y_test). This provides an unbiased estimate of how the model is expected to perform on new data.
# Use the best estimator found by Randomized Search
final_model = random_search.best_estimator_
# Note: with refit=True (the default), best_estimator_ from GridSearchCV /
# RandomizedSearchCV has already been retrained on the full training set.
# With refit=False, best_estimator_ is not available; you would instead create
# a new model with random_search.best_params_ and fit it on X_train_scaled, y_train.
# Make predictions on the test set
y_pred_test = final_model.predict(X_test_scaled)
y_pred_proba_test = final_model.predict_proba(X_test_scaled)[:, 1] # Probabilities for ROC AUC
# Evaluate the final model
test_accuracy = accuracy_score(y_test, y_pred_test)
test_roc_auc = roc_auc_score(y_test, y_pred_proba_test)
print(f"\n--- Final Model Evaluation on Test Set ---")
print(f"Test Accuracy: {test_accuracy:.4f}")
print(f"Test ROC AUC: {test_roc_auc:.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred_test))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred_test))
The classification report provides Precision, Recall, and F1-score for each class, giving a more detailed view of performance than accuracy alone. The confusion matrix shows the counts of true positives, true negatives, false positives, and false negatives. The ROC AUC score evaluates the model's ability to distinguish between classes based on predicted probabilities.
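To connect these numbers, you can unpack the binary confusion matrix and recompute precision and recall for the positive class by hand; this is a small illustrative sketch, not part of the original walkthrough:
# For a binary problem, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_test).ravel()
precision_pos = tp / (tp + fp)
recall_pos = tp / (tp + fn)
print(f"Recomputed precision (positive class): {precision_pos:.4f}")
print(f"Recomputed recall (positive class):    {recall_pos:.4f}")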
Let's visualize the feature importances from our final tuned Random Forest model. This can provide insights into which features the model relies on most heavily.
# Get feature importances
importances = final_model.feature_importances_
indices = np.argsort(importances)[::-1] # Sort features by importance
# Prepare data for plotting (top 10 features)
n_top_features = 10
feature_names = [f"Feature {i}" for i in range(X_train_scaled.shape[1])] # Placeholder names
top_indices = indices[:n_top_features]
top_importances = importances[top_indices]
top_feature_names = [feature_names[i] for i in top_indices]
# Create Plotly bar chart JSON
plotly_json = {
    "data": [{
        "type": "bar",
        "x": top_importances[::-1].tolist(),  # Reverse for ascending order; tolist() makes the array JSON-serializable
        "y": top_feature_names[::-1],
        "orientation": "h",
        "marker": {"color": "#228be6"}        # Blue bars
    }],
    "layout": {
        "title": "Top 10 Feature Importances (Random Forest)",
        "xaxis": {"title": "Importance Score"},
        "yaxis": {"title": "Feature"},
        "margin": {"l": 100, "r": 20, "t": 50, "b": 50}  # Widen left margin for labels
    }
}
import json
print("```plotly")
print(json.dumps(plotly_json))
print("```")
Feature importances derived from the final tuned Random Forest model, showing the top 10 most influential features according to the Gini importance measure.
In this hands-on section, you practiced the complete workflow of training, evaluating, and tuning supervised learning models: establishing a cross-validated baseline, comparing it against a more complex model, using GridSearchCV to systematically search a defined hyperparameter space, and using RandomizedSearchCV for a more efficient search over a larger space or distributions, before evaluating the final tuned model on the held-out test set.
This iterative process of training, evaluating, and tuning is central to developing effective machine learning models. Remember that the choice of model, hyperparameters, evaluation metrics, and tuning strategy depends heavily on the specific problem, the dataset characteristics, and the computational resources available.