This hands-on exercise guides you through building, comparing, and optimizing predictive models with scikit-learn, applying the techniques covered for training, evaluating, and tuning supervised models. We work with a prepared dataset: data acquisition and feature engineering are already done, leaving suitable training and testing sets.

For this exercise, imagine we have a classification task. We'll use `X_train`, `y_train` for training and validation, and `X_test`, `y_test` for the final evaluation. Make sure you have pandas, NumPy, and scikit-learn installed and imported.

```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix, roc_auc_score
from sklearn.preprocessing import StandardScaler

# Assume X and y are already loaded (e.g., from a CSV file)
# For demonstration, let's create synthetic data
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, n_classes=2, random_state=42)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Scale numerical features (important for Logistic Regression, good practice generally)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"Training set shape: {X_train_scaled.shape}")
print(f"Test set shape: {X_test_scaled.shape}")
```

## Baseline Model: Logistic Regression

Let's start with a straightforward model, Logistic Regression, to establish baseline performance. We'll use 5-fold cross-validation ($k=5$) on the training data to get a reliable estimate of its performance, with accuracy as the primary metric for now; we'll look at other metrics later.

```python
# Initialize the model
log_reg = LogisticRegression(random_state=42, max_iter=1000)  # Increased max_iter for convergence

# Perform 5-fold cross-validation
cv_scores_log_reg = cross_val_score(log_reg, X_train_scaled, y_train, cv=5, scoring='accuracy')

print(f"Logistic Regression CV Accuracy Scores: {cv_scores_log_reg}")
print(f"Mean CV Accuracy: {np.mean(cv_scores_log_reg):.4f}")
print(f"Standard Deviation CV Accuracy: {np.std(cv_scores_log_reg):.4f}")
```

This gives us an average accuracy score and its standard deviation across the folds, which serves as our initial benchmark. Any subsequent model or tuning should aim to improve upon this mean accuracy, ideally while maintaining a low standard deviation.

## Trying a More Complex Model: Random Forest

Now let's try a more complex, tree-based ensemble model: Random Forest. We'll use default hyperparameters first and evaluate it with the same cross-validation strategy.

```python
# Initialize the model
rf_clf = RandomForestClassifier(random_state=42)

# Perform 5-fold cross-validation
cv_scores_rf = cross_val_score(rf_clf, X_train_scaled, y_train, cv=5, scoring='accuracy')

print(f"Random Forest CV Accuracy Scores: {cv_scores_rf}")
print(f"Mean CV Accuracy: {np.mean(cv_scores_rf):.4f}")
print(f"Standard Deviation CV Accuracy: {np.std(cv_scores_rf):.4f}")
```

Compare the mean accuracy of the Random Forest with the Logistic Regression baseline. Often, more complex models can achieve higher performance out of the box, but they also have more hyperparameters that can significantly influence their behavior.
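To make the comparison concrete, here is a small optional sketch that reuses the `cv_scores_log_reg` and `cv_scores_rf` arrays computed above and prints the two results side by side:

```python
# Optional sketch: summarize the two cross-validation results side by side,
# reusing the score arrays computed above.
results = {
    "Logistic Regression": cv_scores_log_reg,
    "Random Forest": cv_scores_rf,
}

for name, scores in results.items():
    print(f"{name:>20}: mean accuracy = {np.mean(scores):.4f} "
          f"(std {np.std(scores):.4f} over {len(scores)} folds)")
```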
Let's assume the Random Forest performed better and proceed to tune it.

## Hyperparameter Tuning with Grid Search

Grid Search systematically works through multiple combinations of parameter values, cross-validating each combination to determine which one gives the best performance according to a specified scoring metric. Let's tune `n_estimators` (number of trees), `max_depth` (maximum depth of each tree), and `min_samples_split` for our `RandomForestClassifier`.

```python
# Define the parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5]
}

# Initialize the Grid Search
# We use the scaled training data
grid_search = GridSearchCV(estimator=RandomForestClassifier(random_state=42),
                           param_grid=param_grid,
                           cv=3,               # Using 3 folds for a faster search in this example
                           scoring='accuracy',
                           n_jobs=-1,          # Use all available CPU cores
                           verbose=1)          # Shows progress

# Fit Grid Search to the data
grid_search.fit(X_train_scaled, y_train)

# Print the best parameters and the corresponding score
print(f"Best parameters found by Grid Search: {grid_search.best_params_}")
print(f"Best cross-validation accuracy score: {grid_search.best_score_:.4f}")

# Access the best estimator directly
best_rf_grid = grid_search.best_estimator_
```

Grid Search examines every combination defined in `param_grid`: in this case, $3 \times 3 \times 2 = 18$ combinations. Each combination is evaluated with 3-fold cross-validation, so $18 \times 3 = 54$ model fits are performed. Setting `n_jobs=-1` parallelizes this process.

## Hyperparameter Tuning with Randomized Search

When the hyperparameter space is large, Grid Search can become computationally expensive. Randomized Search offers a more efficient alternative by sampling a fixed number of parameter combinations from specified distributions or lists.

Let's tune the same Random Forest model, this time exploring more parameters and wider ranges with `RandomizedSearchCV`.

```python
from scipy.stats import randint

# Define the parameter distributions
param_dist = {
    'n_estimators': randint(50, 300),     # Sample integers between 50 and 299
    'max_depth': [None, 10, 20, 30],      # List to choose from
    'min_samples_split': randint(2, 11),  # Sample integers between 2 and 10
    'min_samples_leaf': randint(1, 11),   # Sample integers between 1 and 10
    'bootstrap': [True, False]            # Boolean options
}

# Initialize Randomized Search
random_search = RandomizedSearchCV(estimator=RandomForestClassifier(random_state=42),
                                   param_distributions=param_dist,
                                   n_iter=20,           # Number of parameter settings that are sampled
                                   cv=3,                # Using 3 folds
                                   scoring='accuracy',
                                   n_jobs=-1,
                                   random_state=42,     # For reproducibility
                                   verbose=1)

# Fit Randomized Search to the data
random_search.fit(X_train_scaled, y_train)

# Print the best parameters and the corresponding score
print(f"Best parameters found by Randomized Search: {random_search.best_params_}")
print(f"Best cross-validation accuracy score: {random_search.best_score_:.4f}")

# Access the best estimator
best_rf_random = random_search.best_estimator_
```

`RandomizedSearchCV` performs `n_iter` (here, 20) trials, each of which randomly samples a value for every hyperparameter from the specified distributions or lists. This often finds very good parameter combinations much faster than an exhaustive Grid Search, especially when some hyperparameters are more influential than others.
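Both search objects also expose a `cv_results_` dictionary, so you can look beyond the single best configuration. A quick optional sketch like the one below (using the `random_search` object fitted above; the same pattern works for `grid_search`) turns it into a sorted DataFrame:

```python
# Optional sketch: inspect all evaluated configurations, not just the best one.
# The same pattern works for grid_search.
results_df = pd.DataFrame(random_search.cv_results_)

# Keep the most informative columns and sort by cross-validated rank
cols = ["rank_test_score", "mean_test_score", "std_test_score", "params"]
print(results_df[cols].sort_values("rank_test_score").head(5).to_string(index=False))
```

Comparing the spread of `mean_test_score` values across trials gives a rough sense of whether widening the search ranges further is likely to pay off.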
## Final Evaluation on the Test Set

After tuning, we select the best model found (let's use the one from Randomized Search in this example, assuming it gave the best CV score). We train this final model configuration on the entire training set (`X_train_scaled`, `y_train`) and then evaluate its performance on the unseen test set (`X_test_scaled`, `y_test`). This provides an unbiased estimate of how the model is expected to perform on new data.

```python
# Use the best estimator found by Randomized Search
final_model = random_search.best_estimator_

# Note: best_estimator_ from GridSearchCV/RandomizedSearchCV is already refit
# on the whole training set when refit=True (the default). With refit=False,
# best_estimator_ is not available; you would instead instantiate a model with
# the best parameters and fit it yourself, e.g.:
# final_model = RandomForestClassifier(random_state=42, **random_search.best_params_)
# final_model.fit(X_train_scaled, y_train)

# Make predictions on the test set
y_pred_test = final_model.predict(X_test_scaled)
y_pred_proba_test = final_model.predict_proba(X_test_scaled)[:, 1]  # Probabilities for ROC AUC

# Evaluate the final model
test_accuracy = accuracy_score(y_test, y_pred_test)
test_roc_auc = roc_auc_score(y_test, y_pred_proba_test)

print("\n--- Final Model Evaluation on Test Set ---")
print(f"Test Accuracy: {test_accuracy:.4f}")
print(f"Test ROC AUC: {test_roc_auc:.4f}")

print("\nClassification Report:")
print(classification_report(y_test, y_pred_test))

print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred_test))
```

The classification report provides Precision, Recall, and $F_1$-score for each class, giving a more detailed view of performance than accuracy alone. The confusion matrix shows the counts of true positives, true negatives, false positives, and false negatives. The ROC AUC score evaluates the model's ability to distinguish between the classes based on predicted probabilities.
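To connect those numbers, here is a small optional sketch that unpacks the binary confusion matrix into its four counts and recomputes precision and recall for the positive class by hand, reusing `y_test` and `y_pred_test` from above:

```python
# Optional sketch: unpack the binary confusion matrix and recompute two of the
# classification-report metrics by hand (positive class = 1).
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_test).ravel()

precision = tp / (tp + fp)  # Of the samples predicted positive, how many truly are?
recall = tp / (tp + fn)     # Of the truly positive samples, how many did we catch?

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"Precision (class 1): {precision:.4f}")
print(f"Recall (class 1):    {recall:.4f}")
```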
Let's visualize the feature importances from our final tuned Random Forest model. This can provide insight into which features the model relies on most heavily.

```python
# Get feature importances
importances = final_model.feature_importances_
indices = np.argsort(importances)[::-1]  # Sort features by importance (descending)

# Prepare data for plotting (top 10 features)
n_top_features = 10
feature_names = [f"Feature {i}" for i in range(X_train_scaled.shape[1])]  # Placeholder names
top_indices = indices[:n_top_features]
top_importances = importances[top_indices]
top_feature_names = [feature_names[i] for i in top_indices]

# Create Plotly bar chart JSON
plotly_json = {
    "data": [{
        "type": "bar",
        "x": top_importances[::-1].tolist(),  # Reverse for ascending order; tolist() so json.dumps can serialize it
        "y": top_feature_names[::-1],
        "orientation": "h",
        "marker": {"color": "#228be6"}  # Blue color
    }],
    "layout": {
        "title": "Top 10 Feature Importances (Random Forest)",
        "xaxis": {"title": "Importance Score"},
        "yaxis": {"title": "Feature"},
        "margin": {"l": 100, "r": 20, "t": 50, "b": 50}  # Adjust margins
    }
}

import json
print("```plotly")
print(json.dumps(plotly_json))
print("```")
```

```plotly
{"data": [{"type": "bar", "x": [0.04168767854229687, 0.04391775941518639, 0.04534640761849839, 0.05306566444728158, 0.05559369878094566, 0.06131167764431487, 0.06316479581190193, 0.06491901879378965, 0.0668898282672587, 0.06771243136811413], "y": ["Feature 6", "Feature 1", "Feature 16", "Feature 18", "Feature 19", "Feature 8", "Feature 4", "Feature 7", "Feature 15", "Feature 11"], "orientation": "h", "marker": {"color": "#228be6"}}], "layout": {"title": "Top 10 Feature Importances (Random Forest)", "xaxis": {"title": "Importance Score"}, "yaxis": {"title": "Feature"}, "margin": {"l": 100, "r": 20, "t": 50, "b": 50}}}
```

Feature importances derived from the final tuned Random Forest model, showing the top 10 most influential features according to the Gini importance measure.

## Summary

In this hands-on section, you practiced the complete workflow of training, evaluating, and tuning supervised learning models:

- Established a baseline performance using Logistic Regression and cross-validation.
- Trained a more complex model (Random Forest) and compared its initial performance.
- Applied `GridSearchCV` to systematically search a defined hyperparameter space.
- Used `RandomizedSearchCV` for a more efficient search over a larger space or distributions.
- Selected the best hyperparameter configuration based on cross-validation performance.
- Evaluated the final, tuned model on a held-out test set using multiple metrics (Accuracy, ROC AUC, Classification Report, Confusion Matrix) to get an unbiased performance estimate.
- Visualized feature importances to gain insights from the model.

This iterative process of training, evaluating, and tuning is central to developing effective machine learning models. Remember that the choice of model, hyperparameters, evaluation metrics, and tuning strategy depends heavily on the specific problem, the dataset characteristics, and the computational resources available.
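As a closing sketch, and not part of the exercise above, the scaling and tuning steps can also be bundled into a single scikit-learn `Pipeline`, so the scaler is re-fit inside every cross-validation fold rather than once on the whole training set. The step names `"scaler"` and `"clf"` below are arbitrary labels chosen for this illustration:

```python
from sklearn.pipeline import Pipeline

# A minimal sketch: chain scaling and the classifier so cross-validation
# refits the scaler on each training fold.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", RandomForestClassifier(random_state=42)),
])

# Pipeline hyperparameters are addressed as <step name>__<parameter name>
pipe_grid = {
    "clf__n_estimators": [100, 200],
    "clf__max_depth": [None, 20],
}

pipe_search = GridSearchCV(pipe, pipe_grid, cv=3, scoring="accuracy", n_jobs=-1)
pipe_search.fit(X_train, y_train)  # Unscaled X_train: the pipeline handles scaling per fold

print(f"Best pipeline parameters: {pipe_search.best_params_}")
print(f"Best CV accuracy: {pipe_search.best_score_:.4f}")
```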