In this hands-on exercise, we use Scikit-learn to perform train-test splits, apply cross-validation, and tune hyperparameters with grid search. Working through these steps will solidify your understanding of how to assess and improve model performance reliably. We will use the Iris dataset, a classic dataset for classification tasks. Our goal is to build and evaluate models that predict the species of an Iris flower based on its sepal and petal measurements.

## Setup and Data Loading

First, let's import the necessary libraries and load the Iris dataset. We need `train_test_split`, `cross_val_score`, and `GridSearchCV` from Scikit-learn, along with a classifier (like `KNeighborsClassifier`) and evaluation metrics.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.pipeline import Pipeline
import plotly.express as px       # used for the figures referenced below
import plotly.graph_objects as go

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names
target_names = iris.target_names

# Convert to a Pandas DataFrame for easier inspection (optional)
df = pd.DataFrame(X, columns=feature_names)
df['species'] = pd.Categorical.from_codes(y, target_names)

print("First 5 rows of Iris data:")
print(df.head())
print(f"\nFeatures: {feature_names}")
print(f"Target classes: {target_names}")
```

## 1. Splitting Data into Training and Testing Sets

Before any model training, we split our data. This ensures we have a separate, unseen dataset (the test set) to evaluate the final model's generalization performance. We'll use `train_test_split`. Setting `stratify=y` is important for classification tasks because it maintains the proportion of each class in both the training and testing sets. We also set `random_state` for reproducible results.

```python
# Split the data into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,  # For reproducibility
    stratify=y        # Maintain class proportions
)

print(f"\nTraining set shape: X_train={X_train.shape}, y_train={y_train.shape}")
print(f"Testing set shape: X_test={X_test.shape}, y_test={y_test.shape}")

# Check class distribution in train and test sets
print("\nClass distribution in training set:")
print(np.bincount(y_train))
print("Class distribution in testing set:")
print(np.bincount(y_test))
```

As you can see, the `stratify` argument maintains a balanced representation of each Iris species (setosa, versicolor, virginica) in both splits.

## 2. Initial Model Training and Evaluation (Using the Train-Test Split)

Let's train a K-Nearest Neighbors (KNN) classifier on the training data and evaluate it on the test set. It's often good practice to scale features, especially for distance-based algorithms like KNN.
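To see why scaling matters, it helps to glance at the raw feature scales first. This optional check simply reuses the `df` built during setup:

```python
# Optional: the four measurements have different means and spreads; without
# scaling, features with larger ranges carry more weight in KNN's distances.
print(df[feature_names].describe().loc[['mean', 'std', 'min', 'max']])
```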
We'll create a simple pipeline for this.

```python
# Create a pipeline with scaling and a KNN classifier
knn_pipeline_simple = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors=5))  # Using the default k=5 initially
])

# Train the model
knn_pipeline_simple.fit(X_train, y_train)

# Make predictions on the test set
y_pred = knn_pipeline_simple.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"\nInitial KNN Model (k=5) Accuracy on Test Set: {accuracy:.4f}")

print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=target_names))

# Calculate and display the confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(cm)
```

*Figure: Confusion matrix for the initial KNN model (k=5) evaluated on the test set. Rows represent true labels, columns represent predicted labels.*

The initial model achieved perfect accuracy on this particular test split. However, relying on a single train-test split can be optimistic or pessimistic depending on how the data happened to be divided.

## 3. Evaluating Performance with Cross-Validation

Cross-validation provides a more reliable estimate of the model's performance by training and evaluating it on multiple different subsets of the data. Because this is a classification problem, we'll use `StratifiedKFold`, which keeps the class proportions consistent in each fold, and `cross_val_score` to run the whole procedure.

Let's evaluate the same pipeline (scaler + KNN with k=5) using 5-fold stratified cross-validation on the entire dataset. Note: in practice, you usually perform cross-validation only on the training set during model development, reserving the test set for a final, unbiased evaluation.
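For reference, that train-only workflow just means passing the training split instead of the full arrays. Here is a minimal sketch that reuses the scaler-plus-KNN pipeline and the `X_train`/`y_train` split created earlier:

```python
# Cross-validate on the training data only, keeping the test set untouched
# for the final evaluation.
train_only_scores = cross_val_score(
    Pipeline([('scaler', StandardScaler()),
              ('knn', KNeighborsClassifier(n_neighbors=5))]),
    X_train, y_train,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring='accuracy'
)
print(f"Train-only CV accuracy: {train_only_scores.mean():.4f} (+/- {train_only_scores.std():.4f})")
```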
In the code below, however, we run cross-validation on the full dataset so the `cross_val_score` call itself is easy to follow.

```python
# Define the cross-validation strategy
cv_strategy = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Create the pipeline (same as before)
knn_pipeline_cv = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors=5))
])

# Perform cross-validation
# Note: cross_val_score clones the estimator for each fold, ensuring independence
cv_scores = cross_val_score(knn_pipeline_cv, X, y, cv=cv_strategy, scoring='accuracy')

print(f"\nCross-Validation Scores (k=5): {cv_scores}")
print(f"Mean CV Accuracy: {cv_scores.mean():.4f}")
print(f"Standard Deviation of CV Accuracy: {cv_scores.std():.4f}")

# Labels for plotting the per-fold scores
fold_indices = [f'Fold {i+1}' for i in range(len(cv_scores))]
```

*Figure: Accuracy scores for each fold of the 5-fold stratified cross-validation and the mean accuracy across folds for the KNN model (k=5).*

The cross-validation results give us a mean accuracy of approximately 96.7% with some variation between folds (standard deviation of roughly 0.03). This is slightly lower than the perfect score on our single test split, highlighting the value of cross-validation for obtaining a more realistic performance estimate.

## 4. Hyperparameter Tuning with GridSearchCV

Our KNN model used `n_neighbors=5`. Is this the optimal value for k? We can use `GridSearchCV` to systematically search through a range of hyperparameter values and find the best ones based on cross-validation performance.

`GridSearchCV` combines hyperparameter tuning with cross-validation. It tries every combination of parameters specified in the grid, evaluates each combination using cross-validation on the training data, and identifies the combination that yields the best average score.

We'll define a parameter grid for the `n_neighbors` parameter within our pipeline. Notice the parameter naming convention `step_name__parameter_name` (e.g., `knn__n_neighbors`): the double underscore routes the setting to the classifier inside the pipeline step named `'knn'`.
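If you're ever unsure which parameter names a pipeline exposes for tuning, `get_params()` lists them. A quick, optional check using the `knn_pipeline_cv` defined above:

```python
# List the tunable KNN parameters as seen through the pipeline; each name is
# prefixed with the step name ('knn__').
knn_params = [name for name in knn_pipeline_cv.get_params() if name.startswith('knn__')]
print(knn_params)  # includes 'knn__n_neighbors', 'knn__weights', 'knn__metric', ...
```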
Now let's set up and run the grid search.

```python
# Define the pipeline again (important for GridSearch)
pipeline_gs = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier())
])

# Define the parameter grid to search
# We'll search for the best 'k' (n_neighbors) for the KNN step
param_grid = {
    'knn__n_neighbors': np.arange(1, 16)  # Test k values from 1 to 15
}

# Define the cross-validation strategy for GridSearchCV
cv_strategy_gs = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Set up GridSearchCV
# It will use cv_strategy_gs for its internal cross-validation
grid_search = GridSearchCV(
    estimator=pipeline_gs,
    param_grid=param_grid,
    cv=cv_strategy_gs,
    scoring='accuracy',
    n_jobs=-1  # Use all available CPU cores
)

# Fit GridSearchCV on the TRAINING data (X_train, y_train)
# The test set (X_test, y_test) is reserved for the FINAL evaluation
print("\nRunning GridSearchCV...")
grid_search.fit(X_train, y_train)
print("GridSearchCV finished.")

# Get the best parameters and the best score
print(f"\nBest Parameters found by GridSearchCV: {grid_search.best_params_}")
print(f"Best Cross-Validation Accuracy Score: {grid_search.best_score_:.4f}")

# The 'grid_search' object is now a trained model refit with the best found parameters.
# Let's evaluate this best model on the held-out TEST set.
best_model = grid_search.best_estimator_
y_pred_best = best_model.predict(X_test)
accuracy_best = accuracy_score(y_test, y_pred_best)

print(f"\nAccuracy of Best Model on Test Set: {accuracy_best:.4f}")
print("\nClassification Report for Best Model:")
print(classification_report(y_test, y_pred_best, target_names=target_names))

# Optionally, collect the CV results from GridSearchCV for inspection and plotting
results_df = pd.DataFrame(grid_search.cv_results_)
best_k = grid_search.best_params_['knn__n_neighbors']
```

*Figure: Mean cross-validation accuracy scores obtained during the grid search for different values of k (number of neighbors). The best-performing value, k=7, is highlighted.*

GridSearchCV found that `n_neighbors=7` yielded the highest average accuracy during its internal cross-validation on the training data (around 96.7%). Evaluating this optimized model on our held-out test set again resulted in perfect accuracy in this specific case.
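The per-k scores behind that figure are also available directly from `cv_results_`. A short sketch using the `results_df` built in the previous block:

```python
# Tabulate the grid-search results: mean and std of CV accuracy per candidate k.
summary = results_df[['param_knn__n_neighbors', 'mean_test_score', 'std_test_score']]
print(summary.sort_values('mean_test_score', ascending=False).to_string(index=False))
```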
While the test accuracy didn't change much here compared to k=5, in many scenarios tuning hyperparameters significantly improves performance on unseen data.

## Summary

In this practical exercise, you learned how to:

- Split data using `train_test_split` with stratification for a reliable initial evaluation.
- Perform K-Fold (specifically Stratified K-Fold) cross-validation using `cross_val_score` to get a better estimate of model performance.
- Systematically search for optimal hyperparameters using `GridSearchCV`, which combines parameter tuning with internal cross-validation.
- Evaluate the final, tuned model on the reserved test set to estimate its generalization ability.

These techniques are fundamental for building trustworthy machine learning models. By rigorously evaluating performance and selecting appropriate hyperparameters, you can avoid common issues like overfitting and build models that perform well on new, unseen data.