This section walks through the practical implementation of classification models using Scikit-learn: Logistic Regression, K-Nearest Neighbors (KNN), and Support Vector Machines (SVM). We evaluate their performance on a standard dataset and interpret the results. A working Python environment with Scikit-learn, NumPy, Pandas, and Plotly installed is assumed.

## Setting Up the Environment and Data

First, let's import the necessary libraries and load a dataset. We will use the well-known Iris dataset, which is conveniently included in Scikit-learn. This dataset contains measurements for 150 iris flowers belonging to three different species: setosa, versicolor, and virginica. The goal is to classify the species based on sepal length, sepal width, petal length, and petal width.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import plotly.graph_objects as go
import plotly.io as pio

# Configure Plotly for better display
pio.templates.default = "plotly_white"

# Load the dataset
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names
target_names = iris.target_names

# Create a DataFrame for easier inspection (optional)
df = pd.DataFrame(X, columns=feature_names)
df['species'] = y
df['species_name'] = df['species'].map({i: name for i, name in enumerate(target_names)})
# print(df.head())
# print(f"Target names: {target_names}")

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
# print(f"Training set shape: X={X_train.shape}, y={y_train.shape}")
# print(f"Testing set shape: X={X_test.shape}, y={y_test.shape}")

# Feature Scaling
# Algorithms like KNN and SVM are sensitive to feature scales.
# Logistic Regression can also benefit.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

We load the data, separate the features (`X`) and the target variable (`y`), and then split them into training and testing sets using `train_test_split`. The `stratify=y` argument ensures that the proportion of each class is approximately the same in both the training and testing sets, which is important for classification tasks, especially with imbalanced datasets (though Iris is balanced). Finally, we apply `StandardScaler` to standardize the features by removing the mean and scaling to unit variance. Note that we fit the scaler only on the training data and then transform both the training and testing data to prevent information leakage from the test set.
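As a quick, optional sanity check (a minimal sketch, not part of the main listing, that reuses the arrays defined above), we can confirm that stratification preserved the class proportions and that the scaled training features have roughly zero mean and unit variance:

```python
# Optional sanity checks (illustrative) -- assumes the variables from the listing above

# stratify=y should keep class proportions roughly equal across the two splits
print("Train class proportions:", np.bincount(y_train) / len(y_train))
print("Test class proportions: ", np.bincount(y_test) / len(y_test))

# After StandardScaler, each training feature has mean ~0 and std ~1.
# The test set is only approximately standardized, because the scaler was fit on the training data.
print("Scaled train means:", X_train_scaled.mean(axis=0).round(3))
print("Scaled train stds: ", X_train_scaled.std(axis=0).round(3))
```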
## Implementing Logistic Regression

Logistic Regression is a linear model commonly used for binary classification, but Scikit-learn's implementation also supports multi-class problems (like Iris) using a one-vs-rest (OvR) or multinomial scheme.

```python
# Initialize and train the Logistic Regression model
log_reg = LogisticRegression(random_state=42, multi_class='ovr', solver='liblinear')

# Using scaled data
log_reg.fit(X_train_scaled, y_train)

# Make predictions on the test set
y_pred_log_reg = log_reg.predict(X_test_scaled)

# Evaluate the model
accuracy_log_reg = accuracy_score(y_test, y_pred_log_reg)
report_log_reg = classification_report(y_test, y_pred_log_reg, target_names=target_names)

print("--- Logistic Regression Evaluation ---")
print(f"Accuracy: {accuracy_log_reg:.4f}")
print("Classification Report:")
print(report_log_reg)
```

Here, we initialize `LogisticRegression`. We specify `multi_class='ovr'` (One-vs-Rest) and choose a solver suited to this dataset (`'liblinear'` works well for smaller datasets). We train the model on the scaled training data (`X_train_scaled`, `y_train`) and then predict labels for the scaled test data (`X_test_scaled`). Finally, we calculate the accuracy and generate a classification report, which includes precision, recall, and F1-score for each class.

## Implementing K-Nearest Neighbors (KNN)

KNN classifies a data point based on the majority class among its 'k' nearest neighbors in the feature space. The choice of 'k' and the distance metric are important considerations. Since KNN relies on distance calculations, feature scaling is generally required.

```python
# Initialize and train the KNN model
# Let's start with k=5
knn = KNeighborsClassifier(n_neighbors=5)

# Using scaled data
knn.fit(X_train_scaled, y_train)

# Make predictions
y_pred_knn = knn.predict(X_test_scaled)

# Evaluate the model
accuracy_knn = accuracy_score(y_test, y_pred_knn)
report_knn = classification_report(y_test, y_pred_knn, target_names=target_names)

print("\n--- K-Nearest Neighbors (k=5) Evaluation ---")
print(f"Accuracy: {accuracy_knn:.4f}")
print("Classification Report:")
print(report_knn)
```

We initialize `KNeighborsClassifier` with `n_neighbors=5` (a common starting point for 'k'). We train it on the scaled training data and evaluate it on the scaled test data, just as with Logistic Regression. The performance of KNN can be sensitive to the value of k; experimentation (often using techniques like cross-validation, discussed in Chapter 5) is usually needed to find an optimal value (a brief illustration follows the SVM section below).

## Implementing Support Vector Machines (SVM)

SVMs aim to find the optimal hyperplane that separates the classes in the feature space. We'll use the `SVC` (Support Vector Classifier) class from Scikit-learn. SVMs also typically require scaled features.

```python
# Initialize and train the SVM model
# Using default parameters (RBF kernel, C=1.0)
svm_clf = SVC(random_state=42)

# Using scaled data
svm_clf.fit(X_train_scaled, y_train)

# Make predictions
y_pred_svm = svm_clf.predict(X_test_scaled)

# Evaluate the model
accuracy_svm = accuracy_score(y_test, y_pred_svm)
report_svm = classification_report(y_test, y_pred_svm, target_names=target_names)

print("\n--- Support Vector Machine (SVC) Evaluation ---")
print(f"Accuracy: {accuracy_svm:.4f}")
print("Classification Report:")
print(report_svm)
```

We initialize `SVC` with its default parameters, which include the Radial Basis Function (RBF) kernel. Training and evaluation follow the same pattern as before, using the scaled data. SVMs have several hyperparameters (such as `C` and `gamma` for the RBF kernel) that significantly influence performance; tuning these is covered in later chapters.
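As a brief, illustrative preview of that tuning process (a minimal sketch using 5-fold cross-validation on the scaled training data; the value grids here are arbitrary choices, and systematic tuning is deferred to Chapter 5), we can check how sensitive KNN is to k and `SVC` is to `C`:

```python
from sklearn.model_selection import cross_val_score

# How does KNN's cross-validated accuracy change with k? (illustrative values of k)
for k in [1, 3, 5, 7, 9]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             X_train_scaled, y_train, cv=5)
    print(f"KNN k={k}: mean CV accuracy = {scores.mean():.4f}")

# How does SVC's cross-validated accuracy change with C? (illustrative values of C)
for C in [0.1, 1.0, 10.0]:
    scores = cross_val_score(SVC(C=C, random_state=42),
                             X_train_scaled, y_train, cv=5)
    print(f"SVC C={C}: mean CV accuracy = {scores.mean():.4f}")
```

Because these scores are computed only on the training folds, the test set remains untouched for the final evaluation.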
## Visualizing Confusion Matrices

A confusion matrix provides a more detailed breakdown of classification performance than accuracy alone. It shows the number of correct and incorrect predictions for each class. Let's visualize the confusion matrix for the SVM model using Plotly.

```python
# Calculate confusion matrix for SVM
cm_svm = confusion_matrix(y_test, y_pred_svm)

# Create heatmap using Plotly
fig = go.Figure(data=go.Heatmap(
    z=cm_svm,
    x=target_names,
    y=target_names,
    hoverongaps=False,
    colorscale=[[0.0, '#e9ecef'], [0.25, '#a5d8ff'], [0.5, '#74c0fc'],
                [0.75, '#4dabf7'], [1.0, '#1c7ed6']],  # Gray to Blue scale
    colorbar=dict(title='Count')))

fig.update_layout(
    title='SVM Confusion Matrix',
    xaxis_title="Predicted Label",
    yaxis_title="True Label",
    xaxis={'side': 'top'},
    yaxis_autorange='reversed',  # Standard orientation for confusion matrices
    width=500, height=450,       # Adjust size as needed
    margin=dict(l=50, r=50, t=100, b=50)  # Adjust margins
)

# To display the plot (e.g., in a Jupyter environment or save to HTML)
# fig.show()

# If not in an interactive environment, print the raw matrix
print("\n--- SVM Confusion Matrix ---")
print(cm_svm)

# Generate Plotly JSON for web embedding
plotly_json_svm_cm = pio.to_json(fig)
print(f"\n```plotly\n{plotly_json_svm_cm}\n```")  # For embedding
```

```plotly
{"layout": {"title": {"text": "SVM Confusion Matrix"}, "xaxis": {"title": {"text": "Predicted Label"}, "side": "top", "tickvals": ["setosa", "versicolor", "virginica"], "ticktext": ["setosa", "versicolor", "virginica"]}, "yaxis": {"title": {"text": "True Label"}, "autorange": "reversed", "tickvals": ["setosa", "versicolor", "virginica"], "ticktext": ["setosa", "versicolor", "virginica"]}, "width": 500, "height": 450, "margin": {"l": 50, "r": 50, "t": 100, "b": 50}, "template": "plotly_white"}, "data": [{"z": [[15, 0, 0], [0, 14, 1], [0, 1, 14]], "x": ["setosa", "versicolor", "virginica"], "y": ["setosa", "versicolor", "virginica"], "hoverongaps": false, "type": "heatmap", "colorscale": [[0.0, "#e9ecef"], [0.25, "#a5d8ff"], [0.5, "#74c0fc"], [0.75, "#4dabf7"], [1.0, "#1c7ed6"]], "colorbar": {"title": {"text": "Count"}}}]}
```

The confusion matrix for the SVM classifier. Rows represent the actual classes, and columns represent the predicted classes. The diagonal elements show correct predictions, while off-diagonal elements show misclassifications. For example, one 'versicolor' instance was misclassified as 'virginica', and one 'virginica' was misclassified as 'versicolor'. All 'setosa' instances were classified correctly.

This visualization helps identify specific confusion patterns, such as which classes are most often mistaken for each other. You could generate similar matrices for Logistic Regression and KNN to compare their error patterns.
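For a quick textual comparison (a minimal sketch that reuses the predictions computed earlier and skips the Plotly styling), the raw matrices for all three models can be printed side by side:

```python
# Print raw confusion matrices for all three models (illustrative comparison)
for name, y_pred in [("Logistic Regression", y_pred_log_reg),
                     ("KNN (k=5)", y_pred_knn),
                     ("SVC", y_pred_svm)]:
    print(f"\n--- {name} Confusion Matrix ---")
    print(confusion_matrix(y_test, y_pred))
```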
## Summary

In this practical exercise, you implemented three fundamental classification algorithms: Logistic Regression, K-Nearest Neighbors, and Support Vector Machines. You learned the standard Scikit-learn workflow: instantiate the model, fit it on the training data (using scaled features where appropriate), predict on the test data, and evaluate using metrics such as accuracy, precision, recall, F1-score, and the confusion matrix. You saw that even with default parameters, these models can achieve high accuracy on the Iris dataset. Remember that other datasets often require more extensive preprocessing (Chapter 4) and careful model selection and tuning (Chapter 5) to achieve optimal results.