Now that we have explored the concepts behind Logistic Regression, K-Nearest Neighbors (KNN), and Support Vector Machines (SVM), let's put them into practice. This section provides hands-on experience implementing these classifiers using Scikit-learn, evaluating their performance on a standard dataset, and interpreting the results. We assume you have a working Python environment with Scikit-learn, NumPy, and Pandas installed.
First, let's import the necessary libraries and load a dataset. We will use the well-known Iris dataset, which is conveniently included in Scikit-learn. This dataset contains measurements for 150 iris flowers belonging to three different species: setosa, versicolor, and virginica. The goal is to classify the species based on sepal length, sepal width, petal length, and petal width.
```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import plotly.graph_objects as go
import plotly.io as pio
# Configure Plotly for better display
pio.templates.default = "plotly_white"
# Load the dataset
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names
target_names = iris.target_names
# Create a DataFrame for easier inspection (optional)
df = pd.DataFrame(X, columns=feature_names)
df['species'] = y
df['species_name'] = df['species'].map({i: name for i, name in enumerate(target_names)})
# print(df.head())
# print(f"Target names: {target_names}")
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
# print(f"Training set shape: X={X_train.shape}, y={y_train.shape}")
# print(f"Testing set shape: X={X_test.shape}, y={y_test.shape}")
# Feature Scaling
# Algorithms like KNN and SVM are sensitive to feature scales.
# Logistic Regression can also benefit.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```
We load the data, separate the features (`X`) and the target variable (`y`), and split them into training and testing sets using `train_test_split`. The `stratify=y` argument ensures that the proportion of each class is approximately the same in both the training and testing sets, which is important for classification tasks, especially with imbalanced datasets (though Iris is balanced). Finally, we apply `StandardScaler` to standardize the features by removing the mean and scaling to unit variance. Note that we `fit` the scaler only on the training data and then `transform` both the training and testing data to prevent information leakage from the test set.
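If you want to verify these two points, a quick optional check (using only the variables defined above) confirms that the stratified split preserves the class proportions and that the scaled training features have approximately zero mean and unit variance:

```python
# Optional sanity checks on the split and the scaler
# Stratified split: class counts should be roughly proportional in both sets
print("Train class counts:", np.bincount(y_train))
print("Test class counts: ", np.bincount(y_test))

# StandardScaler: each training feature should have ~0 mean and ~1 standard deviation
print("Scaled train means:", X_train_scaled.mean(axis=0).round(3))
print("Scaled train stds: ", X_train_scaled.std(axis=0).round(3))
```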
Logistic Regression is a linear model commonly used for binary classification, but Scikit-learn's implementation also supports multi-class problems (like Iris) using a one-vs-rest (OvR) or multinomial scheme.
```python
# Initialize and train the Logistic Regression model
log_reg = LogisticRegression(random_state=42, multi_class='ovr', solver='liblinear') # Using scaled data
log_reg.fit(X_train_scaled, y_train)
# Make predictions on the test set
y_pred_log_reg = log_reg.predict(X_test_scaled)
# Evaluate the model
accuracy_log_reg = accuracy_score(y_test, y_pred_log_reg)
report_log_reg = classification_report(y_test, y_pred_log_reg, target_names=target_names)
print("--- Logistic Regression Evaluation ---")
print(f"Accuracy: {accuracy_log_reg:.4f}")
print("Classification Report:")
print(report_log_reg)
```
Here, we initialize `LogisticRegression`. We specify `multi_class='ovr'` (One-vs-Rest) and choose a `solver` suitable for this dataset ('liblinear' works well for smaller datasets). We train the model on the scaled training data (`X_train_scaled`, `y_train`) and then predict labels for the scaled test data (`X_test_scaled`). Finally, we calculate the accuracy and generate a classification report, which includes precision, recall, and F1-score for each class.
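Logistic Regression can also output class probabilities through `predict_proba`, which is useful when you want to see how confident the model is rather than just the hard label. A brief optional snippet, reusing the fitted `log_reg` from above:

```python
# Inspect class probabilities for the first few test samples
probabilities = log_reg.predict_proba(X_test_scaled[:5])
predictions = log_reg.predict(X_test_scaled[:5])
for probs, pred in zip(probabilities, predictions):
    print(f"Predicted: {target_names[pred]:<12} probabilities: {probs.round(3)}")
```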
KNN classifies a data point based on the majority class among its 'k' nearest neighbors in the feature space. The choice of 'k' and the distance metric are important considerations. Since KNN relies on distance calculations, feature scaling is generally required.
```python
# Initialize and train the KNN model
# Let's start with k=5
knn = KNeighborsClassifier(n_neighbors=5) # Using scaled data
knn.fit(X_train_scaled, y_train)
# Make predictions
y_pred_knn = knn.predict(X_test_scaled)
# Evaluate the model
accuracy_knn = accuracy_score(y_test, y_pred_knn)
report_knn = classification_report(y_test, y_pred_knn, target_names=target_names)
print("\n--- K-Nearest Neighbors (k=5) Evaluation ---")
print(f"Accuracy: {accuracy_knn:.4f}")
print("Classification Report:")
print(report_knn)
```
We initialize `KNeighborsClassifier` with `n_neighbors=5` (a common starting point for 'k'). We train it on the scaled training data and evaluate it on the scaled test data, just as with Logistic Regression. The performance of KNN can be sensitive to the value of `k`; experimentation (often using techniques like cross-validation, discussed in Chapter 5) is usually needed to find a good value.
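As a small preview of the model selection techniques in Chapter 5, one common approach is to compare cross-validated accuracy on the training set for several candidate values of `k`. A minimal sketch, assuming the scaled training data from above:

```python
from sklearn.model_selection import cross_val_score

# Compare a few values of k with 5-fold cross-validation on the training set
for k in [1, 3, 5, 7, 9, 11]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             X_train_scaled, y_train, cv=5)
    print(f"k={k:2d}  mean CV accuracy: {scores.mean():.4f}")
```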
SVMs aim to find the optimal hyperplane that separates the different classes in the feature space. We'll use the `SVC` (Support Vector Classifier) class from Scikit-learn. SVMs also typically require scaled features.
```python
# Initialize and train the SVM model
# Using default parameters (RBF kernel, C=1.0)
svm_clf = SVC(random_state=42) # Using scaled data
svm_clf.fit(X_train_scaled, y_train)
# Make predictions
y_pred_svm = svm_clf.predict(X_test_scaled)
# Evaluate the model
accuracy_svm = accuracy_score(y_test, y_pred_svm)
report_svm = classification_report(y_test, y_pred_svm, target_names=target_names)
print("\n--- Support Vector Machine (SVC) Evaluation ---")
print(f"Accuracy: {accuracy_svm:.4f}")
print("Classification Report:")
print(report_svm)
```
We initialize `SVC` with its default parameters, which include the Radial Basis Function (RBF) kernel. Training and evaluation follow the same pattern as before, using the scaled data. SVMs have several hyperparameters (such as `C` and `gamma` for the RBF kernel) that significantly influence performance; tuning these is covered in later chapters.
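To build intuition for how `C` and `gamma` shape the RBF kernel's behavior, you can train a few variants and compare their scores. The loop below is only a rough illustration using the held-out test set; for real tuning you would use cross-validation (for example with `GridSearchCV`) so the hyperparameters are not fitted to the test data. A minimal sketch, assuming the scaled data from above:

```python
# Illustrative only: effect of C and gamma on test accuracy
# (proper tuning should use cross-validation on the training set)
for C in [0.1, 1.0, 10.0]:
    for gamma in ['scale', 0.1, 1.0]:
        clf = SVC(C=C, gamma=gamma, random_state=42)
        clf.fit(X_train_scaled, y_train)
        acc = clf.score(X_test_scaled, y_test)
        print(f"C={C:<5} gamma={str(gamma):<6} test accuracy: {acc:.4f}")
```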
A confusion matrix provides a more detailed breakdown of classification performance than accuracy alone. It shows the number of correct and incorrect predictions for each class. Let's visualize the confusion matrix for the SVM model using Plotly.
```python
# Calculate confusion matrix for SVM
cm_svm = confusion_matrix(y_test, y_pred_svm)
# Create heatmap using Plotly
fig = go.Figure(data=go.Heatmap(
    z=cm_svm,
    x=target_names,
    y=target_names,
    hoverongaps=False,
    colorscale=[[0.0, '#e9ecef'], [0.25, '#a5d8ff'], [0.5, '#74c0fc'], [0.75, '#4dabf7'], [1.0, '#1c7ed6']],  # Gray to Blue scale
    colorbar=dict(title='Count')
))

fig.update_layout(
    title='SVM Confusion Matrix',
    xaxis_title="Predicted Label",
    yaxis_title="True Label",
    xaxis={'side': 'top'},
    yaxis_autorange='reversed',  # Standard orientation for confusion matrices
    width=500, height=450,  # Adjust size as needed
    margin=dict(l=50, r=50, t=100, b=50)  # Adjust margins
)
# To display the plot (e.g., in a Jupyter environment or save to HTML)
# fig.show()
# If not in an interactive environment, print the raw matrix
print("\n--- SVM Confusion Matrix ---")
print(cm_svm)
# Generate Plotly JSON for web embedding
plotly_json_svm_cm = pio.to_json(fig)
print(f"\n```plotly\n{plotly_json_svm_cm}\n```") # For embedding
The confusion matrix for the SVM classifier. Rows represent the actual classes, and columns represent the predicted classes. The diagonal elements show correct predictions, while off-diagonal elements show misclassifications. For example, one 'versicolor' instance was misclassified as 'virginica', and one 'virginica' was misclassified as 'versicolor'. All 'setosa' instances were classified correctly.
This visualization helps identify specific confusion patterns, such as which classes are most often mistaken for each other. You could generate similar matrices for Logistic Regression and KNN to compare their error patterns.
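For a quick comparison without building separate heatmaps, you can print the raw confusion matrices for the other two classifiers, reusing the predictions computed earlier:

```python
# Raw confusion matrices for the other two models
print("Logistic Regression confusion matrix:")
print(confusion_matrix(y_test, y_pred_log_reg))
print("\nKNN (k=5) confusion matrix:")
print(confusion_matrix(y_test, y_pred_knn))
```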
In this practical exercise, you implemented three fundamental classification algorithms: Logistic Regression, K-Nearest Neighbors, and Support Vector Machines. You learned the standard Scikit-learn workflow: instantiate the model, `fit` it on the training data (using scaled features where appropriate), `predict` on the test data, and evaluate with metrics such as accuracy, precision, recall, F1-score, and the confusion matrix. You saw that even with default parameters, these models can achieve high accuracy on the Iris dataset. Remember that real-world datasets often require more extensive preprocessing (Chapter 4) and careful model selection and tuning (Chapter 5) to achieve optimal results.
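If you ran all three models in the same session, the accuracy values computed earlier can be gathered into a short summary that makes the comparison explicit:

```python
# Summarize test accuracy for the three classifiers
results = {
    'Logistic Regression': accuracy_log_reg,
    'KNN (k=5)': accuracy_knn,
    'SVM (RBF kernel)': accuracy_svm,
}
for name, acc in results.items():
    print(f"{name:<20} test accuracy: {acc:.4f}")
```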