Okay, let's put the concepts we've learned about K-Nearest Neighbors into practice. In the previous sections, we discussed how KNN works by finding the 'k' closest training examples to a new data point and making a prediction based on the majority class among those neighbors. Now, we'll walk through implementing a KNN classifier using a common Python library, Scikit-learn, on a well-known dataset.
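To make that idea concrete before we hand the work over to a library, here is a minimal NumPy sketch of the core prediction step (Euclidean distances plus a majority vote); the function and variable names are just illustrative, and we'll rely on Scikit-learn's optimized implementation for the rest of this section.
import numpy as np
from collections import Counter
def knn_predict_one(x_new, X_train, y_train, k=5):
    # Euclidean distance from the new point to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote among the labels of those neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]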
We'll use the famous Iris dataset. This dataset contains measurements for 150 iris flowers belonging to three different species: Setosa, Versicolor, and Virginica. For each flower, we have four features: sepal length, sepal width, petal length, and petal width (all measured in centimeters).
Our objective is to build a KNN model that can predict the species of an iris flower based on these four measurements. This is a classic example of a multi-class classification problem.
We'll use Python and the Scikit-learn library. If you haven't used Scikit-learn before, it's a powerful and widely used library for machine learning tasks. You'll also need libraries like NumPy for numerical operations and Matplotlib/Seaborn for plotting (optional, but helpful for understanding).
Make sure you have these installed. You can typically install them using pip:
pip install scikit-learn numpy matplotlib seaborn pandas
Scikit-learn conveniently includes the Iris dataset. Let's load it.
import pandas as pd
from sklearn.datasets import load_iris
import numpy as np
# Load the Iris dataset
iris = load_iris()
# The dataset is loaded as a Bunch object (similar to a dictionary)
# iris.data contains the features (numpy array)
# iris.target contains the labels (0, 1, 2 corresponding to species)
# iris.feature_names contains the names of the features
# iris.target_names contains the names of the species
# For easier handling, let's put it into a Pandas DataFrame
# This is optional but often convenient
df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                  columns=iris['feature_names'] + ['target'])
# Map target numbers to species names for clarity
df['species'] = df['target'].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})
print("First 5 rows of the Iris dataset:")
print(df.head())
print("\nTarget classes (Species):")
print(df['species'].unique())
# Separate features (X) and target (y)
X = iris.data # Features (numpy array)
y = iris.target # Target labels (numpy array)
You should see the first few rows of data, showing the measurements and the corresponding target label (0, 1, or 2) and species name.
As discussed in Chapter 2 and revisited in Chapter 6, we need to split our data. We'll train the model on one portion (the training set) and evaluate its performance on a separate, unseen portion (the testing set). This helps us understand how well our model generalizes to new data.
Scikit-learn provides a handy function, train_test_split, for this.
from sklearn.model_selection import train_test_split
# Split data into training and testing sets
# test_size=0.3 means 30% of the data will be used for testing
# random_state ensures reproducibility (we get the same split every time)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
print(f"Training set shape: {X_train.shape}")
print(f"Testing set shape: {X_test.shape}")
We use stratify=y to ensure that the proportion of each flower species is roughly the same in both the training and testing sets, which is good practice for classification tasks.
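If you want to see the effect of stratification yourself, a quick optional check is to count the labels in each split:
# Optional: verify that each split contains roughly the same class proportions
print("Training label counts:", np.bincount(y_train))
print("Testing label counts: ", np.bincount(y_test))
With 150 flowers split 70/30 and stratified, you should see 35 of each species in the training set and 15 of each in the test set.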
Remember from Chapter 6 that KNN relies on distance calculations (like Euclidean distance) between data points. If features have vastly different scales (e.g., one feature ranges from 0-1 and another from 100-1000), the feature with the larger range can dominate the distance calculation. Therefore, scaling features to a similar range is often important for KNN. We'll use StandardScaler from Scikit-learn, which standardizes features by removing the mean and scaling to unit variance.
from sklearn.preprocessing import StandardScaler
# Initialize the StandardScaler
scaler = StandardScaler()
# Fit the scaler ONLY on the training data
scaler.fit(X_train)
# Transform both the training and testing data
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Note: It's important to fit the scaler only on the training data
# and then use that fitted scaler to transform both sets.
# This prevents information from the test set "leaking" into the training process.
# Let's look at the first few rows of the scaled data (optional)
# print("\nFirst 5 rows of scaled training data:")
# print(X_train_scaled[:5])
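As an optional sanity check, each scaled training feature should now have a mean close to 0 and a standard deviation close to 1:
# Optional: confirm what StandardScaler did to the training features
print("Means:   ", X_train_scaled.mean(axis=0).round(2))
print("Std devs:", X_train_scaled.std(axis=0).round(2))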
Now we can create our KNN classifier. The main parameter we need to choose is n_neighbors, which is the 'k' value we discussed. Let's start with a common value, like k=5.
from sklearn.neighbors import KNeighborsClassifier
# Initialize the KNN classifier with k=5
knn = KNeighborsClassifier(n_neighbors=5)
# Train the model using the scaled training data
knn.fit(X_train_scaled, y_train)
print("\nKNN model trained successfully with k=5.")
The fit method is where the "learning" happens for many Scikit-learn models. For KNN, however, fit is very simple: it primarily just stores the training data (X_train_scaled and y_train) so it can be referenced later when making predictions.
With our trained model, we can now predict the species for the flowers in our test set (X_test_scaled).
# Make predictions on the scaled test data
y_pred = knn.predict(X_test_scaled)
# Display the first 10 predictions alongside the actual labels
print("\nFirst 10 Predictions vs Actual Labels:")
print(f"Predictions: {y_pred[:10]}")
print(f"Actual: {y_test[:10]}")
# Remember: 0=setosa, 1=versicolor, 2=virginica
The predict method takes the new data points (our scaled test features) and, for each point, finds the 5 nearest neighbors in the stored training data. It then predicts the class based on the majority vote among those neighbors.
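If you're curious which training points drive a particular prediction, the classifier's kneighbors method returns the distances to, and indices of, the nearest neighbors. Here is a small optional peek at the first test flower:
# Optional: inspect the 5 nearest training neighbors of the first test point
distances, indices = knn.kneighbors(X_test_scaled[:1])
print("Neighbor distances:", distances.round(3))
print("Neighbor labels:   ", y_train[indices])
If all five neighbor labels agree, the majority vote is unanimous; if they are mixed, the most common label wins.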
How well did our model do? We need to compare the predictions (y_pred) with the actual labels (y_test). We learned about evaluation metrics in the previous section. Let's calculate accuracy and look at the confusion matrix.
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"\nModel Accuracy: {accuracy:.4f}")
# Calculate the confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(cm)
# Visualize the confusion matrix
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=iris.target_names)
disp.plot(cmap=plt.cm.Blues) # Use a blue color map
plt.title("Confusion Matrix for KNN (k=5)")
plt.show()
A confusion matrix showing the performance of the KNN classifier on the Iris test set. Rows represent the true classes, and columns represent the predicted classes. Diagonal elements show correct predictions.
The accuracy tells us the overall proportion of correct predictions. The confusion matrix gives a more detailed breakdown: the diagonal entries show how many test flowers of each species were classified correctly, while the off-diagonal entries show where the model confused one species for another (for example, a Versicolor flower predicted as Virginica).
In this case (results may vary slightly depending on the random_state), the KNN model with k=5 usually performs very well on the Iris dataset, often achieving high accuracy with few misclassifications shown in the confusion matrix.
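Beyond accuracy and the confusion matrix, it can also be helpful to look at per-class precision and recall. Scikit-learn's classification_report prints a compact summary (optional):
from sklearn.metrics import classification_report
# Optional: precision, recall, and F1-score for each species
print(classification_report(y_test, y_pred, target_names=iris.target_names))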
The choice of k (the number of neighbors) can influence the model's performance. A small k might make the model sensitive to noise, while a very large k might oversmooth the decision boundary.
Try changing the n_neighbors parameter when creating the KNeighborsClassifier (e.g., try k=1, k=3, or k=10) and rerun steps 4, 5, and 6. Observe how the accuracy and confusion matrix change. Finding the optimal k often involves trying several values and seeing which one performs best on a validation set (or using techniques like cross-validation, which are slightly more advanced topics; a short sketch appears after the k=3 comparison below).
For instance, let's quickly check k=3:
# Initialize, train, predict, and evaluate for k=3
knn_k3 = KNeighborsClassifier(n_neighbors=3)
knn_k3.fit(X_train_scaled, y_train)
y_pred_k3 = knn_k3.predict(X_test_scaled)
accuracy_k3 = accuracy_score(y_test, y_pred_k3)
print(f"\nModel Accuracy with k=3: {accuracy_k3:.4f}")
cm_k3 = confusion_matrix(y_test, y_pred_k3)
disp_k3 = ConfusionMatrixDisplay(confusion_matrix=cm_k3, display_labels=iris.target_names)
disp_k3.plot(cmap=plt.cm.Greens) # Use a green color map this time
plt.title("Confusion Matrix for KNN (k=3)")
plt.show()
A confusion matrix showing the performance of the KNN classifier with k=3 on the Iris test set.
Compare the results. Does k=3 perform better or worse than k=5 on this specific test set? There isn't always one "best" k for all datasets; it often depends on the data's structure.
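If you'd like a slightly more systematic way to compare values of k than rerunning the notebook by hand, one common approach is to score several candidates with cross-validation on the training data. The snippet below is just a sketch of that idea:
from sklearn.model_selection import cross_val_score
# Sketch: estimate accuracy for several k values with 5-fold cross-validation
# on the scaled training data, then compare the average scores
for k in [1, 3, 5, 7, 9, 11]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             X_train_scaled, y_train, cv=5)
    print(f"k={k}: mean CV accuracy = {scores.mean():.4f}")
Whichever k scores best here would then be retrained on the full training set and evaluated once on the held-out test set.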
In this practice section, you successfully implemented a K-Nearest Neighbors classifier: you loaded the Iris dataset, split it into training and testing sets, scaled the features, created a KNeighborsClassifier instance from Scikit-learn, trained it, made predictions on the test set, evaluated them with accuracy and a confusion matrix, and experimented with how the choice of k can affect results.
This hands-on exercise demonstrates the typical workflow for applying a supervised learning algorithm to a classification problem using standard tools. You now have practical experience implementing one of the fundamental classification algorithms.