Now that you understand the intuition behind the K-Nearest Neighbors (KNN) algorithm, classifying a data point based on the majority class among its closest neighbors, let's look at how to put it into practice using Scikit-learn. The library provides a straightforward implementation through the KNeighborsClassifier class, located in the sklearn.neighbors module.
The primary tool for KNN classification in Scikit-learn is KNeighborsClassifier. Like other Scikit-learn estimators, it follows the familiar fit/predict pattern.
First, you need to import the necessary components:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler # We'll see why this is important
from sklearn.datasets import load_iris # A sample dataset
Let's use the Iris dataset, a classic classification benchmark, to demonstrate. This dataset contains measurements for three species of Iris flowers.
# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
# Output shapes to verify the split
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")
Now, we instantiate the KNeighborsClassifier. The most significant hyperparameter you'll specify is n_neighbors, which corresponds to the k value in the KNN algorithm. This determines how many neighbors are considered when making a prediction. Let's start with k=5.
# Instantiate the classifier with k=5
knn = KNeighborsClassifier(n_neighbors=5)
Training the KNN model is conceptually simple. Since KNN is an instance-based learner, the fit method primarily involves storing the training data (X_train and y_train) in an efficient structure (such as a Ball Tree or KD Tree by default) to allow fast querying of nearest neighbors later.
# Train the classifier (store the training data)
knn.fit(X_train, y_train)
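If you want to control which search structure is used, KNeighborsClassifier also exposes an algorithm parameter ('auto', 'ball_tree', 'kd_tree', or 'brute'). This is optional and shown here only as a small illustration:
# Optional: explicitly request a KD Tree for neighbor lookups
# (the default, algorithm='auto', selects an appropriate structure automatically)
knn_kd = KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree')
knn_kd.fit(X_train, y_train)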
With the model "trained" (i.e., the training data stored), we can make predictions on new, unseen data (X_test). The predict method finds the k nearest neighbors in the training data for each point in the test set and assigns the most frequent class among those neighbors.
# Make predictions on the test set
y_pred = knn.predict(X_test)
# Compare predictions to the actual labels
print("Sample Predictions:", y_pred[:10])
print("Actual Labels: ", y_test[:10])
We can evaluate the model's performance using metrics suitable for classification, such as accuracy.
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.4f}")
KNN relies heavily on calculating distances between data points (typically Euclidean distance by default). If features have vastly different scales (e.g., one feature ranges from 0 to 1, while another ranges from 1000 to 50000), the feature with the larger scale will dominate the distance calculation. This can lead to suboptimal performance, as the algorithm might implicitly assign more importance to features with larger values, regardless of their actual predictive significance.
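To see this concretely, consider two hypothetical points whose first feature lies on a 0 to 1 scale and whose second lies on a 1000 to 50000 scale. The Euclidean distance between them is determined almost entirely by the second feature:
# Hypothetical points: feature 1 ranges 0-1, feature 2 ranges 1000-50000
a = np.array([0.2, 12000.0])
b = np.array([0.9, 12500.0])
# Euclidean distance: sqrt(0.7**2 + 500**2) is approximately 500.0005,
# so the difference in feature 1 contributes almost nothing.
print(np.linalg.norm(a - b))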
Therefore, it's almost always recommended to scale your features before applying KNN. A common technique is standardization, which transforms features to have zero mean and unit variance. Scikit-learn's StandardScaler is ideal for this.
Let's apply scaling and retrain the model:
# Instantiate the scaler
scaler = StandardScaler()
# Fit the scaler on the training data and transform both training and test data
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test) # Use transform, not fit_transform, on test data!
# Instantiate a new KNN classifier
knn_scaled = KNeighborsClassifier(n_neighbors=5)
# Train on the scaled data
knn_scaled.fit(X_train_scaled, y_train)
# Predict on the scaled test data
y_pred_scaled = knn_scaled.predict(X_test_scaled)
# Evaluate the scaled model
accuracy_scaled = accuracy_score(y_test, y_pred_scaled)
print(f"Scaled Model Accuracy: {accuracy_scaled:.4f}")
You'll often observe an improvement in accuracy after scaling the features, especially if the original features had different ranges. Remember to fit the scaler only on the training data and then use it to transform both the training and test sets to prevent data leakage from the test set into the training process.
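One convenient way to keep this fit/transform discipline automatic is to chain the scaler and classifier together with Scikit-learn's Pipeline. This is optional here; a minimal sketch looks like:
from sklearn.pipeline import Pipeline
# The pipeline fits the scaler only on the data passed to fit()
# and applies the same transformation when predicting or scoring.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])
pipe.fit(X_train, y_train)
print(f"Pipeline Accuracy: {pipe.score(X_test, y_test):.4f}")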
The choice of k (n_neighbors) significantly impacts the model's behavior. Small values of k make predictions sensitive to noise in the training data, while large values smooth the decision boundary and can blur genuine distinctions between classes.
Finding the optimal k often involves trying different values and evaluating their performance, typically using cross-validation techniques (which we will cover in Chapter 5). For now, be aware that k=5 is a common starting point, but it might not be the best choice for every dataset.
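Until then, a rough way to get a feel for different values of k is to loop over a few candidates and compare accuracy on the held-out test set. This is a quick sketch, not a substitute for proper tuning:
# Compare a few values of k on the scaled data
for k in [1, 3, 5, 7, 9, 11]:
    knn_k = KNeighborsClassifier(n_neighbors=k)
    knn_k.fit(X_train_scaled, y_train)
    print(f"k={k}: test accuracy = {knn_k.score(X_test_scaled, y_test):.4f}")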
Other potentially useful hyperparameters for KNeighborsClassifier include:

weights: Determines how neighbors' votes are weighted. 'uniform' (the default) gives equal weight to all neighbors, while 'distance' assigns weights inversely proportional to distance, so closer neighbors have more influence.

metric: Specifies the distance metric to use (e.g., 'minkowski' with p=2 for Euclidean distance, or p=1 for Manhattan distance); a short example combining these options appears below.

Experimenting with these hyperparameters, particularly k, is a standard part of building an effective KNN model. We'll explore systematic ways to tune hyperparameters later in the course. For now, you have the tools to implement and evaluate a basic KNN classifier using Scikit-learn.
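As a final, purely illustrative sketch, here is one way to try these options together, reusing the scaled data from earlier:
# Distance-weighted voting with Manhattan distance (minkowski, p=1)
knn_weighted = KNeighborsClassifier(n_neighbors=5, weights='distance',
                                    metric='minkowski', p=1)
knn_weighted.fit(X_train_scaled, y_train)
print(f"Weighted KNN Accuracy: {knn_weighted.score(X_test_scaled, y_test):.4f}")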