Moving beyond linear models like Logistic Regression, we encounter a different family of classification algorithms known as instance-based or non-parametric learners. The K-Nearest Neighbors (KNN) algorithm is a prime example and one of the simplest, yet often surprisingly effective, classification methods.
Unlike models that learn explicit decision boundaries by estimating parameters (like the coefficients in Logistic Regression), KNN operates on a principle of similarity. At its core, KNN classifies a new data point based on the majority class among its 'K' closest neighbors in the feature space.
How KNN Works: Classification by Proximity
Imagine you have a dataset plotted in a multi-dimensional space, where each axis represents a feature. Each point in this space belongs to a known class. When a new, unclassified data point arrives, KNN follows these steps:
Calculate Distances: It computes the distance between the new point and every existing point in the training dataset. The most common distance metric used is the Euclidean distance, although others like Manhattan distance can also be employed. For two points $p = (p_1, p_2, \ldots, p_n)$ and $q = (q_1, q_2, \ldots, q_n)$ in an n-dimensional feature space, the Euclidean distance is:

$$d(p, q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \cdots + (p_n - q_n)^2}$$
Identify Neighbors: It identifies the 'K' data points from the training set that are closest (have the smallest distances) to the new point. These are its 'K' nearest neighbors.
Majority Vote: It looks at the classes of these 'K' neighbors. The new data point is assigned the class that represents the majority among its neighbors. For instance, if K=5 and 3 neighbors belong to Class A and 2 belong to Class B, the new point is classified as Class A. In case of a tie (e.g., K=4, with 2 neighbors of Class A and 2 of Class B), the tie is often broken arbitrarily or by considering distances.
The following visualization illustrates this concept for K=3.
The new yellow star point needs classification. Its 3 nearest neighbors (circled in green) are identified. Two neighbors belong to Class A (blue) and one belongs to Class B (pink). By majority vote (2 vs 1), the new point is classified as Class A.
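To make these three steps concrete, here is a minimal from-scratch sketch in Python. The helper name `knn_predict` and the toy two-class dataset are purely illustrative; the point is to show the distance calculation, neighbor selection, and majority vote in a handful of lines (the Scikit-learn implementation we use later handles this far more efficiently).

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify a single point x_new by majority vote among its k nearest neighbors."""
    # Step 1: Euclidean distance from x_new to every training point.
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 2: indices of the k training points with the smallest distances.
    nearest = np.argsort(distances)[:k]
    # Step 3: majority vote among those neighbors' labels.
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

# Toy 2-D dataset: Class A clustered near (1, 1), Class B near (5, 5).
X_train = np.array([[1.0, 1.2], [0.8, 0.5], [1.5, 1.0], [5.2, 4.8], [4.9, 5.1]])
y_train = np.array(["A", "A", "A", "B", "B"])

print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))  # -> "A"
```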
Choosing the Value of 'K'
The number of neighbors, 'K', is a hyperparameter that significantly influences the model's behavior.
Small 'K' (e.g., K=1): The model is very sensitive to noise and outliers. The decision boundary can be highly irregular, potentially leading to high variance and overfitting.
Large 'K': The model becomes smoother and less sensitive to local variations. The decision boundary becomes more generalized. However, if 'K' is too large (e.g., K equals the total number of data points), the model might become too biased and predict the majority class for all new points, leading to high bias and underfitting.
Choosing the optimal 'K' often involves experimentation and using techniques like cross-validation (which we will cover in Chapter 5) to evaluate performance for different 'K' values. It's typically recommended to choose an odd value for 'K' in binary classification problems to avoid ties.
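As a brief preview of that workflow, the hedged sketch below uses Scikit-learn's `cross_val_score` to compare a few odd values of 'K' on the Iris dataset that ships with the library. The particular grid of 'K' values is only illustrative; in practice you would tune it for your own data.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Evaluate a small grid of odd K values with 5-fold cross-validation.
for k in [1, 3, 5, 7, 9, 11]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(f"K={k:2d}: mean accuracy = {scores.mean():.3f}")
```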
Importance of Feature Scaling
A critical aspect of using KNN is its sensitivity to the scale of the features. Since KNN relies on distance calculations, features with larger numerical ranges can disproportionately dominate the distance metric, effectively drowning out the influence of features with smaller ranges. For example, if one feature ranges from 0 to 1000 and another from 0 to 1, the first feature will almost entirely determine the distance calculation.
Therefore, it is almost always necessary to scale your features before applying KNN. Common scaling techniques include Standardization (scaling to zero mean and unit variance) using StandardScaler or Normalization (scaling to a range, typically [0, 1]) using MinMaxScaler. We will explore these techniques in detail in Chapter 4: Data Preprocessing and Feature Engineering.
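To see the effect, the sketch below compares the same classifier with and without a `StandardScaler`, using the wine dataset bundled with Scikit-learn (chosen here only because its features span very different ranges, e.g., proline in the thousands versus hue near 1). The exact accuracy figures will vary with the train/test split; what matters is the gap between the two runs.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Wine features are on very different scales, which distorts raw distances.
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)

unscaled = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)).fit(X_train, y_train)

print("Accuracy without scaling:", unscaled.score(X_test, y_test))
print("Accuracy with scaling:   ", scaled.score(X_test, y_test))
```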
Strengths and Weaknesses of KNN
Strengths:
Simplicity: The algorithm is easy to understand and implement.
Non-parametric: It makes no strong assumptions about the underlying data distribution. This allows it to learn complex decision boundaries.
Adaptability: Because KNN stores the training data rather than fitting explicit parameters, newly available training examples can simply be added to the stored dataset without refitting a model (though predictions become more expensive as the dataset grows).
Weaknesses:
Computational Cost: Because every prediction requires computing distances to all training points, KNN can be slow and memory-intensive for large datasets. Performance also degrades as the number of features grows, since distances become less informative in high-dimensional spaces, a problem known as the "curse of dimensionality."
Sensitivity to Irrelevant Features: Features that don't help distinguish between classes can degrade performance as they still contribute to the distance calculation. Feature selection or dimensionality reduction techniques can be beneficial.
Sensitivity to Feature Scaling: As mentioned, performance heavily depends on appropriate feature scaling.
Need for Optimal 'K': The performance is sensitive to the choice of 'K'.
KNN provides a fundamentally different approach to classification based on instance similarity. In the next section, we will see how to implement the KNeighborsClassifier using Scikit-learn and apply it to practical classification problems.