Identifying similar users or items is a fundamental task in developing recommendation systems, especially when working with user-item interaction matrices. The core of neighborhood-based methods relies on this very idea. An algorithm is needed that can take a user or an item, compare it to all others, and return a ranked list of its closest peers, or its 'neighborhood.' This is precisely the job of the k-Nearest Neighbors (k-NN) algorithm.
At its core, k-Nearest Neighbors is a straightforward and intuitive algorithm. It operates on the principle that things that are close to each other in a given space are likely to be similar. In the context of our user-item matrix, we can think of each user or each item as a point in a multi-dimensional space.
The k-NN algorithm finds the points (neighbors) in this space that are closest to a target point. The "distance" between these points is calculated using a similarity metric, which we will cover in the next section. The parameter k is a number you choose; for instance, if you set k = 5, the algorithm will find the five closest neighbors.
Choosing the right value for k is a balancing act. A very small k (e.g., k = 1) can make your recommendations sensitive to noise; a single user with eccentric tastes might heavily influence the outcome. A very large k might smooth out the predictions too much, including neighbors who are not genuinely similar and diluting the preferences of the most relevant ones. The optimal k is typically found through experimentation and evaluation.
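Before turning to a library implementation, it helps to see the mechanics written out directly. The following is a minimal sketch, assuming a small made-up ratings matrix and k = 2: it computes the cosine distance from one target user to every other user and keeps the k closest.

import numpy as np

# Toy ratings matrix (illustrative values): each row is a user, each column an item
ratings = np.array([
    [5.0, 3.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 5.0],
    [0.0, 1.0, 5.0, 4.0],
])

def cosine_distance(a, b):
    # Cosine distance is 1 minus cosine similarity
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

target = ratings[0]
k = 2

# Distance from the target user to every other user (rows 1 onward)
distances = np.array([cosine_distance(target, other) for other in ratings[1:]])

# argsort returns indices into ratings[1:], so add 1 to map back to user indices
neighbors = np.argsort(distances)[:k] + 1
print("Nearest neighbors of user 0:", neighbors)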
Let's illustrate this with a user-based example. Imagine we want to find recommendations for a user named Alex. First, we represent Alex as a vector of their ratings. Then, we apply the k-NN algorithm to find other users whose rating vectors are most similar to Alex's. These users form Alex's neighborhood. The assumption is that items liked by this neighborhood, which Alex has not yet seen, are excellent candidates for recommendation.
Alex's neighborhood for k = 3. The algorithm identifies Ben, Chloe, and David as the nearest neighbors based on a distance metric, while Eva and Frank are considered too dissimilar to be included.
The same logic applies to item-based filtering, but our perspective shifts. Instead of finding similar users, we find similar items. We represent each item as a vector of ratings it has received from all users. To recommend something to Alex, we would look at the items Alex has already rated highly, like "Movie A". We then use k-NN to find the movies most similar to "Movie A" based on how the entire user population has rated them. These neighboring movies, which Alex has not yet seen, are then recommended.
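As a sketch of that shift in perspective, the snippet below transposes a small made-up user-item matrix so that each row represents an item described by the ratings it has received, then runs the same kind of neighbor search to find items similar to item 0, our stand-in for "Movie A". The matrix values here are illustrative assumptions.

import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy user-item matrix (illustrative values): rows are users, columns are items
user_item = np.array([
    [5, 4, 1, 0],
    [4, 5, 0, 1],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
])

# Transpose so each row is now an item, described by the ratings it received
item_user = user_item.T

# Request 3 neighbors: the query item itself is returned first, at distance 0
item_model = NearestNeighbors(n_neighbors=3, metric='cosine', algorithm='brute')
item_model.fit(item_user)

# Find the items most similar to item 0
distances, indices = item_model.kneighbors(item_user[[0]])
print("Neighbors of item 0:", indices)
print("Cosine distances:", distances)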
This item-based approach is often preferred in practice. User tastes can change over time, but the relationship between items is generally more stable. For example, people who like Star Wars: A New Hope are consistently likely to also enjoy The Empire Strikes Back. Furthermore, since the number of items is often much smaller than the number of users, computing item-item similarity can be more computationally efficient.
While it's important to understand the mechanics, libraries like Scikit-learn provide efficient implementations of k-NN. Let's see how to find the neighbors for the first user in a sparse user-item matrix.
First, we need some data. We'll use a csr_matrix from SciPy, which is a great format for sparse data where most of the entries are zero.
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors
# User-item matrix: 4 users, 5 items
# A value of 0 means the user has not rated the item.
user_item_data = np.array([
[5, 1, 0, 0, 3],
[4, 1, 2, 0, 0],
[0, 0, 5, 4, 1],
[0, 2, 4, 5, 0]
])
# Create a sparse matrix
user_item_sparse = csr_matrix(user_item_data)
print(user_item_sparse)
(0, 0) 5
(0, 1) 1
(0, 4) 3
(1, 0) 4
(1, 1) 1
(1, 2) 2
(2, 2) 5
(2, 3) 4
(2, 4) 1
(3, 1) 2
(3, 2) 4
(3, 3) 5
Now, let's configure the NearestNeighbors model to find the 2 closest neighbors for each user. We'll use cosine similarity as our distance metric (note: Scikit-learn's NearestNeighbors uses cosine distance, which is 1 - cosine similarity).
# Request 3 neighbors: the query user is always returned first (at distance 0),
# leaving 2 genuine neighbors
k = 3
# Initialize the model for user-based filtering
# metric='cosine' calculates cosine distance
nn_model = NearestNeighbors(n_neighbors=k, metric='cosine', algorithm='brute')
# Fit the model to our user-item matrix
nn_model.fit(user_item_sparse)
# Let's find neighbors for the first user (index 0)
target_user_index = 0
target_user_vector = user_item_sparse[target_user_index]
# Find the nearest neighbors
distances, indices = nn_model.kneighbors(target_user_vector)
print("Query User Index:", target_user_index)
print("Neighbor Indices:", indices)
print("Distances:", distances)
The output will look something like this:
Query User Index: 0
Neighbor Indices: [[0 1 2]]
Distances: [[0. 0.22540333 0.92175392]]
Here's how to interpret the results for our target user (index 0):
indices: The array [[0 1 2]] contains the indices of the nearest neighbors. The first result, 0, is the user themselves, as their distance to their own vector is always zero. The other neighbors are user 1 and user 2.
distances: The array [[0. 0.225... 0.921...]] contains the corresponding cosine distances. User 1 is much closer (distance of ~0.23) than user 2 (distance of ~0.92).
With these neighbors identified, we have the foundation for making a prediction. The next step, which we'll explore in the following sections, is to use the ratings from these neighbors to estimate what our target user might rate an item they haven't seen. We will also examine the similarity metrics that power this process in much greater detail.
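As a small preview of that step, the sketch below reuses the neighbors we just found to collect candidate items: items the neighbors have rated but the target user has not. It assumes the user_item_sparse, target_user_index, and indices objects from the example above, and the simple rule that any item rated by a neighbor counts as a candidate is only a placeholder for the weighted predictions discussed later.

# Items the target user has already rated (non-zero entries in their row)
target_ratings = user_item_sparse[target_user_index].toarray().ravel()
already_rated = set(np.where(target_ratings > 0)[0].tolist())

# Gather every item rated by the neighbors, skipping the target user themselves
candidate_items = set()
for neighbor_index in indices.ravel():
    if neighbor_index == target_user_index:
        continue
    neighbor_ratings = user_item_sparse[neighbor_index].toarray().ravel()
    candidate_items.update(np.where(neighbor_ratings > 0)[0].tolist())

# Candidates: items the neighbors know about but the target user has not rated
recommendations = sorted(candidate_items - already_rated)
print("Candidate items for user", target_user_index, ":", recommendations)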