To build effective recommendation systems, quantifying the relationships between users or items is essential. Similarity metrics provide a formal method to measure how alike they are. These functions compare two vectors, representing two users or two items, and produce a score indicating their degree of similarity. The selection of a metric represents a design decision that can significantly influence a recommender's performance.
We will focus on two of the most widely used similarity metrics in collaborative filtering: Cosine Similarity and Pearson Correlation.
Cosine similarity measures the cosine of the angle between two non-zero vectors. In the context of recommendation systems, it evaluates the orientation of two users' or items' rating vectors rather than their magnitude. This is particularly useful because it captures the pattern of preferences while ignoring differences in overall rating magnitude. For example, one user might rate three movies (4, 2, 4) while another rates the same movies (2, 1, 2); the second vector is exactly half the first, so the two point in the same direction and cosine similarity judges their tastes identical.
The formula for cosine similarity between two vectors $A$ and $B$ is:

$$\text{sim}(A, B) = \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}$$
The score ranges from -1 (exactly opposite) to 1 (exactly the same direction), with 0 indicating orthogonality (no similarity). In most user-item rating scenarios, where ratings are non-negative, the score will range from 0 to 1.
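A quick NumPy check of that scale-invariance, using the illustrative vectors from above:

```python
import numpy as np

u = np.array([4, 2, 4])
v = np.array([2, 1, 2])  # exactly half of u: same direction, smaller magnitude

cosine = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
print(round(cosine, 3))  # 1.0
```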
Let's consider a simple user-based example with two users, Alice and Bob, who have both rated three movies.
| Movie | Alice's Rating | Bob's Rating |
|---|---|---|
| Movie 1 | 5 | 4 |
| Movie 2 | 3 | 2 |
| Movie 3 | 4 | 3 |
Their rating vectors are $A = (5, 3, 4)$ and $B = (4, 2, 3)$. To calculate the cosine similarity:
1. Calculate the dot product ($A \cdot B$):
$$A \cdot B = (5)(4) + (3)(2) + (4)(3) = 20 + 6 + 12 = 38$$
2. Calculate the magnitude of each vector ($\|A\|$ and $\|B\|$):
$$\|A\| = \sqrt{5^2 + 3^2 + 4^2} = \sqrt{50} \approx 7.071$$
$$\|B\| = \sqrt{4^2 + 2^2 + 3^2} = \sqrt{29} \approx 5.385$$
3. Compute the similarity:
$$\text{sim}(A, B) = \frac{38}{7.071 \times 5.385} \approx \frac{38}{38.079} \approx 0.998$$
A score of approximately 0.998 indicates that Alice and Bob have very similar taste in movies, as their rating vectors point in almost the same direction.
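To make the arithmetic concrete, here is a minimal NumPy sketch of the same calculation (the variable names are ours):

```python
import numpy as np

alice = np.array([5, 3, 4])
bob = np.array([4, 2, 3])

dot = np.dot(alice, bob)            # 5*4 + 3*2 + 4*3 = 38
norm_alice = np.linalg.norm(alice)  # sqrt(50) ≈ 7.071
norm_bob = np.linalg.norm(bob)      # sqrt(29) ≈ 5.385

similarity = dot / (norm_alice * norm_bob)
print(round(similarity, 3))         # 0.998
```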
While cosine similarity is effective, it has a limitation: it doesn't account for user-specific rating biases. For example, if User A rates items [1, 2, 3] and User B rates them [3, 4, 5], their preferences have a perfect linear relationship, but because the two vectors are offset from one another, their cosine similarity is only about 0.98 rather than a perfect 1.
The Pearson correlation coefficient solves this by first mean-centering the data. It measures the extent to which two variables are linearly related: in our case, the ratings given by two users or received by two items. It is effectively a mean-centered version of cosine similarity.
The formula for Pearson correlation between two users $u$ and $v$ is:

$$\text{sim}(u, v) = \frac{\sum_{i \in I} (r_{u,i} - \bar{r}_u)(r_{v,i} - \bar{r}_v)}{\sqrt{\sum_{i \in I} (r_{u,i} - \bar{r}_u)^2} \, \sqrt{\sum_{i \in I} (r_{v,i} - \bar{r}_v)^2}}$$

Where:
- $I$ is the set of items rated by both users
- $r_{u,i}$ is user $u$'s rating of item $i$
- $\bar{r}_u$ is user $u$'s average rating over the items in $I$
The score ranges from -1 to 1, where 1 indicates a perfect positive linear relationship, -1 indicates a perfect negative relationship, and 0 indicates no linear relationship.
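Before working through an example by hand, here is one way the formula might be translated into code. The function name `pearson_sim` is our own, and the sketch assumes both vectors already contain only the co-rated items:

```python
import numpy as np

def pearson_sim(u, v):
    """Pearson correlation between two rating vectors over co-rated items."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    u_centered = u - u.mean()  # remove user u's rating bias
    v_centered = v - v.mean()  # remove user v's rating bias
    denom = np.linalg.norm(u_centered) * np.linalg.norm(v_centered)
    if denom == 0:  # a user who gave every item the same rating
        return 0.0
    return float(np.dot(u_centered, v_centered) / denom)
```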
Let's revisit our example with a new user, Carol, who is a harsher critic.
| Movie | Alice's Rating | Carol's Rating |
|---|---|---|
| Movie 1 | 5 | 3 |
| Movie 2 | 3 | 1 |
| Movie 3 | 4 | 2 |
Calculate average ratings:
$$\bar{r}_{\text{Alice}} = \frac{5 + 3 + 4}{3} = 4, \qquad \bar{r}_{\text{Carol}} = \frac{3 + 1 + 2}{3} = 2$$
Create mean-centered rating vectors:
$$\text{Alice}: (5 - 4, \; 3 - 4, \; 4 - 4) = (1, -1, 0)$$
$$\text{Carol}: (3 - 2, \; 1 - 2, \; 2 - 2) = (1, -1, 0)$$
As you can see, their mean-centered vectors are identical. If we were to calculate the Pearson correlation (or the cosine similarity of these centered vectors), the result would be 1. This indicates a perfect linear correlation, capturing the fact that their preferences are structured identically, even though their raw scores differ. Pearson correlation successfully identified their similarity by removing their individual rating biases.
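You can confirm the result with NumPy's `np.corrcoef`, and contrast it with cosine similarity on the raw vectors, which is pulled below 1 by the offset between the two users' scales:

```python
import numpy as np

alice = np.array([5, 3, 4])
carol = np.array([3, 1, 2])

# Pearson correlation (the cosine of the mean-centered vectors)
pearson = np.corrcoef(alice, carol)[0, 1]
print(round(pearson, 3))  # 1.0

# Cosine similarity on the raw vectors misses the shared offset
cosine = np.dot(alice, carol) / (np.linalg.norm(alice) * np.linalg.norm(carol))
print(round(cosine, 3))   # 0.983
```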
Mean-centering aligns user ratings in exactly this way: before centering, Alice's and Carol's ratings sit at different points on the rating scale; after each user's average is subtracted, their preference vectors coincide, revealing the underlying similarity.
Your choice between these two metrics often depends on the nature of your data. When users apply the rating scale differently (some consistently harsh, some consistently generous), Pearson correlation's mean-centering removes those individual biases. When such biases are not a concern, cosine similarity is a simple and effective default.
In practice, you rarely need to implement these formulas from scratch. Libraries like scikit-learn provide efficient functions for computing similarity matrices. For example, you can compute a cosine similarity matrix for all items in your user-item matrix `R_items` with a single function call.
```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# R_items is a matrix where rows are items and columns are users.
# It could be a pandas DataFrame or a NumPy array; here, 0 marks a missing rating.
R_items = np.array([
    [5, 3, 0],  # Item 1 ratings by Users A, B, C
    [4, 0, 2],  # Item 2 ratings
    [0, 2, 5],  # Item 3 ratings
])

# Calculate the similarity between all pairs of items
item_similarity_matrix = cosine_similarity(R_items)
print(item_similarity_matrix)
# Approximately:
# [[1.    0.767 0.191]
#  [0.767 1.    0.415]
#  [0.191 0.415 1.   ]]
```
This matrix gives you the similarity score between every pair of items, forming the basis for finding the nearest neighbors.
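As one illustration of that next step, the sketch below picks out the top-k most similar items for a given item by sorting a row of the similarity matrix; the variable names and the choice of k are ours:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

R_items = np.array([
    [5, 3, 0],
    [4, 0, 2],
    [0, 2, 5],
])
sim = cosine_similarity(R_items)

k = 2
item_index = 0  # find neighbors of Item 1

# Sort neighbors by similarity (descending) and skip the item itself
order = np.argsort(sim[item_index])[::-1]
top_k = [i for i in order if i != item_index][:k]
print(top_k)  # [1, 2]: Item 2 is the nearest neighbor, then Item 3
```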
With these methods for quantifying similarity, you are now equipped to find the most relevant neighbors for any user or item. The next step is to use the information from these neighbors to generate concrete predictions.