To build effective recommendation systems, quantifying the relationships between users or items is essential. Similarity metrics provide a formal method to measure how alike they are. These functions compare two vectors, representing two users or two items, and produce a score indicating their degree of similarity. The selection of a metric represents a design decision that can significantly influence a recommender's performance.
We will focus on two of the most widely used similarity metrics in collaborative filtering: Cosine Similarity and Pearson Correlation.
Cosine similarity measures the cosine of the angle between two non-zero vectors. In the context of recommendation systems, it evaluates the orientation of two users' or items' rating vectors rather than their magnitude. This is particularly useful because it captures the pattern of preferences while ignoring differences in overall rating magnitude. For example, one user might rate three movies (4, 2, 4) while another rates the same movies (2, 1, 2); the second vector is exactly half the first, so the two point in the same direction and cosine similarity judges their tastes identical.
The formula for cosine similarity between two vectors $A$ and $B$ is:

$$\text{sim}(A, B) = \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}$$
The score ranges from -1 (exactly opposite) to 1 (exactly the same direction), with 0 indicating orthogonality (no similarity). In most user-item rating scenarios, where ratings are non-negative, the score will range from 0 to 1.
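A quick NumPy check of that scale-invariance, using the illustrative vectors from above:

```python
import numpy as np

u = np.array([4, 2, 4])
v = np.array([2, 1, 2])  # exactly half of u: same direction, smaller magnitude

cosine = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
print(round(cosine, 3))  # 1.0
```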
Let's consider a simple user-based example with two users, Alice and Bob, who have both rated three movies.
| Movie | Alice's Rating | Bob's Rating |
|---|---|---|
| Movie 1 | 5 | 4 |
| Movie 2 | 3 | 2 |
| Movie 3 | 4 | 3 |
Their rating vectors are $A = (5, 3, 4)$ and $B = (4, 2, 3)$. To calculate the cosine similarity:
1. Calculate the dot product ($A \cdot B$):
$$A \cdot B = (5)(4) + (3)(2) + (4)(3) = 20 + 6 + 12 = 38$$
2. Calculate the magnitude of each vector ($\|A\|$ and $\|B\|$):
$$\|A\| = \sqrt{5^2 + 3^2 + 4^2} = \sqrt{50} \approx 7.071$$
$$\|B\| = \sqrt{4^2 + 2^2 + 3^2} = \sqrt{29} \approx 5.385$$
3. Compute the similarity:
$$\text{sim}(A, B) = \frac{38}{7.071 \times 5.385} \approx \frac{38}{38.079} \approx 0.998$$
A score of approximately 0.998 indicates that Alice and Bob have very similar taste in movies, as their rating vectors point in almost the same direction.
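To make the arithmetic concrete, here is a minimal NumPy sketch of the same calculation (the variable names are ours):

```python
import numpy as np

alice = np.array([5, 3, 4])
bob = np.array([4, 2, 3])

dot = np.dot(alice, bob)            # 5*4 + 3*2 + 4*3 = 38
norm_alice = np.linalg.norm(alice)  # sqrt(50) ≈ 7.071
norm_bob = np.linalg.norm(bob)      # sqrt(29) ≈ 5.385

similarity = dot / (norm_alice * norm_bob)
print(round(similarity, 3))         # 0.998
```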
While cosine similarity is effective, it has a limitation: it doesn't account for user-specific rating biases. For example, if User A rates items [1, 2, 3] and User B rates them [3, 4, 5], their preferences have a perfect linear relationship, but because the two vectors are offset from one another, their cosine similarity is only about 0.98 rather than a perfect 1.
The Pearson correlation coefficient solves this by first mean-centering the data. It measures the extent to which two variables are linearly related: in our case, the ratings given by two users or received by two items. It is effectively a mean-centered version of cosine similarity.
The formula for Pearson correlation between two users $u$ and $v$ is:

$$\text{sim}(u, v) = \frac{\sum_{i \in I} (r_{u,i} - \bar{r}_u)(r_{v,i} - \bar{r}_v)}{\sqrt{\sum_{i \in I} (r_{u,i} - \bar{r}_u)^2} \, \sqrt{\sum_{i \in I} (r_{v,i} - \bar{r}_v)^2}}$$

Where:
- $I$ is the set of items rated by both users
- $r_{u,i}$ is user $u$'s rating of item $i$
- $\bar{r}_u$ is user $u$'s average rating over the items in $I$
The score ranges from -1 to 1, where 1 indicates a perfect positive linear relationship, -1 indicates a perfect negative relationship, and 0 indicates no linear relationship.
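Before working through an example by hand, here is one way the formula might be translated into code. The function name `pearson_sim` is our own, and the sketch assumes both vectors already contain only the co-rated items:

```python
import numpy as np

def pearson_sim(u, v):
    """Pearson correlation between two rating vectors over co-rated items."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    u_centered = u - u.mean()  # remove user u's rating bias
    v_centered = v - v.mean()  # remove user v's rating bias
    denom = np.linalg.norm(u_centered) * np.linalg.norm(v_centered)
    if denom == 0:  # a user who gave every item the same rating
        return 0.0
    return float(np.dot(u_centered, v_centered) / denom)
```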
Let's revisit our example with a new user, Carol, who is a harsher critic.
| Movie | Alice's Rating | Carol's Rating |
|---|---|---|
| Movie 1 | 5 | 3 |
| Movie 2 | 3 | 1 |
| Movie 3 | 4 | 2 |
Calculate average ratings:
$$\bar{r}_{\text{Alice}} = \frac{5 + 3 + 4}{3} = 4, \qquad \bar{r}_{\text{Carol}} = \frac{3 + 1 + 2}{3} = 2$$
Create mean-centered rating vectors:
$$\text{Alice}: (5 - 4, \; 3 - 4, \; 4 - 4) = (1, -1, 0)$$
$$\text{Carol}: (3 - 2, \; 1 - 2, \; 2 - 2) = (1, -1, 0)$$
As you can see, their mean-centered vectors are identical. If we were to calculate the Pearson correlation (or the cosine similarity of these centered vectors), the result would be 1. This indicates a perfect linear correlation, capturing the fact that their preferences are structured identically, even though their raw scores differ. Pearson correlation successfully identified their similarity by removing their individual rating biases.
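You can confirm the result with NumPy's `np.corrcoef`, and contrast it with cosine similarity on the raw vectors, which is pulled below 1 by the offset between the two users' scales:

```python
import numpy as np

alice = np.array([5, 3, 4])
carol = np.array([3, 1, 2])

# Pearson correlation (the cosine of the mean-centered vectors)
pearson = np.corrcoef(alice, carol)[0, 1]
print(round(pearson, 3))  # 1.0

# Cosine similarity on the raw vectors misses the shared offset
cosine = np.dot(alice, carol) / (np.linalg.norm(alice) * np.linalg.norm(carol))
print(round(cosine, 3))   # 0.983
```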
Mean-centering aligns user ratings in exactly this way: before centering, Alice's and Carol's ratings sit at different points on the rating scale; after each user's average is subtracted, their preference vectors coincide, revealing the underlying similarity.
Your choice between these two metrics often depends on the nature of your data. When users apply the rating scale differently (some consistently harsh, some consistently generous), Pearson correlation's mean-centering removes those individual biases. When such biases are not a concern, cosine similarity is a simple and effective default.
In practice, you rarely need to implement these formulas from scratch. Libraries like scikit-learn provide efficient functions for computing similarity matrices. For example, you can compute a cosine similarity matrix for all items in your user-item matrix `R_items` with a single function call.
```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# R_items is a matrix where rows are items and columns are users.
# It could be a pandas DataFrame or a NumPy array; here, 0 marks a missing rating.
R_items = np.array([
    [5, 3, 0],  # Item 1 ratings by Users A, B, C
    [4, 0, 2],  # Item 2 ratings
    [0, 2, 5],  # Item 3 ratings
])

# Calculate the similarity between all pairs of items
item_similarity_matrix = cosine_similarity(R_items)
print(item_similarity_matrix)
# Approximately:
# [[1.    0.767 0.191]
#  [0.767 1.    0.415]
#  [0.191 0.415 1.   ]]
```
This matrix gives you the similarity score between every pair of items, forming the basis for finding the nearest neighbors.
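As one illustration of that next step, the sketch below picks out the top-k most similar items for a given item by sorting a row of the similarity matrix; the variable names and the choice of k are ours:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

R_items = np.array([
    [5, 3, 0],
    [4, 0, 2],
    [0, 2, 5],
])
sim = cosine_similarity(R_items)

k = 2
item_index = 0  # find neighbors of Item 1

# Sort neighbors by similarity (descending) and skip the item itself
order = np.argsort(sim[item_index])[::-1]
top_k = [i for i in order if i != item_index][:k]
print(top_k)  # [1, 2]: Item 2 is the nearest neighbor, then Item 3
```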
With these methods for quantifying similarity, you are now equipped to find the most relevant neighbors for any user or item. The next step is to use the information from these neighbors to generate concrete predictions.