When item descriptions are represented as numerical vectors through methods like TF-IDF, quantifying their similarity is essential. Within a vector space, similarity has a direct geometric interpretation: vectors that point in a similar direction represent items with similar content. Cosine similarity is a widely used and effective metric for measuring this relationship.
Imagine each item's TF-IDF vector as an arrow starting from the origin of a high-dimensional space. The similarity between two items can be determined by the angle between their respective arrows. If two vectors point in nearly the same direction, the angle between them is small, indicating high similarity. If they point in very different directions, the angle is large, indicating low similarity.
Cosine similarity captures this relationship by calculating the cosine of the angle between two vectors. It is not concerned with the magnitude (or length) of the vectors, only their orientation. This is an important property when working with text data. For example, a lengthy movie synopsis and a shorter one might both be about the same topic. Their TF-IDF vectors would have different magnitudes, but they would point in a similar direction in the vector space. By ignoring magnitude, cosine similarity correctly identifies them as similar.
Figure: Three item vectors plotted from the origin. The vectors for Item A and Item B are close together, producing a small angle and a high cosine similarity. The vector for Item C points in a different direction, indicating low similarity to both A and B.
The formula for cosine similarity between two non-zero vectors, $A$ and $B$, is defined as the dot product of the vectors divided by the product of their magnitudes:

$$\text{similarity}(A, B) = \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}$$

Let's break down its components:

- $A \cdot B$ is the dot product of the two vectors, the sum of the products of their corresponding components. It is largest when the vectors place high weights on the same dimensions, that is, when the documents share important terms.
- $\|A\|$ and $\|B\|$ are the magnitudes (Euclidean norms) of the vectors. Dividing by their product normalizes the score, which is exactly what makes the metric insensitive to vector length.

The resulting score ranges from -1 to 1. However, since TF-IDF vectors contain only non-negative values, the cosine similarity score will range from 0 to 1:

- A score of 1 means the vectors point in exactly the same direction: the items have essentially identical term profiles.
- A score of 0 means the vectors are orthogonal: the items share no terms at all.
- Scores in between indicate partial overlap, with higher values meaning greater similarity.
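To make this concrete, here is a minimal from-scratch sketch of the formula using NumPy. The vectors are made-up illustrative values, not real TF-IDF output, and the example also confirms the scaling property discussed above:

import numpy as np

def cosine_sim(a, b):
    # Dot product divided by the product of the Euclidean norms
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy non-negative vectors standing in for TF-IDF rows
a = np.array([0.0, 1.0, 2.0, 3.0])
b = np.array([0.0, 2.0, 4.0, 6.0])  # a scaled copy of a: same direction
c = np.array([5.0, 0.0, 0.0, 0.0])  # no overlap with a

print(cosine_sim(a, b))  # 1.0 -- magnitudes differ, direction is identical
print(cosine_sim(a, c))  # 0.0 -- orthogonal vectors, no shared terms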
While you can implement the formula from scratch using libraries like NumPy, scikit-learn provides an efficient and optimized function, cosine_similarity, which is the standard tool for this task. It takes a matrix of vectors as input and computes the similarity between all pairs of vectors.
Let's assume you have a TF-IDF matrix tfidf_matrix where each row represents a movie and each column represents a term from your vocabulary.
from sklearn.metrics.pairwise import cosine_similarity
# tfidf_matrix is the output from TfidfVectorizer
# Let's assume it has 3 movies (rows) and a vocabulary of 5 words (columns)
# tfidf_matrix.shape -> (3, 5)
# Compute the cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
print(cosine_sim)
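A small aside: passing a single argument is equivalent, because scikit-learn compares the matrix against itself when the second argument is omitted:

# Equivalent call: with one argument, every row is compared
# against every row of the same matrix
cosine_sim = cosine_similarity(tfidf_matrix)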
The output, cosine_sim, will be a square matrix where cosine_sim[i][j] is the cosine similarity score between item i and item j. The dimensions of this matrix will be (number_of_items, number_of_items).
Here is what a sample output might look like for three movies:
[[1. 0.75368321 0.10540926]
[0.75368321 1. 0.15811388]
[0.10540926 0.15811388 1. ]]
Notice a few properties of this matrix:
- Every value on the diagonal is 1, because each item is perfectly similar to itself.
- The matrix is symmetric: the similarity between item i and item j is the same in both directions (cosine_sim[0][1] is equal to cosine_sim[1][0]).

This similarity matrix is the engine of our content-based recommender. With it, we can take any given movie and instantly look up a ranked list of the most similar movies to recommend to a user, as sketched below. In the sections that follow, we will use this matrix to build user profiles and generate personalized recommendations.
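As a preview of that lookup, here is a minimal sketch. The titles list is a hypothetical placeholder mapping row indices to movie names:

import numpy as np

# Hypothetical titles, in the same row order as tfidf_matrix
titles = ["Movie A", "Movie B", "Movie C"]

def most_similar(index, cosine_sim, titles, top_n=2):
    # Rank all items for the given row from highest to lowest score
    ranked = np.argsort(cosine_sim[index])[::-1]
    # Drop the first entry: it is the item itself, with a score of 1
    return [(titles[i], cosine_sim[index][i]) for i in ranked[1:top_n + 1]]

print(most_similar(0, cosine_sim, titles))
# [('Movie B', 0.75368321), ('Movie C', 0.10540926)]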