When item descriptions are represented as numerical vectors through methods like TF-IDF, quantifying their similarity is essential. Within a vector space, similarity has a direct geometric interpretation: vectors that point in a similar direction represent items with similar content. Cosine similarity is a widely used and effective metric for measuring this relationship.
Imagine each item's TF-IDF vector as an arrow starting from the origin of a high-dimensional space. The similarity between two items can be determined by the angle between their respective arrows. If two vectors point in nearly the same direction, the angle between them is small, indicating high similarity. If they point in very different directions, the angle is large, indicating low similarity.
Cosine similarity captures this relationship by calculating the cosine of the angle between two vectors. It is not concerned with the magnitude (or length) of the vectors, only their orientation. This is an important property when working with text data. For example, a lengthy movie synopsis and a shorter one might both be about the same topic. Their TF-IDF vectors would have different magnitudes, but they would point in a similar direction in the vector space. By ignoring magnitude, cosine similarity correctly identifies them as similar.
Figure: Three item vectors plotted from the origin. The vectors for Item A and Item B are close together, producing a small angle and a high cosine similarity. The vector for Item C points in a different direction, indicating low similarity to both A and B.
The formula for cosine similarity between two non-zero vectors, $A$ and $B$, is defined as the dot product of the vectors divided by the product of their magnitudes:

$$\text{similarity}(A, B) = \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}$$

Let's break down its components:

- $A \cdot B$ is the dot product of the two vectors, the sum of the products of their corresponding components. It is largest when the vectors place high weights on the same dimensions, that is, when the documents share important terms.
- $\|A\|$ and $\|B\|$ are the magnitudes (Euclidean norms) of the vectors. Dividing by their product normalizes the score, which is exactly what makes the metric insensitive to vector length.

The resulting score ranges from -1 to 1. However, since TF-IDF vectors contain only non-negative values, the cosine similarity score will range from 0 to 1:

- A score of 1 means the vectors point in exactly the same direction: the items have essentially identical term profiles.
- A score of 0 means the vectors are orthogonal: the items share no terms at all.
- Scores in between indicate partial overlap, with higher values meaning greater similarity.
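To make this concrete, here is a minimal from-scratch sketch of the formula using NumPy. The vectors are made-up illustrative values, not real TF-IDF output, and the example also confirms the scaling property discussed above:

import numpy as np

def cosine_sim(a, b):
    # Dot product divided by the product of the Euclidean norms
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy non-negative vectors standing in for TF-IDF rows
a = np.array([0.0, 1.0, 2.0, 3.0])
b = np.array([0.0, 2.0, 4.0, 6.0])  # a scaled copy of a: same direction
c = np.array([5.0, 0.0, 0.0, 0.0])  # no overlap with a

print(cosine_sim(a, b))  # 1.0 -- magnitudes differ, direction is identical
print(cosine_sim(a, c))  # 0.0 -- orthogonal vectors, no shared terms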
While you can implement the formula from scratch using libraries like NumPy, scikit-learn provides an efficient and optimized function, cosine_similarity, which is the standard tool for this task. It takes a matrix of vectors as input and computes the similarity between all pairs of vectors.
Let's assume you have a TF-IDF matrix tfidf_matrix where each row represents a movie and each column represents a term from your vocabulary.
from sklearn.metrics.pairwise import cosine_similarity
# tfidf_matrix is the output from TfidfVectorizer
# Let's assume it has 3 movies (rows) and a vocabulary of 5 words (columns)
# tfidf_matrix.shape -> (3, 5)
# Compute the cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
print(cosine_sim)
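A small aside: passing a single argument is equivalent, because scikit-learn compares the matrix against itself when the second argument is omitted:

# Equivalent call: with one argument, every row is compared
# against every row of the same matrix
cosine_sim = cosine_similarity(tfidf_matrix)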
The output, cosine_sim, will be a square matrix where cosine_sim[i][j] is the cosine similarity score between item i and item j. The dimensions of this matrix will be (number_of_items, number_of_items).
Here is what a sample output might look like for three movies:
[[1. 0.75368321 0.10540926]
[0.75368321 1. 0.15811388]
[0.10540926 0.15811388 1. ]]
Notice a few properties of this matrix:
- Every value on the diagonal is 1, because each item is perfectly similar to itself.
- The matrix is symmetric: the similarity between item i and item j is the same in both directions (cosine_sim[0][1] is equal to cosine_sim[1][0]).

This similarity matrix is the engine of our content-based recommender. With it, we can take any given movie and instantly look up a ranked list of the most similar movies to recommend to a user, as sketched below. In the sections that follow, we will use this matrix to build user profiles and generate personalized recommendations.
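As a preview of that lookup, here is a minimal sketch. The titles list is a hypothetical placeholder mapping row indices to movie names:

import numpy as np

# Hypothetical titles, in the same row order as tfidf_matrix
titles = ["Movie A", "Movie B", "Movie C"]

def most_similar(index, cosine_sim, titles, top_n=2):
    # Rank all items for the given row from highest to lowest score
    ranked = np.argsort(cosine_sim[index])[::-1]
    # Drop the first entry: it is the item itself, with a score of 1
    return [(titles[i], cosine_sim[index][i]) for i in ranked[1:top_n + 1]]

print(most_similar(0, cosine_sim, titles))
# [('Movie B', 0.75368321), ('Movie C', 0.10540926)]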