Generating personalized recommendations brings together vectorized item profiles and aggregated user profiles. The goal is to match a user's inferred preferences, encapsulated in their profile vector, against the entire catalog of item vectors, identifying and ranking the items most aligned with the user's historical tastes.
The objective is to calculate a similarity score between a user's profile vector and every item's profile vector. This score quantifies how well each item matches the user's learned preferences. We can then rank items by this score to produce a personalized list.
The procedure can be broken down into four main steps:

1. Compute a similarity score between the user's profile vector and every item vector.
2. Pair each score with its item title in a DataFrame.
3. Filter out items the user has already seen.
4. Sort by score and select the top N items.
The following diagram illustrates this data flow, from user and item profiles to a final, ranked list of recommendations.
This workflow takes a user's preference vector and the complete item matrix as input, ultimately producing a short, ranked list of new items for that user.
Since both the user profile and item profiles exist in the same vector space, we can reuse cosine similarity to measure their alignment. A high cosine similarity score between a user's profile and an item's profile indicates that the item's features strongly match the features of items the user has previously liked.
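Before applying this at catalog scale, the computation itself can be illustrated on a pair of toy vectors. The numbers below are made up purely for illustration; cosine similarity is just the dot product divided by the product of the vector norms:

```python
import numpy as np

# Hypothetical toy vectors in a shared 3-feature space
user_profile = np.array([0.8, 0.1, 0.6])
item_vector = np.array([0.7, 0.0, 0.7])

# Cosine similarity: dot product divided by the product of the norms
cos_sim = np.dot(user_profile, item_vector) / (
    np.linalg.norm(user_profile) * np.linalg.norm(item_vector)
)
print(round(cos_sim, 3))
# 0.985
```

A score close to 1 means the two vectors point in nearly the same direction, i.e. the item's feature weights closely mirror the user's preferences.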
Let's assume you have a user's profile vector, user_profile, and the TF-IDF matrix of all items, tfidf_matrix. We can use scikit-learn's cosine_similarity function to perform this calculation efficiently. The function expects two array-like inputs, so we reshape the user profile vector to be a 2D array with a single row.
```python
from sklearn.metrics.pairwise import cosine_similarity

# Assume user_profile is a 1D NumPy array representing the user's taste
# Assume tfidf_matrix is the (items x features) matrix from TF-IDF

# Reshape user_profile to (1, n_features) to make it compatible
user_profile_reshaped = user_profile.reshape(1, -1)

# Compute cosine similarity between the user and all items
user_item_scores = cosine_similarity(user_profile_reshaped, tfidf_matrix)

# The result is a 2D array, so we flatten it to a 1D array of scores
similarity_scores = user_item_scores.flatten()

print(similarity_scores.shape)
# (n_items,)
```
The resulting similarity_scores is an array where the element at index i is the similarity score between the user and the item at index i in our original dataset.
With the scores calculated, the next step is to transform this raw array into a usable, ranked list. We'll use pandas for this, as it simplifies the process of associating scores with item titles, filtering, and sorting.
First, let's create a DataFrame that contains item titles and their corresponding similarity scores.
```python
import pandas as pd

# Assume 'movies_df' is your original DataFrame with movie titles
recommendation_df = pd.DataFrame({
    'title': movies_df['title'],
    'score': similarity_scores
})
```
Next, we must filter out items the user has already seen. Let's say we have a list of titles called items_seen_by_user. We can use pandas' isin method to exclude them.
```python
# A list of movie titles the user has already rated positively
items_seen_by_user = ["The Dark Knight", "Inception", "The Prestige"]

# Filter out the seen items
unseen_recommendations = recommendation_df[
    ~recommendation_df['title'].isin(items_seen_by_user)
]
```
The ~ operator inverts the boolean mask, effectively selecting all rows where the title is not in the items_seen_by_user list.
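To see the mask inversion in isolation, here is a minimal, self-contained example (the titles are arbitrary placeholders):

```python
import pandas as pd

titles = pd.Series(["Inception", "Memento", "Heat"])

# isin produces a boolean mask; ~ flips each element
mask = titles.isin(["Inception"])
print(mask.tolist())     # [True, False, False]
print((~mask).tolist())  # [False, True, True]
```

Indexing a DataFrame with the inverted mask keeps exactly the rows where the original condition was False.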
Finally, we sort this filtered DataFrame by the score column in descending order and select the top N items.
```python
# Sort by score and get the top 10 recommendations
top_10_recommendations = unseen_recommendations.sort_values(
    by='score', ascending=False
).head(10)

print(top_10_recommendations)
```
The output would look something like this:
|  | title | score |
|---|---|---|
| 50 | The Dark Knight Rises | 0.954 |
| 27 | Batman Begins | 0.921 |
| 119 | Memento | 0.887 |
| 95 | Interstellar | 0.852 |
| ... | ... | ... |
This final, ranked list is the output of our content-based recommender. It presents the most relevant items to the user, based on a quantitative measure of similarity to their past preferences. This entire process, from feature extraction to generating a ranked list, forms the backbone of a complete content-based filtering system. In the upcoming hands-on practical, you will implement this workflow from start to finish.
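As a recap, the whole workflow can be condensed into a single function. This is a sketch under the same assumptions as above (a `user_profile` vector, an items-by-features `tfidf_matrix`, and a `movies_df` with a `title` column); the toy data at the bottom exists only to make the snippet runnable:

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

def recommend_top_n(user_profile, tfidf_matrix, movies_df, items_seen_by_user, n=10):
    """Score every item against the user profile and return the top-n unseen titles."""
    scores = cosine_similarity(user_profile.reshape(1, -1), tfidf_matrix).flatten()
    recommendation_df = pd.DataFrame({
        'title': movies_df['title'],
        'score': scores
    })
    unseen = recommendation_df[~recommendation_df['title'].isin(items_seen_by_user)]
    return unseen.sort_values(by='score', ascending=False).head(n)

# Toy data: three items described by two features (illustrative values only)
movies_df = pd.DataFrame({'title': ['The Dark Knight', 'Up', 'Batman Begins']})
tfidf_matrix = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
user_profile = np.array([1.0, 0.0])

top = recommend_top_n(user_profile, tfidf_matrix, movies_df, ['The Dark Knight'], n=2)
print(top)
```

With this toy input, "The Dark Knight" is excluded as already seen, and "Batman Begins" ranks first because its feature vector points almost exactly along the user's profile.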
sklearn.metrics.pairwise.cosine_similarity, scikit-learn developers, 2024 - Official documentation for the scikit-learn function used to compute cosine similarity, providing usage examples and technical details.