Generating personalized recommendations brings together vectorized item profiles and aggregated user profiles. The goal is to match a user's inferred preferences, encapsulated in their profile vector, against the entire catalog of item vectors, identifying and ranking the items most aligned with the user's historical tastes.
The objective is to calculate a similarity score between a user's profile vector and every item's profile vector. This score quantifies how well each item matches the user's learned preferences. We can then rank items by this score to produce a personalized list.
The procedure can be broken down into four main steps:

1. Compute a similarity score between the user's profile vector and every item vector.
2. Pair each score with its item title in a DataFrame.
3. Filter out items the user has already seen.
4. Sort by score and select the top N items.
The following diagram illustrates this data flow, from user and item profiles to a final, ranked list of recommendations.
This workflow takes a user's preference vector and the complete item matrix as input, ultimately producing a short, ranked list of new items for that user.
Since both the user profile and item profiles exist in the same vector space, we can reuse cosine similarity to measure their alignment. A high cosine similarity score between a user's profile and an item's profile indicates that the item's features strongly match the features of items the user has previously liked.
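Before applying this at catalog scale, the computation itself can be illustrated on a pair of toy vectors. The numbers below are made up purely for illustration; cosine similarity is just the dot product divided by the product of the vector norms:

```python
import numpy as np

# Hypothetical toy vectors in a shared 3-feature space
user_profile = np.array([0.8, 0.1, 0.6])
item_vector = np.array([0.7, 0.0, 0.7])

# Cosine similarity: dot product divided by the product of the norms
cos_sim = np.dot(user_profile, item_vector) / (
    np.linalg.norm(user_profile) * np.linalg.norm(item_vector)
)
print(round(cos_sim, 3))
# 0.985
```

A score close to 1 means the two vectors point in nearly the same direction, i.e. the item's feature weights closely mirror the user's preferences.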
Let's assume you have a user's profile vector, user_profile, and the TF-IDF matrix of all items, tfidf_matrix. We can use scikit-learn's cosine_similarity function to perform this calculation efficiently. The function expects two array-like inputs, so we reshape the user profile vector to be a 2D array with a single row.
```python
from sklearn.metrics.pairwise import cosine_similarity

# Assume user_profile is a 1D NumPy array representing the user's taste
# Assume tfidf_matrix is the (items x features) matrix from TF-IDF

# Reshape user_profile to (1, n_features) to make it compatible
user_profile_reshaped = user_profile.reshape(1, -1)

# Compute cosine similarity between the user and all items
user_item_scores = cosine_similarity(user_profile_reshaped, tfidf_matrix)

# The result is a 2D array, so we flatten it to a 1D array of scores
similarity_scores = user_item_scores.flatten()

print(similarity_scores.shape)
# (n_items,)
```
The resulting similarity_scores is an array where the element at index i is the similarity score between the user and the item at index i in our original dataset.
With the scores calculated, the next step is to transform this raw array into a usable, ranked list. We'll use pandas for this, as it simplifies the process of associating scores with item titles, filtering, and sorting.
First, let's create a DataFrame that contains item titles and their corresponding similarity scores.
```python
import pandas as pd

# Assume 'movies_df' is your original DataFrame with movie titles
recommendation_df = pd.DataFrame({
    'title': movies_df['title'],
    'score': similarity_scores
})
```
Next, we must filter out items the user has already seen. Let's say we have a list of titles called items_seen_by_user. We can use pandas' isin method to exclude them.
```python
# A list of movie titles the user has already rated positively
items_seen_by_user = ["The Dark Knight", "Inception", "The Prestige"]

# Filter out the seen items
unseen_recommendations = recommendation_df[
    ~recommendation_df['title'].isin(items_seen_by_user)
]
```
The ~ operator inverts the boolean mask, effectively selecting all rows where the title is not in the items_seen_by_user list.
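To see the mask inversion in isolation, here is a minimal, self-contained example (the titles are arbitrary placeholders):

```python
import pandas as pd

titles = pd.Series(["Inception", "Memento", "Heat"])

# isin produces a boolean mask; ~ flips each element
mask = titles.isin(["Inception"])
print(mask.tolist())     # [True, False, False]
print((~mask).tolist())  # [False, True, True]
```

Indexing a DataFrame with the inverted mask keeps exactly the rows where the original condition was False.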
Finally, we sort this filtered DataFrame by the score column in descending order and select the top N items.
```python
# Sort by score and get the top 10 recommendations
top_10_recommendations = unseen_recommendations.sort_values(
    by='score', ascending=False
).head(10)

print(top_10_recommendations)
```
The output would look something like this:
|  | title | score |
|---|---|---|
| 50 | The Dark Knight Rises | 0.954 |
| 27 | Batman Begins | 0.921 |
| 119 | Memento | 0.887 |
| 95 | Interstellar | 0.852 |
| ... | ... | ... |
This final, ranked list is the output of our content-based recommender. It presents the most relevant items to the user, based on a quantitative measure of similarity to their past preferences. This entire process, from feature extraction to generating a ranked list, forms the backbone of a complete content-based filtering system. In the upcoming hands-on practical, you will implement this workflow from start to finish.
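As a recap, the whole workflow can be condensed into a single function. This is a sketch under the same assumptions as above (a `user_profile` vector, an items-by-features `tfidf_matrix`, and a `movies_df` with a `title` column); the toy data at the bottom exists only to make the snippet runnable:

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

def recommend_top_n(user_profile, tfidf_matrix, movies_df, items_seen_by_user, n=10):
    """Score every item against the user profile and return the top-n unseen titles."""
    scores = cosine_similarity(user_profile.reshape(1, -1), tfidf_matrix).flatten()
    recommendation_df = pd.DataFrame({
        'title': movies_df['title'],
        'score': scores
    })
    unseen = recommendation_df[~recommendation_df['title'].isin(items_seen_by_user)]
    return unseen.sort_values(by='score', ascending=False).head(n)

# Toy data: three items described by two features (illustrative values only)
movies_df = pd.DataFrame({'title': ['The Dark Knight', 'Up', 'Batman Begins']})
tfidf_matrix = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
user_profile = np.array([1.0, 0.0])

top = recommend_top_n(user_profile, tfidf_matrix, movies_df, ['The Dark Knight'], n=2)
print(top)
```

With this toy input, "The Dark Knight" is excluded as already seen, and "Batman Begins" ranks first because its feature vector points almost exactly along the user's profile.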
sklearn.metrics.pairwise.cosine_similarity, scikit-learn developers, 2024 - Official documentation for the scikit-learn function used to compute cosine similarity, providing usage examples and technical details.