To find users with similar tastes or items that are frequently enjoyed together, we must first organize our interaction data into a structure that allows for mathematical comparison. The standard way to do this in collaborative filtering is by constructing a user-item interaction matrix.
This matrix provides a complete picture of all user-item interactions within our dataset. By convention, each row represents a unique user, and each column represents a unique item. The value in the cell at the intersection of a user's row and an item's column, which we can denote as $r_{ui}$, represents the interaction between user $u$ and item $i$.
The value of $r_{ui}$ can be a recorded interaction, such as a rating, or null, to indicate that no interaction has occurred.
Most importantly, any given user will have interacted with only a very small fraction of the total available items. This means the majority of cells in the matrix will be empty. This property is known as sparsity, and it is a defining characteristic of recommendation datasets.
The diagram shows a sparse matrix where colored cells represent recorded ratings and gray cells with a question mark represent unknown interactions. The goal of collaborative filtering is to predict the values for these unknown cells.
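Sparsity can be quantified as the fraction of cells with no recorded interaction. The snippet below is a minimal sketch of that calculation; the small `ratings` array is an illustrative assumption, with NaN standing in for "no interaction".

```python
import numpy as np

# Illustrative 3-user x 5-item rating matrix (an assumption for this sketch).
# NaN marks a cell with no recorded interaction.
ratings = np.array([
    [5.0, np.nan, 2.0, np.nan, np.nan],
    [np.nan, 3.0, np.nan, np.nan, 4.0],
    [4.0, np.nan, np.nan, 1.0, np.nan],
])

n_observed = np.count_nonzero(~np.isnan(ratings))  # cells holding a rating
n_total = ratings.size                              # number of users x number of items
sparsity = 1.0 - n_observed / n_total               # fraction of empty cells

print(f"observed: {n_observed} of {n_total} cells")
print(f"sparsity: {sparsity:.2f}")
```

Even this toy matrix is more than half empty. Real recommendation datasets are typically far sparser, because catalogs hold many more items than any one user can interact with.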
In practice, your data will likely be in a "long" format, with each row representing a single interaction (e.g., user_id, item_id, rating). We can use the pandas library to transform this data into our desired user-item matrix.
Let's assume we have a DataFrame ratings_df with the following structure:
import pandas as pd
import numpy as np
data = {
    'user_id': ['User A', 'User A', 'User B', 'User B', 'User C', 'User C'],
    'item_id': ['Item 1', 'Item 3', 'Item 2', 'Item 5', 'Item 1', 'Item 4'],
    'rating': [5, 2, 3, 4, 4, 1]
}
ratings_df = pd.DataFrame(data)
print(ratings_df)
Output:
  user_id item_id  rating
0  User A  Item 1       5
1  User A  Item 3       2
2  User B  Item 2       3
3  User B  Item 5       4
4  User C  Item 1       4
5  User C  Item 4       1
We can pivot this DataFrame to create the user-item matrix, where the index is user_id, columns are item_id, and the values are the rating. Missing interactions will automatically be filled with a null value like NaN.
user_item_matrix = ratings_df.pivot_table(
    index='user_id',
    columns='item_id',
    values='rating'
)
print(user_item_matrix)
Output:
item_id  Item 1  Item 2  Item 3  Item 4  Item 5
user_id
User A      5.0     NaN     2.0     NaN     NaN
User B      NaN     3.0     NaN     NaN     4.0
User C      4.0     NaN     NaN     1.0     NaN
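One detail worth knowing when pivoting: if the long-format data contains more than one row for the same user and item, `pivot_table` combines them, using the mean by default. The sketch below uses a hypothetical `dup_df` in which User A rated Item 1 twice. Keeping the last rating is shown as one possible alternative, and it assumes the rows are already in chronological order.

```python
import pandas as pd

# Hypothetical log in which User A rated Item 1 twice.
dup_df = pd.DataFrame({
    "user_id": ["User A", "User A", "User B"],
    "item_id": ["Item 1", "Item 1", "Item 2"],
    "rating": [5, 3, 4],
})

# Default behaviour: duplicate (user, item) pairs are averaged.
mean_matrix = dup_df.pivot_table(
    index="user_id", columns="item_id", values="rating"
)

# Alternative: keep the last rating per pair (assumes chronological row order).
latest_matrix = dup_df.pivot_table(
    index="user_id", columns="item_id", values="rating", aggfunc="last"
)

print(mean_matrix)
print(latest_matrix)
```

Which aggregation is appropriate depends on what a repeated interaction means in your data, so it is worth checking for duplicates before pivoting.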
This resulting DataFrame is our user-item interaction matrix. The NaN values explicitly show the sparsity we discussed. Because these matrices can become enormous for real datasets, specialized libraries often use efficient data structures like sparse matrices from SciPy to store them without consuming excessive memory. For our purposes, the pandas DataFrame is perfectly suitable for learning and implementation.
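For reference, here is a minimal sketch of how the same long-format data could be stored as a SciPy sparse matrix instead. It rebuilds the example `ratings_df` so the block runs on its own; the variable names `user_codes`, `item_codes`, and `sparse_matrix` are illustrative choices, not part of any fixed API.

```python
import pandas as pd
from scipy.sparse import csr_matrix

ratings_df = pd.DataFrame({
    "user_id": ["User A", "User A", "User B", "User B", "User C", "User C"],
    "item_id": ["Item 1", "Item 3", "Item 2", "Item 5", "Item 1", "Item 4"],
    "rating": [5, 2, 3, 4, 4, 1],
})

# Map each user and item label to an integer row / column position.
user_codes, user_labels = pd.factorize(ratings_df["user_id"], sort=True)
item_codes, item_labels = pd.factorize(ratings_df["item_id"], sort=True)

# Only the observed ratings are stored; every other cell is an implicit zero.
sparse_matrix = csr_matrix(
    (ratings_df["rating"].to_numpy(dtype=float), (user_codes, item_codes)),
    shape=(len(user_labels), len(item_labels)),
)

print(sparse_matrix.shape)  # (number of users, number of items)
print(sparse_matrix.nnz)    # number of stored ratings
```

Two caveats apply to this representation: unobserved cells read back as 0 rather than NaN, so a zero must not be mistaken for a rating of zero, and duplicate (user, item) pairs are summed when the matrix is built.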
Once we have this matrix, we can view each row as a user vector and each column as an item vector.
The vector for User A is [5.0, NaN, 2.0, NaN, NaN].
The vector for Item 1 is [5.0, NaN, 4.0].
This vector representation is powerful because it allows us to use mathematical measures of distance or similarity. For instance, we can calculate the similarity between the vectors of User A and User C to determine if they have similar tastes. This is the core operation we will explore in the upcoming sections on finding "neighbors." By converting abstract user behavior into a concrete matrix, we have established the foundation for our collaborative filtering algorithms.
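As a preview of what comparing two user vectors can look like, the sketch below computes cosine similarity between rows of the matrix. Two things here are assumptions made for illustration: the `cosine_similarity` helper is written by hand, and missing ratings are replaced with 0 before comparison. That replacement is a common simplification but not the only option, and later sections may handle missing values differently.

```python
import numpy as np
import pandas as pd

ratings_df = pd.DataFrame({
    "user_id": ["User A", "User A", "User B", "User B", "User C", "User C"],
    "item_id": ["Item 1", "Item 3", "Item 2", "Item 5", "Item 1", "Item 4"],
    "rating": [5, 2, 3, 4, 4, 1],
})
user_item_matrix = ratings_df.pivot_table(
    index="user_id", columns="item_id", values="rating"
)


def cosine_similarity(u, v):
    """Cosine of the angle between two 1-D arrays; 0.0 if either is all zeros."""
    norm = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / norm) if norm > 0 else 0.0


# Assumption: treat "no interaction" as 0 so the vectors can be compared.
filled = user_item_matrix.fillna(0.0)

sim_a_c = cosine_similarity(
    filled.loc["User A"].to_numpy(), filled.loc["User C"].to_numpy()
)
sim_a_b = cosine_similarity(
    filled.loc["User A"].to_numpy(), filled.loc["User B"].to_numpy()
)

print(f"User A vs User C: {sim_a_c:.2f}")
print(f"User A vs User B: {sim_a_b:.2f}")
```

User A and User C both rated Item 1 highly, so their similarity is high. User A and User B share no rated items, so under this zero-fill assumption their similarity is 0.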
© 2026 ApX Machine Learning