Before you can calculate any performance metrics, you must first prepare your data. Standard supervised learning workflows often rely on a function like scikit-learn's train_test_split to randomly shuffle and divide a dataset. For recommendation systems, however, this random approach is not just suboptimal; it can produce highly misleading results.
Recommendation is often a time-sensitive task. We want to predict what a user will like in the future based on their activity in the past. A random split violates this fundamental temporal order.
Imagine a user rates three movies over a week: Movie A on Monday, Movie B on Wednesday, and Movie C on Friday. A random split might place the ratings for Movie A and Movie C in the training set and the rating for Movie B in the test set. The model would then be trained on data from both before and after the event it is trying to predict. This "data leakage" from the future gives the model an unfair advantage that does not exist in a live environment, leading to inflated performance metrics and a false sense of confidence.
The most reliable way to create training and test sets for offline evaluation is to mimic the flow of time. The process, often called a temporal or time-based split, ensures you always train on past events to predict future ones.
The standard procedure is a user-wise split: for each user, the most recent interactions are held out for the test set, while their older interactions are used for training. When exactly the last interaction per user is held out for testing, this is known as a "leave-one-out" split, a common and effective method that we implement below.
For each user, interactions are sorted by time: the most recent interaction is placed in the test set, while all prior interactions form the training set.
Let's see how to implement this using pandas. We assume your data is in a DataFrame with user_id, item_id, and a timestamp column that indicates when the interaction occurred.
import pandas as pd

# Assume 'ratings_df' is a DataFrame with user_id, item_id, rating, and timestamp
# Example DataFrame:
data = {'user_id': [1, 1, 1, 2, 2, 3, 3, 3, 3, 4],
        'item_id': [101, 102, 103, 101, 104, 102, 103, 105, 106, 107],
        'rating': [5, 4, 3, 4, 5, 2, 5, 4, 3, 5],
        'timestamp': [978300760, 978300761, 978300762, 978300763, 978300764,
                      978300765, 978300766, 978300767, 978300768, 978300769]}
ratings_df = pd.DataFrame(data)

# Ensure the DataFrame is sorted correctly
ratings_df = ratings_df.sort_values(by=['user_id', 'timestamp'])

# Create empty lists to hold the data
train_data = []
test_data = []

# Group by user and split the data
for user_id, group in ratings_df.groupby('user_id'):
    # For a user to be in the test set, they must have more than one rating
    if len(group) > 1:
        # The last interaction goes into the test set
        test_data.append(group.iloc[-1:])
        # All other interactions go into the training set
        train_data.append(group.iloc[:-1])

# Concatenate the lists of DataFrames back into single DataFrames
train_df = pd.concat(train_data)
test_df = pd.concat(test_data)

print("Training Set Size:", len(train_df))
print("Test Set Size:", len(test_df))
In this code, we first sort all interactions by user and then by time. The groupby('user_id') operation allows us to process the timeline of each user independently. For each user with more than one interaction, we use iloc[-1:] to select their last interaction for the test set and iloc[:-1] to select all preceding interactions for the training set.
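Running this on the small example DataFrame produces a training set of 6 interactions and a test set of 3: users 1, 2, and 3 each contribute their most recent interaction to the test set, while user 4, who has only a single interaction, is skipped entirely (more on this below).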
This temporal splitting approach is solid, but there are a few details to manage.
Users with a Single Interaction: The code above deliberately skips users who have interacted with only one item. Because len(group) is not greater than 1, their single interaction is not added to either set in this implementation. This is usually acceptable, since you cannot both train on and test against a user who has only one data point: they exist in the historical data, but they cannot contribute a held-out interaction to the evaluation set.
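If you would rather not discard that data entirely, a small variation of the splitting loop (a sketch, reusing the variables defined earlier) keeps single interactions as training signal while still excluding those users from the test set:

for user_id, group in ratings_df.groupby('user_id'):
    if len(group) > 1:
        # Users with a history: hold out the most recent interaction for testing
        test_data.append(group.iloc[-1:])
        train_data.append(group.iloc[:-1])
    else:
        # Users with a single interaction: keep it as training signal only
        train_data.append(group)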
Ensuring Test Set Validity: Your goal is to measure how well the model recommends items to existing users, which means every user and every item in your test_df should ideally also be present in your train_df. The split above guarantees this for users, since anyone in the test set has at least one earlier interaction in the training set. Items are not guaranteed: if an item appears for the first time in the test set, your model may not be able to score it. This is the cold-start problem manifesting in your evaluation. For a standard evaluation of algorithms like collaborative filtering, you might add a step to filter the test set so that its users and items have all been seen during training.
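A minimal sketch of that filtering step, using the train_df and test_df built above:

# Identify the users and items the model will have seen during training
known_users = set(train_df['user_id'])
known_items = set(train_df['item_id'])

# Keep only test interactions whose user and item both appear in the training set
test_df = test_df[
    test_df['user_id'].isin(known_users) &
    test_df['item_id'].isin(known_items)
]

With the user-wise split above, the user filter is a no-op by construction, but the item filter can remove interactions with items that never appear in the training data.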
With your data correctly partitioned into a training set that respects the past and a test set that represents the future, you are now ready to train your models and use the metrics we will cover next to see how well they perform.