Implementing matrix factorization algorithms from first principles provides a strong understanding of their internal mechanisms. For practical applications, however, using specialized libraries often brings significant advantages. These libraries offer optimized, thoroughly tested implementations of common algorithms, allowing you to focus on model design and evaluation rather than low-level implementation details.
For building recommendation systems in Python, surprise is a widely used and effective library. Its name is an acronym for Simple Python RecommendatIon System Engine. It offers a collection of ready-to-use prediction algorithms and evaluation tools, all accessible through an interface that will feel familiar to anyone who has worked with scikit-learn.
The surprise library streamlines the process of training and testing recommender models. It has its own data structures, but it integrates smoothly with the pandas library you're already familiar with. The typical workflow involves three primary components: the Reader, the Dataset, and the algorithm itself.
Reader: This object is used to parse a file or a DataFrame. Its main purpose is to define the rating scale of your dataset. For example, if user ratings are on a scale of 1 to 5 stars, you would specify that here.

Dataset: This is the central data structure in surprise. It takes your raw data and, with the help of a Reader, converts it into a format that the library's algorithms can work with.

Algorithm: The prediction algorithm you want to train, such as SVD, KNNBasic, or NMF.

Let's see how to load data from a pandas DataFrame into a surprise dataset. Assume you have a DataFrame named ratings_df with the columns userID, itemID, and rating.
import pandas as pd
from surprise import Dataset, Reader
# Sample ratings DataFrame
data = {'userID': [1, 1, 2, 2, 3, 3],
        'itemID': [101, 102, 101, 103, 102, 104],
        'rating': [5, 3, 4, 2, 5, 4]}
ratings_df = pd.DataFrame(data)
# 1. Initialize a Reader with the rating scale
# Our ratings are from 1 to 5
reader = Reader(rating_scale=(1, 5))
# 2. Load the data from the DataFrame into a Dataset object
# The columns must be in the order: user, item, rating
data = Dataset.load_from_df(ratings_df[['userID', 'itemID', 'rating']], reader)
With just a few lines of code, your data is now ready for training.
In surprise, the SVD class implements the matrix factorization technique we discussed earlier. It is not the "pure" Singular Value Decomposition from linear algebra but a model-based algorithm optimized using Stochastic Gradient Descent (SGD) to find the latent factor matrices P and Q.
When you instantiate the SVD model, you can configure its hyperparameters, which directly correspond to the concepts from our previous sections.
n_factors: The number of latent factors in the user and item vectors.

n_epochs: The number of times the SGD optimizer will iterate over the entire training dataset.

lr_all: The learning rate used for all parameters.

reg_all: The regularization strength applied to all parameters.

Here is a diagram showing the typical data flow within the surprise library for training a model.
The process begins with raw data in a DataFrame, which is converted into a surprise Dataset. This dataset is then used to build a trainset, on which an algorithm like SVD is fitted. Once trained, the model can make predictions.
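To make those hyperparameters concrete, here is a minimal NumPy sketch of the kind of SGD update surprise's SVD performs internally. This is not surprise's actual implementation: the toy data and variable names are our own, and the bias terms are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(42)
n_users, n_items, n_factors = 3, 4, 2   # tiny toy problem
lr, reg = 0.05, 0.02                    # play the roles of lr_all and reg_all

P = rng.normal(0, 0.1, (n_users, n_factors))  # user latent factors
Q = rng.normal(0, 0.1, (n_items, n_factors))  # item latent factors

# (user index, item index, rating) triples
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (2, 2, 5.0)]

def sse():
    """Sum of squared prediction errors over the training ratings."""
    return sum((r - P[u] @ Q[i]) ** 2 for u, i, r in ratings)

before = sse()
for epoch in range(100):                # plays the role of n_epochs
    for u, i, r in ratings:
        err = r - P[u] @ Q[i]           # prediction error for this rating
        # Gradient steps with L2 regularization on both factor vectors
        P[u] += lr * (err * Q[i] - reg * P[u])
        Q[i] += lr * (err * P[u] - reg * Q[i])
after = sse()

print(after < before)  # the training error decreases
```

The nested loop is exactly what n_epochs controls, and lr and reg are applied to every parameter, mirroring lr_all and reg_all.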
Once your data is loaded, you can build a "trainset" from it. A trainset is the data structure that surprise algorithms are trained on. Then, you instantiate your SVD model and call its fit() method.
from surprise import SVD
# Build a trainset from the entire dataset
trainset = data.build_full_trainset()
# Instantiate the SVD algorithm
# We will use 50 factors, 20 epochs, and default learning/regularization
algo = SVD(n_factors=50, n_epochs=20, random_state=42)
# Train the model on the trainset
algo.fit(trainset)
The model is now trained. The algo object contains the learned latent factors for all users and items.
After fitting the model, you can predict the rating for any user-item pair using the predict() method. The estimated rating is computed from the dot product of the user's latent factor vector, p_u, and the item's latent factor vector, q_i (by default, surprise's SVD also adds the global mean and the user and item bias terms).
Let's predict the rating for user 3 and item 101.
# Predict a rating for a user and item
prediction = algo.predict(uid=3, iid=101)
# Print the prediction details
print(f"User ID: {prediction.uid}")
print(f"Item ID: {prediction.iid}")
print(f"Estimated Rating: {prediction.est:.4f}")
The output would look something like this:
User ID: 3
Item ID: 101
Estimated Rating: 4.2315
The predict() method returns a Prediction object, which contains several useful pieces of information, including the user ID, item ID, the original rating (if available in the dataset), and the estimated rating est. This estimated rating is the model's best guess for how the user would rate the item.
By using a library like surprise, we've managed to load data, train a sophisticated matrix factorization model, and generate a prediction in just a handful of lines of code. This abstracts away the complexities of implementing the SGD optimization loop and regularization, allowing us to move quickly to building and testing a complete recommender.
In the next section, we will expand on this foundation to generate a ranked list of top-N recommendations for a user, which is the ultimate goal of most recommendation systems.