While Singular Value Decomposition (SVD) provides a powerful way to factorize a matrix, applying its classical form directly to the sparse user-item interaction matrix found in recommendation systems is often impractical. The high number of missing values makes direct computation difficult and inefficient. Instead of finding an exact decomposition, we can learn the user and item factor matrices, $P$ and $Q$, through an iterative optimization process. Stochastic Gradient Descent (SGD) is a highly effective and scalable algorithm for this task.
Our goal is to find factor matrices $P$ and $Q$ that produce the best possible predictions. We define "best" as minimizing the difference between our predicted ratings ($\hat{r}_{ui}$) and the actual known ratings ($r_{ui}$). A common way to measure this difference is the squared error.
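The prediction itself is just the dot product of a user's factor vector and an item's factor vector. A minimal NumPy sketch (the dimensions and random initialization here are illustrative, not prescribed by the text):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

n_users, n_items, n_factors = 4, 5, 3  # toy sizes for illustration
P = rng.normal(scale=0.1, size=(n_users, n_factors))  # user factor matrix
Q = rng.normal(scale=0.1, size=(n_items, n_factors))  # item factor matrix

def predict(u, i):
    """Predicted rating: dot product of user u's and item i's factor vectors."""
    return float(P[u] @ Q[i])

print(predict(0, 2))  # predicted rating for user 0 and item 2
```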
To prevent the model from simply memorizing the training data (overfitting), we add a regularization term. This term penalizes large values in our factor vectors, encouraging the model to find simpler, more generalizable patterns. Combining these gives us our objective function, which we aim to minimize:

$$L = \sum_{(u,i) \in K} \left( r_{ui} - \mathbf{p}_u^\top \mathbf{q}_i \right)^2 + \lambda \left( \lVert \mathbf{p}_u \rVert^2 + \lVert \mathbf{q}_i \rVert^2 \right)$$

Here:

- $K$ is the set of all $(u, i)$ pairs for which a rating is known.
- $r_{ui}$ is the known rating that user $u$ gave to item $i$.
- $\mathbf{p}_u$ and $\mathbf{q}_i$ are the latent factor vectors for user $u$ and item $i$ (rows of $P$ and $Q$).
- $\lambda$ (lambda) is the regularization strength.

This equation represents the total regularized squared error. Our task is to find the values for all $\mathbf{p}_u$ and $\mathbf{q}_i$ that make $L$ as small as possible.
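The regularized squared error described above can be computed directly. A short sketch, with toy ratings and a regularization strength of 0.02 chosen purely for illustration:

```python
import numpy as np

# Toy data: (user, item, rating) triples and small random factor matrices.
ratings = [(0, 0, 4.0), (0, 1, 3.0), (1, 0, 5.0)]
rng = np.random.default_rng(seed=0)
P = rng.normal(scale=0.1, size=(2, 3))  # user factors
Q = rng.normal(scale=0.1, size=(2, 3))  # item factors
lam = 0.02                              # regularization strength (assumed value)

def objective(ratings, P, Q, lam):
    """Total regularized squared error over all known ratings."""
    total = 0.0
    for u, i, r in ratings:
        err = r - P[u] @ Q[i]                       # prediction error
        reg = lam * (P[u] @ P[u] + Q[i] @ Q[i])     # penalty on factor magnitudes
        total += err ** 2 + reg
    return total

print(objective(ratings, P, Q, lam))
```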
Gradient descent is an iterative optimization algorithm used to find the minimum of a function. The main idea is to start with an initial guess for our parameters (the elements of $P$ and $Q$) and repeatedly adjust them in the direction that most steeply decreases the error. We determine this direction by calculating the gradient, or the partial derivative, of the error function with respect to each parameter.
However, calculating the gradient using all known ratings in every step (batch gradient descent) would be extremely slow for large datasets. This is where the "stochastic" part of SGD becomes useful. Instead of using the entire dataset, SGD updates the parameters by considering just one training example at a time. For our recommendation model, a single training example is a single known rating, $r_{ui}$.
For each rating $r_{ui}$, we perform the following steps:

1. Predict the rating using the current factor vectors: $\hat{r}_{ui} = \mathbf{p}_u^\top \mathbf{q}_i$.
2. Compute the prediction error against the known rating.
3. Adjust both $\mathbf{p}_u$ and $\mathbf{q}_i$ in the direction that reduces this error.
The update rules, derived from the partial derivatives of our objective function, are straightforward. First, we calculate the prediction error:

$$e_{ui} = r_{ui} - \hat{r}_{ui} = r_{ui} - \mathbf{p}_u^\top \mathbf{q}_i$$
Then, we use this error to adjust the factor vectors for the user $u$ and item $i$:

$$\mathbf{p}_u \leftarrow \mathbf{p}_u + \alpha \left( e_{ui} \, \mathbf{q}_i - \lambda \, \mathbf{p}_u \right)$$

$$\mathbf{q}_i \leftarrow \mathbf{q}_i + \alpha \left( e_{ui} \, \mathbf{p}_u - \lambda \, \mathbf{q}_i \right)$$
The parameter $\alpha$ (alpha) is the learning rate, which controls the size of the step we take. A small learning rate leads to slow but stable convergence, while a large one can speed up learning but risks overshooting the minimum.
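A single update step translates almost line-for-line into NumPy. In this sketch the old vectors are read before either is overwritten, so both updates use the same error; the values of `alpha` and `lam` are illustrative:

```python
import numpy as np

def sgd_step(p_u, q_i, r_ui, alpha=0.01, lam=0.02):
    """One SGD update for a single known rating r_ui.

    Returns the updated copies of the user and item factor vectors.
    """
    e = r_ui - p_u @ q_i                        # prediction error e_ui
    p_new = p_u + alpha * (e * q_i - lam * p_u)  # update user factors
    q_new = q_i + alpha * (e * p_u - lam * q_i)  # update item factors
    return p_new, q_new

# One step should nudge the prediction toward the true rating.
p = np.array([0.1, 0.1])
q = np.array([0.1, 0.1])
p2, q2 = sgd_step(p, q, r_ui=1.0)
print(p @ q, "->", p2 @ q2)
```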
The complete algorithm involves initializing the factor matrices with small random values and then iterating through the dataset multiple times (epochs), updating the factors for each known rating.
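Putting the pieces together, the full training loop can be sketched as follows. This is a minimal NumPy implementation under assumed defaults (10 factors, $\alpha = 0.01$, $\lambda = 0.02$, 20 epochs); the function name `train_sgd` and the ratings visited in a fresh random order each epoch are implementation choices, not prescribed by the text:

```python
import numpy as np

def train_sgd(ratings, n_users, n_items, n_factors=10,
              alpha=0.01, lam=0.02, n_epochs=20, seed=0):
    """Train a matrix factorization model with SGD.

    ratings: list of (user, item, rating) triples for known ratings.
    Returns the learned factor matrices P (users) and Q (items).
    """
    rng = np.random.default_rng(seed)
    P = rng.normal(scale=0.1, size=(n_users, n_factors))
    Q = rng.normal(scale=0.1, size=(n_items, n_factors))
    for _ in range(n_epochs):
        # Visit the known ratings in a random order each epoch.
        for idx in rng.permutation(len(ratings)):
            u, i, r = ratings[idx]
            p_u, q_i = P[u].copy(), Q[i].copy()   # read before writing
            e = r - p_u @ q_i                      # prediction error
            P[u] = p_u + alpha * (e * q_i - lam * p_u)
            Q[i] = q_i + alpha * (e * p_u - lam * q_i)
    return P, Q

ratings = [(0, 0, 5.0), (0, 1, 1.0), (1, 0, 1.0), (1, 1, 5.0)]
P, Q = train_sgd(ratings, n_users=2, n_items=2, n_factors=2,
                 alpha=0.05, n_epochs=500)
print(P @ Q.T)  # predicted rating matrix, close to the known ratings
```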
By repeating this process, the vectors in $P$ and $Q$ gradually shift from their random initial values to values that encode meaningful latent features, minimizing the overall prediction error on the training data.
*Figure: The iterative update process in SGD for a single user-item rating. The error between the predicted and actual rating is used to adjust both the user's and the item's latent factor vectors.*
The performance of an SGD-trained model depends heavily on its hyperparameters. The two most important ones in our model are:

- The learning rate $\alpha$, which controls the size of each update step.
- The regularization strength $\lambda$, which controls how strongly large factor values are penalized.
Finding the right combination of these hyperparameters usually requires experimentation and techniques like grid search or random search, which we will examine when we evaluate our models. By mastering SGD, you can train matrix factorization models that are both scalable and highly effective at revealing the hidden preferences of users.