While Singular Value Decomposition (SVD) provides a powerful way to factorize a matrix, applying its classical form directly to the sparse user-item interaction matrix found in recommendation systems is often impractical. The high number of missing values makes direct computation difficult and inefficient. Instead of finding an exact decomposition, we can learn the user and item factor matrices, $P$ and $Q$, through an iterative optimization process. Stochastic Gradient Descent (SGD) is a highly effective and scalable algorithm for this task.
Our goal is to find factor matrices $P$ and $Q$ that produce the best possible predictions. We define "best" as minimizing the difference between our predicted ratings ($\hat{r}_{ui}$) and the actual known ratings ($r_{ui}$). A common way to measure this difference is the squared error.
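The prediction itself is just the dot product of a user's factor vector and an item's factor vector. A minimal NumPy sketch (the dimensions and random initialization here are illustrative, not prescribed by the text):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

n_users, n_items, n_factors = 4, 5, 3  # toy sizes for illustration
P = rng.normal(scale=0.1, size=(n_users, n_factors))  # user factor matrix
Q = rng.normal(scale=0.1, size=(n_items, n_factors))  # item factor matrix

def predict(u, i):
    """Predicted rating: dot product of user u's and item i's factor vectors."""
    return float(P[u] @ Q[i])

print(predict(0, 2))  # predicted rating for user 0 and item 2
```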
To prevent the model from simply memorizing the training data (overfitting), we add a regularization term. This term penalizes large values in our factor vectors, encouraging the model to find simpler, more generalizable patterns. Combining these gives us our objective function, which we aim to minimize:

$$L = \sum_{(u,i) \in K} \left( r_{ui} - \mathbf{p}_u^\top \mathbf{q}_i \right)^2 + \lambda \left( \lVert \mathbf{p}_u \rVert^2 + \lVert \mathbf{q}_i \rVert^2 \right)$$

Here:

- $K$ is the set of all $(u, i)$ pairs for which a rating is known.
- $r_{ui}$ is the known rating that user $u$ gave to item $i$.
- $\mathbf{p}_u$ and $\mathbf{q}_i$ are the latent factor vectors for user $u$ and item $i$ (rows of $P$ and $Q$).
- $\lambda$ (lambda) is the regularization strength.

This equation represents the total regularized squared error. Our task is to find the values for all $\mathbf{p}_u$ and $\mathbf{q}_i$ that make $L$ as small as possible.
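The regularized squared error described above can be computed directly. A short sketch, with toy ratings and a regularization strength of 0.02 chosen purely for illustration:

```python
import numpy as np

# Toy data: (user, item, rating) triples and small random factor matrices.
ratings = [(0, 0, 4.0), (0, 1, 3.0), (1, 0, 5.0)]
rng = np.random.default_rng(seed=0)
P = rng.normal(scale=0.1, size=(2, 3))  # user factors
Q = rng.normal(scale=0.1, size=(2, 3))  # item factors
lam = 0.02                              # regularization strength (assumed value)

def objective(ratings, P, Q, lam):
    """Total regularized squared error over all known ratings."""
    total = 0.0
    for u, i, r in ratings:
        err = r - P[u] @ Q[i]                       # prediction error
        reg = lam * (P[u] @ P[u] + Q[i] @ Q[i])     # penalty on factor magnitudes
        total += err ** 2 + reg
    return total

print(objective(ratings, P, Q, lam))
```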
Gradient descent is an iterative optimization algorithm used to find the minimum of a function. The main idea is to start with an initial guess for our parameters (the elements of $P$ and $Q$) and repeatedly adjust them in the direction that most steeply decreases the error. We determine this direction by calculating the gradient, or the partial derivative, of the error function with respect to each parameter.
However, calculating the gradient using all known ratings in every step (batch gradient descent) would be extremely slow for large datasets. This is where the "stochastic" part of SGD becomes useful. Instead of using the entire dataset, SGD updates the parameters by considering just one training example at a time. For our recommendation model, a single training example is a single known rating, $r_{ui}$.
For each rating $r_{ui}$, we perform the following steps:

1. Predict the rating using the current factor vectors: $\hat{r}_{ui} = \mathbf{p}_u^\top \mathbf{q}_i$.
2. Compute the prediction error against the known rating.
3. Adjust both $\mathbf{p}_u$ and $\mathbf{q}_i$ in the direction that reduces this error.
The update rules, derived from the partial derivatives of our objective function, are straightforward. First, we calculate the prediction error:

$$e_{ui} = r_{ui} - \hat{r}_{ui} = r_{ui} - \mathbf{p}_u^\top \mathbf{q}_i$$
Then, we use this error to adjust the factor vectors for the user $u$ and item $i$:

$$\mathbf{p}_u \leftarrow \mathbf{p}_u + \alpha \left( e_{ui} \, \mathbf{q}_i - \lambda \, \mathbf{p}_u \right)$$

$$\mathbf{q}_i \leftarrow \mathbf{q}_i + \alpha \left( e_{ui} \, \mathbf{p}_u - \lambda \, \mathbf{q}_i \right)$$
The parameter $\alpha$ (alpha) is the learning rate, which controls the size of the step we take. A small learning rate leads to slow but stable convergence, while a large one can speed up learning but risks overshooting the minimum.
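A single update step translates almost line-for-line into NumPy. In this sketch the old vectors are read before either is overwritten, so both updates use the same error; the values of `alpha` and `lam` are illustrative:

```python
import numpy as np

def sgd_step(p_u, q_i, r_ui, alpha=0.01, lam=0.02):
    """One SGD update for a single known rating r_ui.

    Returns the updated copies of the user and item factor vectors.
    """
    e = r_ui - p_u @ q_i                        # prediction error e_ui
    p_new = p_u + alpha * (e * q_i - lam * p_u)  # update user factors
    q_new = q_i + alpha * (e * p_u - lam * q_i)  # update item factors
    return p_new, q_new

# One step should nudge the prediction toward the true rating.
p = np.array([0.1, 0.1])
q = np.array([0.1, 0.1])
p2, q2 = sgd_step(p, q, r_ui=1.0)
print(p @ q, "->", p2 @ q2)
```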
The complete algorithm involves initializing the factor matrices with small random values and then iterating through the dataset multiple times (epochs), updating the factors for each known rating.
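Putting the pieces together, the full training loop can be sketched as follows. This is a minimal NumPy implementation under assumed defaults (10 factors, $\alpha = 0.01$, $\lambda = 0.02$, 20 epochs); the function name `train_sgd` and the ratings visited in a fresh random order each epoch are implementation choices, not prescribed by the text:

```python
import numpy as np

def train_sgd(ratings, n_users, n_items, n_factors=10,
              alpha=0.01, lam=0.02, n_epochs=20, seed=0):
    """Train a matrix factorization model with SGD.

    ratings: list of (user, item, rating) triples for known ratings.
    Returns the learned factor matrices P (users) and Q (items).
    """
    rng = np.random.default_rng(seed)
    P = rng.normal(scale=0.1, size=(n_users, n_factors))
    Q = rng.normal(scale=0.1, size=(n_items, n_factors))
    for _ in range(n_epochs):
        # Visit the known ratings in a random order each epoch.
        for idx in rng.permutation(len(ratings)):
            u, i, r = ratings[idx]
            p_u, q_i = P[u].copy(), Q[i].copy()   # read before writing
            e = r - p_u @ q_i                      # prediction error
            P[u] = p_u + alpha * (e * q_i - lam * p_u)
            Q[i] = q_i + alpha * (e * p_u - lam * q_i)
    return P, Q

ratings = [(0, 0, 5.0), (0, 1, 1.0), (1, 0, 1.0), (1, 1, 5.0)]
P, Q = train_sgd(ratings, n_users=2, n_items=2, n_factors=2,
                 alpha=0.05, n_epochs=500)
print(P @ Q.T)  # predicted rating matrix, close to the known ratings
```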
By repeating this process, the vectors in $P$ and $Q$ gradually shift from their random initial values to values that encode meaningful latent features, minimizing the overall prediction error on the training data.
*Figure: The iterative update process in SGD for a single user-item rating. The error between the predicted and actual rating is used to adjust both the user's and the item's latent factor vectors.*
The performance of an SGD-trained model depends heavily on its hyperparameters. The two most important ones in our model are:

- The learning rate $\alpha$, which controls the size of each update step.
- The regularization strength $\lambda$, which controls how strongly large factor values are penalized.
Finding the right combination of these hyperparameters usually requires experimentation and techniques like grid search or random search, which we will examine when we evaluate our models. By mastering SGD, you can train matrix factorization models that are both scalable and highly effective at revealing the hidden preferences of users.