Singular Value Decomposition (SVD) is a foundational linear algebra technique used for matrix factorization. It provides a principled way to break down any matrix into a product of three other matrices, revealing its underlying structure and facilitating the discovery of latent factors.
Formally, SVD states that any rectangular matrix $R$ of size $m \times n$ (representing $m$ users and $n$ items) can be decomposed into the product of three matrices:

$$R = U \Sigma V^T$$
Let's examine each component:

- $U$: an $m \times m$ matrix whose columns are the left singular vectors. Its rows relate each user to the latent factors.
- $\Sigma$: an $m \times n$ diagonal matrix whose entries, the singular values, measure the strength of each latent factor.
- $V^T$: an $n \times n$ matrix whose rows are the right singular vectors. Its columns relate each item to the latent factors.

The decomposition of the user-item matrix $R$ into three distinct matrices: $U$, $\Sigma$, and $V^T$.
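To make the decomposition concrete, here is a minimal NumPy sketch on a small, fully observed toy matrix (the ratings and dimensions are made up for illustration):

```python
import numpy as np

# A small, fully observed ratings matrix: 4 users x 3 items (illustrative values)
R = np.array([
    [5.0, 3.0, 1.0],
    [4.0, 3.0, 1.0],
    [1.0, 1.0, 5.0],
    [1.0, 2.0, 4.0],
])

# Classical SVD: R = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(R, full_matrices=False)

# Multiplying the factors back together recovers the original matrix
R_reconstructed = U @ np.diag(s) @ Vt
print(np.allclose(R, R_reconstructed))  # True
print(s)  # singular values, sorted from largest to smallest
```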
There is a significant challenge when applying this "pure" form of SVD directly to recommendation problems: it requires a complete matrix with no missing values. Our user-item interaction matrix, $R$, is almost always sparse, meaning most entries are unknown. If we were to fill the missing entries with zeros, the algorithm would interpret those as actual ratings of 0, which would heavily skew the resulting factors.
Because of this, we don't use the classical SVD algorithm. Instead, we use algorithms that are inspired by SVD. These methods aim to find factor matrices that approximate the original matrix only for the ratings we know about. The goal is no longer to perfectly reconstruct $R$, but to find the latent factor matrices $P$ (for users) and $Q$ (for items) that best model the observed user-item interactions.
This modified approach is often still referred to as SVD in the recommender systems literature, though it's technically an approximation. The objective is to find $P$ and $Q$ such that their product, $PQ^T$, is a good approximation of $R$.
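As a sketch of what this looks like (the notation here is one common convention, not necessarily the exact formulation used later), the factors are chosen to minimize the squared error over only the known ratings, usually with a regularization term $\lambda$ to discourage overfitting:

$$\min_{P,\, Q} \sum_{(u, i) \in \mathcal{K}} \left( r_{ui} - p_u \cdot q_i \right)^2 + \lambda \left( \lVert p_u \rVert^2 + \lVert q_i \rVert^2 \right)$$

Here $\mathcal{K}$ denotes the set of user-item pairs with known ratings, and $p_u$ and $q_i$ are the rows of $P$ and $Q$ for user $u$ and item $i$.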
The real power of SVD in recommendations comes from using it for dimensionality reduction. The singular values in the matrix $\Sigma$ are sorted by importance. The first singular value corresponds to the most significant pattern in the data, the second to the next most significant, and so on. Many of the later singular values are often small and can be treated as noise.
We can exploit this by keeping only the top $k$ latent factors, where $k$ is a number much smaller than the original number of users or items. This process is known as Truncated SVD. We reduce the dimensionality of our matrices:

- $U_k$: an $m \times k$ matrix (the first $k$ columns of $U$).
- $\Sigma_k$: a $k \times k$ diagonal matrix (the $k$ largest singular values).
- $V_k^T$: a $k \times n$ matrix (the first $k$ rows of $V^T$).
Our new approximation of the ratings matrix, $R_k$, is:

$$R_k = U_k \Sigma_k V_k^T$$
This has two major benefits:

1. It filters out noise: keeping only the strongest patterns helps the model generalize rather than memorize individual ratings.
2. It is far more compact: we store and compute with small factor matrices instead of the full user-item matrix.
The approximation of $R$ using truncated matrices, where $k$ is the number of latent factors. This reduces the dimensionality and captures the most significant patterns.
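One way to compute such a truncated decomposition in practice is with `scipy.sparse.linalg.svds`, which returns only the top $k$ singular values and vectors. The matrix and the choice of $k = 2$ below are illustrative:

```python
import numpy as np
from scipy.sparse.linalg import svds

# Dense toy matrix for illustration; in practice R is large and sparse
R = np.array([
    [5.0, 3.0, 1.0, 1.0],
    [4.0, 3.0, 1.0, 1.0],
    [1.0, 1.0, 5.0, 4.0],
    [1.0, 2.0, 4.0, 5.0],
])

k = 2  # number of latent factors to keep

# svds computes the k largest singular values and their vectors
# (note: it returns them in ascending order of singular value)
U_k, s_k, Vt_k = svds(R, k=k)

# Rank-k approximation of the ratings matrix
R_k = U_k @ np.diag(s_k) @ Vt_k
print(R_k.round(2))
```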
Once our model has learned the user-factor matrix $P$ (analogous to $U_k$) and the item-factor matrix $Q$ (analogous to $V_k$), making a prediction is straightforward.
Each user $u$ is represented by a vector $p_u$ of length $k$, and each item $i$ is represented by a vector $q_i$ of length $k$. The predicted rating is simply the dot product of these two vectors:

$$\hat{r}_{ui} = p_u \cdot q_i$$
This dot product measures the alignment between a user's preferences and an item's characteristics in the learned latent space. A user vector with high values for factors that also have high values in an item's vector will result in a high predicted rating.
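A tiny sketch with hypothetical factor vectors (the values are made up) shows how such a prediction is computed:

```python
import numpy as np

k = 3  # number of latent factors (illustrative)

# Hypothetical learned factors for one user and one item
p_u = np.array([0.9, 0.1, 1.2])  # user's affinity for each latent factor
q_i = np.array([1.1, 0.0, 0.8])  # item's expression of each latent factor

# Predicted rating is the dot product of the two k-dimensional vectors
r_hat = p_u @ q_i
print(r_hat)  # 0.9*1.1 + 0.1*0.0 + 1.2*0.8 = 1.95
```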
The main task, then, is to find the values for the matrices $P$ and $Q$. Since we cannot use classical SVD on our sparse matrix, we must turn to other methods. We reframe the problem as an optimization task: find the latent factors that minimize the prediction error on the ratings we already know. The next section will cover how we can achieve this using an iterative optimization algorithm called Stochastic Gradient Descent.