Singular Value Decomposition (SVD) provides a way to break down any matrix A into the product $A = U \Sigma V^T$. This decomposition isn't just mathematically elegant; it turns out to be extremely useful for simplifying data, particularly for dimensionality reduction.
The core idea relies on the diagonal matrix $\Sigma$. Its diagonal entries, denoted $\sigma_1, \sigma_2, \dots, \sigma_r$ (where r is the rank of A), are the singular values of A. By convention, these are sorted in descending order: $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0$. These singular values measure the importance or "energy" captured along the directions defined by the corresponding columns of U (the left singular vectors) and V (the right singular vectors).
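As a quick check of this structure, the minimal sketch below (using a small, randomly generated matrix purely for illustration) shows that numpy.linalg.svd returns the singular values already sorted in descending order and that the factors reproduce A:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))   # small illustrative matrix

# Compact SVD: U is (6, 4), s holds the singular values, Vt is (4, 4).
U, s, Vt = np.linalg.svd(A, full_matrices=False)

print(s)                                      # sorted: s[0] >= s[1] >= ...
print(np.allclose(A, U @ np.diag(s) @ Vt))    # True: A = U Sigma V^T
```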
Dimensionality reduction using SVD works by keeping only the most significant parts of this decomposition. Specifically, we select the first k singular values (where k<r) and their associated singular vectors. We discard the remaining r−k singular values and vectors, which correspond to the directions where the data varies the least.
Mathematically, we form an approximation of A by truncating the matrices U, $\Sigma$, and V: $U_k$ keeps the first k columns of U (an m×k matrix), $\Sigma_k$ is the top-left k×k block of $\Sigma$ containing the k largest singular values, and $V_k$ keeps the first k columns of V (an n×k matrix).
The lower-rank approximation of A is then given by:
$$A_k = U_k \Sigma_k V_k^T$$
This matrix $A_k$ has rank k and is the best rank-k approximation of the original matrix A, in the sense that it minimizes the Frobenius norm of the difference, $\|A - A_k\|_F$. This optimality is a fundamental property often referred to as the Eckart-Young theorem, although the outcome matters more than the name: SVD gives you the optimal way to compress your matrix A into a rank-k representation.
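A minimal sketch of the truncation in NumPy follows; the matrix, its size, and the choice of k are purely illustrative. It also checks the Eckart-Young property numerically, since the Frobenius error of $A_k$ equals the square root of the sum of the squared discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 20))   # hypothetical data matrix
k = 5                            # rank of the approximation

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep the first k columns of U, the first k singular values,
# and the first k rows of V^T.
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Frobenius error equals the energy in the discarded singular values.
frob_error = np.linalg.norm(A - A_k, "fro")
print(np.isclose(frob_error, np.sqrt(np.sum(s[k:] ** 2))))   # True
```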
Think of the original m×n matrix A as representing m data points, each with n features (or vice versa). SVD finds new, orthogonal bases for the row space (via V) and column space (via U) of A. The transformation A essentially maps vectors from the row space basis to the column space basis, scaling them by the singular values.
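The sketch below illustrates this mapping for a single direction: applying A to the i-th right singular vector yields the i-th left singular vector scaled by $\sigma_i$ (again, a random matrix is used purely for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 5))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

i = 0
v_i = Vt[i, :]    # i-th right singular vector (row of V^T)
u_i = U[:, i]     # i-th left singular vector

# A maps v_i to sigma_i * u_i.
print(np.allclose(A @ v_i, s[i] * u_i))   # True
```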
By setting the smaller singular values to zero (effectively what truncating to Ak does), we project the original data onto a lower-dimensional subspace spanned by the principal directions associated with the largest singular values. We retain the most significant variations while discarding the less informative ones.
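The reduced representation of the data itself is just these projected coordinates. In the sketch below (hypothetical data and k), each of the m rows of A is expressed by k numbers instead of n; the two equivalent forms $A V_k$ and $U_k \Sigma_k$ give the same result.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 20))   # hypothetical data: 100 samples, 20 features
k = 5

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Coordinates of each sample along the k leading directions: shape (100, 5).
Z = A @ Vt[:k, :].T        # project the rows of A onto the top-k directions
Z_alt = U[:, :k] * s[:k]   # equivalently, U_k Sigma_k

print(np.allclose(Z, Z_alt))   # True
```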
The choice of k, the number of dimensions to retain, is application-dependent. Common strategies include inspecting a scree plot of the sorted singular values and looking for an "elbow" where they level off, choosing the smallest k whose cumulative explained variance (energy) exceeds a target threshold such as 90-95%, or fixing k based on downstream constraints like storage or model input size; a sketch of the threshold approach follows the figure below.
Figure: a typical plot showing the sorted singular values decreasing rapidly while the cumulative variance explained quickly approaches 100%. This helps in selecting a suitable value for k.
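As a concrete example of the threshold strategy, the following sketch (with a hypothetical random matrix and an arbitrarily chosen 95% threshold) picks the smallest k that retains the desired share of the total squared singular-value energy:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 50))   # hypothetical data matrix

_, s, _ = np.linalg.svd(A, full_matrices=False)

# Fraction of total "energy" (sum of squared singular values)
# retained by the first k components.
energy = np.cumsum(s ** 2) / np.sum(s ** 2)

# Smallest k whose cumulative energy exceeds the chosen threshold.
# (With real, structured data the energy usually concentrates in far
# fewer components than with this random example.)
threshold = 0.95
k = int(np.searchsorted(energy, threshold) + 1)
print(k, energy[k - 1])
```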
Dimensionality reduction using SVD is closely related to Principal Component Analysis (PCA). In fact, SVD provides a numerically stable way to compute the principal components of a dataset. If the data matrix A has its columns centered (mean subtracted), the right singular vectors (columns of V) correspond to the principal directions, and the squared singular values ($\sigma_i^2$) are proportional to the variance captured by each principal component. Applying the truncated SVD $A_k = U_k \Sigma_k V_k^T$ effectively projects the data onto the first k principal components.
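A minimal sketch of this correspondence, using a hypothetical dataset: after centering the columns, the rows of $V^T$ are the principal directions and $\sigma_i^2 / (m - 1)$ gives the variance explained by each component.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))   # hypothetical dataset: 200 samples, 10 features

# Center each column (feature) so the SVD corresponds to PCA.
Xc = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Rows of Vt are the principal directions; variance captured by
# component i is sigma_i^2 / (m - 1).
explained_variance = s ** 2 / (X.shape[0] - 1)

k = 3
scores = Xc @ Vt[:k, :].T   # data expressed in the first k principal components
print(explained_variance[:k])
print(scores.shape)          # (200, 3)
```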
Using SVD for dimensionality reduction offers several advantages: the truncated decomposition is the provably optimal rank-k approximation in the Frobenius-norm sense, it requires no labels and applies to any real matrix, discarding the smallest singular values often filters out noise, and mature, numerically stable implementations are widely available.
However, there are considerations: computing the full SVD of a large matrix is expensive (roughly proportional to m·n·min(m, n) for an m×n matrix), the resulting components are linear combinations of the original features and can be hard to interpret, the data should be centered if a PCA-style variance interpretation is desired, and the choice of k trades compression against information loss.
In practice, the full decomposition is readily available through NumPy (numpy.linalg.svd) and SciPy (scipy.linalg.svd).
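When only a few components are needed from a very large matrix, computing the full decomposition can be avoided. The minimal sketch below (matrix sizes and k are purely illustrative) uses scipy.sparse.linalg.svds, which computes only the k largest singular triplets:

```python
import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)
A = rng.normal(size=(5000, 300))   # hypothetical large data matrix

k = 10
# Compute only the k largest singular triplets, which is much cheaper
# than a full SVD when k is small relative to the matrix dimensions.
U, s, Vt = svds(A, k=k)

# svds does not guarantee descending order, so sort explicitly.
order = np.argsort(s)[::-1]
U, s, Vt = U[:, order], s[order], Vt[order, :]

A_k = U @ np.diag(s) @ Vt   # rank-k approximation without the full SVD
print(s[:3])
```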
Despite the computational cost of the decomposition itself, SVD remains a fundamental tool for reducing the complexity of datasets in many machine learning pipelines, enabling more efficient processing and sometimes leading to better model generalization by filtering out noise.