As we've seen, techniques like TF-IDF and N-grams can transform text into numerical feature vectors. However, this often results in extremely high-dimensional data. If our vocabulary contains tens or hundreds of thousands of unique terms, our TF-IDF matrix will have that many columns. Working with such high-dimensional, sparse matrices (matrices filled mostly with zeros) presents several challenges: memory consumption grows quickly, model training becomes slower, and models are more prone to overfitting because each document uses only a tiny fraction of the available features.
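To get a feel for the scale involved, the short sketch below (a toy example assuming scikit-learn's TfidfVectorizer and a tiny made-up corpus) builds a TF-IDF matrix and reports its shape and density:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny illustrative corpus; a real corpus would contain thousands of documents
corpus = [
    "the quick brown fox jumps over the lazy dog",
    "never jump over the lazy dog quickly",
    "bright foxes leap over lazy dogs in summer",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)    # sparse matrix: documents x terms

n_docs, n_terms = X.shape
density = X.nnz / (n_docs * n_terms)    # fraction of non-zero entries

print(f"{n_docs} documents x {n_terms} terms, density = {density:.2%}")
```

With a realistic vocabulary the number of columns, and the fraction of zeros, grows dramatically.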
Dimensionality reduction techniques aim to transform these high-dimensional feature vectors into lower-dimensional representations while preserving as much meaningful information as possible. This can lead to faster training times, reduced memory usage, and sometimes even better model performance by filtering out noise. Let's look at two common linear algebra techniques used for this purpose: Principal Component Analysis (PCA) and Singular Value Decomposition (SVD).
PCA is a widely used technique for dimensionality reduction. Its core idea is to identify the directions (principal components) in the data where the variance is highest. It then projects the original data onto a new, lower-dimensional subspace defined by a selected number of these principal components.
Conceptually, PCA performs the following steps:

1. Center the data by subtracting each feature's mean (standardization may also scale each feature to unit variance).
2. Compute the covariance matrix of the centered features.
3. Compute its eigenvectors and eigenvalues; the eigenvectors are the principal components, and each eigenvalue measures how much variance its component captures.
4. Sort the components by decreasing eigenvalue and keep the top k.
5. Project the data onto those k components to obtain the lower-dimensional representation.
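A minimal NumPy sketch of these steps on a small, dense random matrix might look like this (purely illustrative; in practice you would use a library implementation such as scikit-learn's PCA):

```python
import numpy as np

# X: a small dense data matrix (rows = samples, columns = features)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

# 1. Center each feature (subtract the column mean)
X_centered = X - X.mean(axis=0)

# 2. Covariance matrix of the features
cov = np.cov(X_centered, rowvar=False)

# 3. Eigendecomposition: eigenvectors are the principal components
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 4. Sort components by explained variance (largest eigenvalue first)
order = np.argsort(eigenvalues)[::-1]
components = eigenvectors[:, order]

# 5. Project onto the top k components
k = 2
X_reduced = X_centered @ components[:, :k]   # shape (100, 2)
```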
While PCA is powerful, applying it directly to sparse TF-IDF matrices can be problematic. The initial standardization step (centering) requires subtracting the mean of each feature (column) from every data point. Since TF-IDF matrices are typically very sparse (mostly zeros), subtracting a non-zero mean makes the matrix dense, potentially leading to memory issues that negate the benefits of starting with a sparse representation.
For this reason, while PCA is a fundamental dimensionality reduction technique, variations of SVD are often preferred when working directly with sparse text features like TF-IDF.
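A rough back-of-the-envelope calculation, using assumed but plausible matrix dimensions, shows why densification is costly:

```python
# Illustrative figures: 100,000 documents, 50,000 terms, 0.1% non-zero entries
n_docs, n_terms, density = 100_000, 50_000, 0.001

# Sparse storage (CSR): roughly one value plus one column index per non-zero entry
nnz = int(n_docs * n_terms * density)
sparse_bytes = nnz * (8 + 4)            # on the order of 60 MB

# Dense storage after centering: every entry becomes an explicit float64
dense_bytes = n_docs * n_terms * 8      # on the order of 40 GB

print(f"sparse: ~{sparse_bytes / 1e6:.0f} MB, dense: ~{dense_bytes / 1e9:.0f} GB")
```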
Singular Value Decomposition is another powerful matrix factorization technique with applications in dimensionality reduction. SVD decomposes the original matrix X (e.g., our TF-IDF matrix of size m×n, where m is the number of documents and n is the number of terms) into three separate matrices:
$$X = U \Sigma V^T$$

Where:

- $U$ is an $m \times m$ orthogonal matrix whose columns (the left singular vectors) relate the documents to the latent dimensions.
- $\Sigma$ is an $m \times n$ diagonal matrix holding the singular values, ordered from largest to smallest along the diagonal.
- $V^T$ is an $n \times n$ orthogonal matrix whose rows (the right singular vectors) relate the latent dimensions to the terms.
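A quick NumPy check of this identity on a small random matrix (purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))          # a small m x n matrix

# Full SVD: U is m x m, s holds the singular values, Vt is n x n
U, s, Vt = np.linalg.svd(X, full_matrices=True)

# Rebuild X = U @ Sigma @ Vt by embedding s into an m x n diagonal matrix
Sigma = np.zeros((6, 4))
np.fill_diagonal(Sigma, s)
assert np.allclose(X, U @ Sigma @ Vt)
```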
For dimensionality reduction, we use a variation called Truncated SVD. Instead of computing the full decomposition, Truncated SVD calculates only the top k singular values and the corresponding vectors in U and V.
$$X \approx U_k \Sigma_k V_k^T$$

Here, $U_k$ is $m \times k$, $\Sigma_k$ is $k \times k$ (containing the top $k$ singular values), and $V_k^T$ is $k \times n$. The transformed, lower-dimensional representation of the original data is often taken as $X_k = U_k \Sigma_k$ (or sometimes just $U_k$), which results in an $m \times k$ matrix.
Conceptual overview of Truncated SVD for dimensionality reduction. The original matrix is decomposed, and then only the top k components corresponding to the largest singular values are retained to form a lower-dimensional representation.
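One way to compute only the top k singular triplets of a sparse matrix is SciPy's svds; the sketch below uses a random sparse matrix as a stand-in for a real TF-IDF matrix:

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

# Stand-in sparse matrix; in practice, the output of TfidfVectorizer
X_tfidf = sparse_random(1000, 20000, density=0.001, format="csr", random_state=0)

k = 100
U_k, s_k, Vt_k = svds(X_tfidf, k=k)     # only the top k singular triplets

# svds returns singular values in ascending order; reverse to the usual convention
order = np.argsort(s_k)[::-1]
U_k, s_k, Vt_k = U_k[:, order], s_k[order], Vt_k[order, :]

# Lower-dimensional representation: X_k = U_k @ diag(s_k), shape (1000, 100)
X_k = U_k * s_k
```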
A significant advantage of Truncated SVD, especially implementations found in libraries like scikit-learn (TruncatedSVD), is that it can work directly with sparse matrices without requiring the problematic centering step needed for PCA. This makes it well-suited for large TF-IDF or count matrices commonly encountered in NLP.
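A minimal sketch of this workflow, assuming a tiny made-up corpus in place of a real document collection:

```python
from scipy.sparse import issparse
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny illustrative corpus; a real corpus would contain thousands of documents
corpus = [
    "machine learning models learn patterns from data",
    "deep learning is a branch of machine learning",
    "cats and dogs are popular household pets",
    "dogs enjoy long walks and playing fetch",
]

vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(corpus)
print(issparse(X_tfidf))               # True: the TF-IDF matrix stays sparse

# TruncatedSVD accepts the sparse matrix directly; no centering, no densification
svd = TruncatedSVD(n_components=2, random_state=42)
X_reduced = svd.fit_transform(X_tfidf)
print(X_reduced.shape)                 # (4, 2): dense, low-dimensional output
```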
Applying SVD to term-document matrices is also known as Latent Semantic Analysis (LSA) or Latent Semantic Indexing (LSI). LSA aims to uncover underlying semantic structures ("topics" or "concepts") in the text data. The resulting dimensions (k) represent these latent concepts, and the values indicate how strongly each document relates to each concept.
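Continuing the sketch above (reusing the fitted vectorizer and svd objects), the rows of the SVD's components_ attribute can be inspected to see which terms weigh most heavily on each latent concept:

```python
import numpy as np

# Each row of components_ maps one latent concept back onto the vocabulary
terms = vectorizer.get_feature_names_out()
for i, component in enumerate(svd.components_):
    top = np.argsort(np.abs(component))[::-1][:5]   # 5 strongest terms per concept
    print(f"concept {i}:", [terms[j] for j in top])
```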
A practical question is how to choose the number of dimensions, k, to keep. There's often a trade-off: keeping too few components discards useful information and can hurt downstream accuracy, while keeping too many retains noise and erodes the efficiency gains that motivated the reduction in the first place.
Common approaches include examining the cumulative explained variance ratio and choosing the smallest k that captures a target share of the variance, treating k as a hyperparameter tuned with cross-validation on the downstream task, or starting from values that tend to work well for LSA in practice (often a few hundred components). A sketch of the explained-variance approach follows.
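This sketch again uses a random sparse matrix as a stand-in for real TF-IDF features, so the printed percentages are illustrative only:

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Stand-in sparse matrix; in practice this would be the real TF-IDF matrix
X_tfidf = sparse_random(2000, 5000, density=0.01, format="csr", random_state=0)

# Fit once with a generous upper bound on k, then inspect the variance captured
svd = TruncatedSVD(n_components=300, random_state=42)
svd.fit(X_tfidf)

cumulative = np.cumsum(svd.explained_variance_ratio_)
for k in (50, 100, 200, 300):
    print(f"k = {k:3d} captures {cumulative[k - 1]:.1%} of the variance")
```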
While dimensionality reduction is beneficial, keep these points in mind: the new dimensions are linear combinations of the original terms, so they are harder to interpret than individual words; some information is inevitably lost in the projection; the decomposition itself adds computational cost; and, like any fitted transformer, it should be learned from the training data only and then applied unchanged to validation and test data.
In summary, dimensionality reduction techniques like PCA and, more commonly for text, Truncated SVD, are valuable tools for managing the high dimensionality inherent in text feature representations like TF-IDF and N-grams. By projecting data onto a lower-dimensional space, they can make subsequent model training more efficient and potentially improve performance by focusing on the most salient information.