Imagine your dataset has hundreds of features. For example, a real estate dataset might include square footage, number of rooms, age of the building, distance to the nearest school, local crime rate, and dozens of other variables. While more data can be useful, having too many features, or dimensions, can make it difficult for machine learning models to learn effectively. It increases computation time and can lead to the "curse of dimensionality": as the number of dimensions grows, the data becomes increasingly sparse, and models become more likely to fit noise instead of the underlying signal.
This is where dimensionality reduction comes in. The goal is to reduce the number of features while preserving as much of the important information in the dataset as possible. One of the most common and effective techniques for this is Principal Component Analysis (PCA).
At its core, PCA is a technique that transforms your data into a new set of features, called principal components. These new components are ordered by how much of the original data's variance they capture. The first principal component (PC1) is engineered to capture the largest possible variance. The second principal component (PC2) captures the next largest variance, with the condition that it must be orthogonal (perpendicular) to the first. This continues for all components.
This process gives you a ranked list of components. To reduce dimensionality, you simply keep the first few components that capture the majority of the information and discard the rest.
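To make this concrete, here is a minimal sketch using scikit-learn's PCA class; the synthetic five-feature dataset is purely illustrative. It fits PCA, prints how much variance each ranked component explains, and then keeps only enough components to retain roughly 95% of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

# Illustrative dataset: 200 samples with 5 highly correlated features.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X = np.hstack([base + 0.1 * rng.normal(size=(200, 1)) for _ in range(5)])

# Fit PCA and inspect how much variance each ranked component captures.
pca = PCA(n_components=5)
pca.fit(X)
print(pca.explained_variance_ratio_)  # sorted from largest to smallest

# Keep only as many components as needed to retain ~95% of the variance.
pca_reduced = PCA(n_components=0.95)
X_reduced = pca_reduced.fit_transform(X)
print(X_reduced.shape)  # far fewer columns than the original 5
```

Because the synthetic features are built from the same underlying signal, the first component captures almost all of the variance and the reduced dataset ends up with a single column.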
This is where the eigenvectors and eigenvalues we learned about in Chapter 5 become incredibly useful. PCA works by analyzing the relationships between the features in your dataset. These relationships are captured in a matrix called the covariance matrix.
The eigenvectors of this covariance matrix point in the directions of the highest variance in your data. In fact, these eigenvectors are the principal components. The eigenvector with the largest corresponding eigenvalue is the first principal component, as it points in the direction of the greatest "spread" in the data. The eigenvector with the second-largest eigenvalue is the second principal component, and so on.
The eigenvalues themselves tell you the amount of variance captured by each principal component. A large eigenvalue means its corresponding eigenvector (and therefore that principal component) accounts for a large share of the total variance in the data.
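A small NumPy sketch can show this relationship directly. The two-feature dataset below is a made-up example, but the steps it walks through, centering the data, forming the covariance matrix, and taking its eigendecomposition, are exactly the quantities described above.

```python
import numpy as np

# Illustrative 2-feature dataset with strong correlation between the features.
rng = np.random.default_rng(1)
x = rng.normal(size=300)
y = 0.8 * x + 0.2 * rng.normal(size=300)
X = np.column_stack([x, y])

# Center the data, then compute the covariance matrix of the features.
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)  # shape (2, 2)

# Eigendecomposition: eigenvectors are the principal components,
# eigenvalues are the variance captured along each one.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order, so reverse to put PC1 first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

print("variance per component:", eigenvalues)
print("PC1 direction:", eigenvectors[:, 0])
```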
Let's look at a simple 2D dataset. Imagine plotting two features against each other, and the data points form an elongated cloud.
(Figure: a scatter plot of the two features. The principal components, drawn as red and orange arrows, identify the axes of greatest variance in the data. PC1 captures the most spread, while PC2 captures the next most.)
In this plot, you can see that the data varies most along the direction of the red arrow (PC1). There is much less variation along the orange arrow (PC2). If we wanted to reduce this dataset from two dimensions to one, we could project all the data points onto the line defined by PC1. We would lose the information related to PC2, but since PC1 captures most of the variance, we would retain the most important structure of our data.
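Assuming the same kind of 2D data as in the figure, that projection is just a dot product with the leading eigenvector. The sketch below reduces each 2D point to a single coordinate along the PC1 direction.

```python
import numpy as np

# Same illustrative 2D setup as before.
rng = np.random.default_rng(1)
x = rng.normal(size=300)
y = 0.8 * x + 0.2 * rng.normal(size=300)
X = np.column_stack([x, y])
X_centered = X - X.mean(axis=0)

# PC1 is the eigenvector of the covariance matrix with the largest eigenvalue.
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X_centered, rowvar=False))
pc1 = eigenvectors[:, np.argmax(eigenvalues)]  # direction of greatest spread

# Projecting onto PC1 gives one number per sample: the data is now 1D.
X_1d = X_centered @ pc1
print(X.shape, "->", X_1d.shape)  # (300, 2) -> (300,)
```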
Here is a high-level summary of how PCA is performed (a short sketch implementing these steps follows the list):

1. Standardize the data so that each feature contributes on a comparable scale.
2. Compute the covariance matrix of the features.
3. Compute the eigenvectors and eigenvalues of the covariance matrix.
4. Sort the eigenvectors by their eigenvalues in descending order; these are the principal components, ranked by how much variance they capture.
5. Select the top components and project the data onto them to form the reduced dataset.
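As a rough sketch, these steps translate almost line for line into NumPy. The function name pca_fit_transform and the choice to only center (rather than fully standardize) the data are illustrative assumptions, not a fixed recipe.

```python
import numpy as np

def pca_fit_transform(X, n_components):
    """Minimal PCA sketch following the steps above (illustrative, not optimized)."""
    # 1. Center the data (optionally also divide by each feature's standard deviation).
    X_centered = X - X.mean(axis=0)

    # 2. Covariance matrix of the features.
    cov = np.cov(X_centered, rowvar=False)

    # 3. Eigenvalues and eigenvectors of the covariance matrix.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)

    # 4. Sort components by descending eigenvalue (variance captured).
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

    # 5. Keep the top components and project the data onto them.
    W = eigenvectors[:, :n_components]
    return X_centered @ W, eigenvalues

# Example: reduce an illustrative 5-feature dataset to 2 components.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
X_reduced, variances = pca_fit_transform(X, n_components=2)
print(X_reduced.shape)  # (100, 2)
print(variances)        # variance captured by each ranked component
```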
By applying PCA, we use the foundations of linear algebra, specifically eigenvalues and eigenvectors, to simplify complex datasets. This makes them easier to visualize and faster to process, and it can often improve machine learning model performance by filtering out noise.