When working with datasets, especially those with numerous features, you might encounter challenges often referred to as the "curse of dimensionality." High-dimensional spaces behave differently from our intuitive understanding of 2D or 3D space, and this can make machine learning tasks more difficult. Models can become overly complex, computation times can increase significantly, and patterns can be harder to discern. Principal Component Analysis (PCA) is a widely used technique to address this by reducing the number of features in your dataset while aiming to preserve as much of the original data's variability as possible.
PCA achieves dimensionality reduction by transforming your original features into a new set of features, called principal components. These principal components are linear combinations of the original features and are designed to be uncorrelated with each other. The "principal" in their name signifies their importance: the first principal component captures the largest possible variance in the data, the second principal component captures the next largest variance (subject to being orthogonal, or perpendicular, to the first), and so on. By selecting a subset of these principal components, typically those that explain most of the variance, you can create a lower-dimensional representation of your data.
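Concretely, for a single observation with feature values x_1, x_2, …, x_d, the j-th principal component score is a weighted sum of those values, with a weight vector w_j that PCA estimates from the data (this is just the standard way of writing such a linear combination):

$$z_j = w_j^\top x = w_{j1} x_1 + w_{j2} x_2 + \cdots + w_{jd} x_d$$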
Imagine your data points scattered in a multi-dimensional cloud. PCA essentially tries to find the best orientation of axes to view this cloud. The first principal component is the direction in which the cloud is most spread out. The second is the next most spread-out direction, perpendicular to the first.
The mathematical foundation of PCA lies in the eigendecomposition of the data's covariance matrix (or the singular value decomposition of the data matrix, which is often more numerically stable). Let X be your data matrix where rows are samples and columns are features. In outline: center (and usually standardize) each feature, compute the covariance matrix of the centered data, find its eigenvectors and eigenvalues, sort the eigenvectors by decreasing eigenvalue, keep the top k eigenvectors as the columns of a projection matrix W_k, and project the data onto them.
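Written out, with X_c for the centered data, Σ for its covariance matrix, w_i and λ_i for the eigenvectors and eigenvalues, W_k for the matrix whose columns are the top k eigenvectors, and n for the number of samples (a standard formulation, stated here for reference):

$$\Sigma = \frac{1}{n-1} X_c^\top X_c, \qquad \Sigma\, w_i = \lambda_i\, w_i \quad (\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d), \qquad W_k = [\, w_1 \;\; w_2 \;\; \cdots \;\; w_k \,], \qquad X_{\text{pca}} = X_c\, W_k$$

The eigenvector w_1 associated with the largest eigenvalue λ_1 points in the direction of greatest spread described above, w_2 in the next most spread-out orthogonal direction, and so on.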
The result is a new dataset X_pca with k features. These new features are the principal components, and they are ordered by the amount of variance they capture from the original data.
Julia's ecosystem provides powerful tools for statistical analysis. For PCA, the MultivariateStats.jl package is commonly used. Let's walk through an example.
First, ensure you have MultivariateStats.jl installed. If not, you can add it using Julia's package manager:
using Pkg
Pkg.add("MultivariateStats")
Pkg.add("Statistics") # For mean and std, if manually standardizing
Pkg.add("Random") # For generating sample data
Now, let's apply PCA to a small synthetic dataset. We'll generate data with 3 features and reduce it to 2 principal components.
using MultivariateStats
using Statistics
using Random
# Generate some synthetic 3D data
Random.seed!(42) # for reproducibility
n_samples = 100
X1 = randn(n_samples)
X2 = 0.5 * X1 + 0.5 * randn(n_samples)
X3 = 0.3 * X1 - 0.2 * X2 + 0.3 * randn(n_samples)
# MultivariateStats.jl expects data with features in rows and samples in columns (d x n)
data = hcat(X1, X2, X3)'
# 1. Standardize the data (important for PCA)
# For data where features are in rows:
means = mean(data, dims=2) # Calculate mean for each feature (row)
stds = std(data, dims=2) # Calculate std dev for each feature (row)
data_std = (data .- means) ./ stds
# 2. Fit the PCA model
# We can specify the number of output dimensions (maxoutdim)
# Or, specify the proportion of variance to retain (e.g., pratio=0.95)
# Here, we reduce to 2 dimensions.
M = fit(PCA, data_std, maxoutdim=2)
# 3. Transform the data to the new lower-dimensional space
data_pca = transform(M, data_std)
# The transformed data (data_pca) now has 2 features (rows) and n_samples columns.
# If you prefer samples as rows and features as columns for further work:
data_pca_samples_as_rows = data_pca'
println("Original data dimensions (features x samples): ", size(data))
println("Standardized data dimensions: ", size(data_std))
println("PCA model principal components (projection matrix W):")
# The projection matrix contains the eigenvectors as columns
# These define the directions of the principal components in the original feature space
println(projection(M))
println("Transformed data dimensions (new features x samples): ", size(data_pca))
println("Explained variance ratio per component: ", principalratio(M))
println("Cumulative explained variance: ", cumsum(principalratio(M)))
In this example, data is a 3×100 matrix (3 features, 100 samples). After PCA, data_pca becomes a 2×100 matrix. The projection(M) function returns the projection matrix W_k, whose columns are the eigenvectors that define the principal component directions in the original feature space. Dividing principalvars(M) by the total variance (tprincipalvar(M) + tresidualvar(M)) gives the proportion of variance explained by each selected principal component, and cumsum of that vector shows the cumulative explained variance; the related principalratio(M) returns a single number, the overall fraction of variance retained by the principal subspace.
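One practical follow-up: if you need to map the reduced data back to the original (standardized) feature space, MultivariateStats.jl provides a reconstruct function. The sketch below continues the example above (data_reconstructed and mse are illustrative names, and the mean squared error is just one way to gauge how much information the dropped component carried):
# Assumes `using MultivariateStats, Statistics` and the variables M, data_pca, data_std from above
data_reconstructed = reconstruct(M, data_pca)   # 3 x 100: back in the standardized feature space
# The reconstruction is approximate: the variance of the discarded component is lost
mse = mean((data_std .- data_reconstructed) .^ 2)
println("Mean squared reconstruction error: ", mse)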
A common question when applying PCA is how many principal components to keep. There isn't a single definitive answer, but here are two widely used approaches:
Cumulative Explained Variance: Decide on a threshold for the total variance you want to retain. For example, you might aim to keep enough components to explain 90%, 95%, or 99% of the original variance. You can calculate the cumulative sum of explained variance ratios (as shown in the Julia example using cumsum(explained)) and pick the smallest k that meets your threshold; a short code sketch of this approach appears after the scree plot discussion below.
Scree Plot: A scree plot visualizes the eigenvalues (or the proportion of variance explained) for each principal component, sorted in descending order. Typically, you'll observe a sharp drop in explained variance after the first few components, followed by a leveling off for later components. This point where the slope changes is often called the "elbow." One heuristic is to keep the components before this elbow, as these are considered to capture the most significant variance.
Here's an example of what a scree plot might show, illustrating the variance explained by five principal components:
A scree plot displaying the proportion of variance explained by individual principal components (bars) and the cumulative variance (line). PC1 explains 45% of the variance, PC2 explains 30%, and so on. The "elbow" might be considered after PC2 or PC3. To capture, for instance, at least 90% of the variance, you would select the first three principal components.
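To make the cumulative-variance approach concrete, here is a small sketch that continues the example above. The variable names M_full, total_var, cum_explained, k, and M_auto, and the 90% threshold, are illustrative; pratio is the fit keyword mentioned in the earlier code comments.
# Assumes `using MultivariateStats` and the standardized data_std from the example above
# Fit PCA without discarding anything so we can inspect the full variance profile
M_full = fit(PCA, data_std, maxoutdim=3, pratio=1.0)
# Per-component share of the total variance, and its running total
total_var = tprincipalvar(M_full) + tresidualvar(M_full)
cum_explained = cumsum(principalvars(M_full) ./ total_var)
# Smallest k whose cumulative explained variance reaches a 90% threshold
k = findfirst(>=(0.90), cum_explained)
println("Components needed for 90% of the variance: ", k)
# Alternatively, let fit choose the dimension from a variance-ratio target
M_auto = fit(PCA, data_std, pratio=0.90)
println("Components selected with pratio=0.90: ", size(projection(M_auto), 2))
Passing pratio directly is usually the simplest route when you already know the fraction of variance you want to keep.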
PCA offers several advantages: it reduces the number of features while retaining most of the data's variance, the resulting components are uncorrelated (which helps models that are sensitive to multicollinearity), it can filter out low-variance noise, it often speeds up training of downstream models, and it makes it possible to visualize high-dimensional data in two or three dimensions.
However, there are also some points to keep in mind: the principal components are linear combinations of the original features and are therefore harder to interpret; PCA only captures linear structure in the data; it is sensitive to feature scaling, which is why standardization matters; and the directions of highest variance are not guaranteed to be the most useful for a particular prediction task, so relevant information can still be lost.
PCA is a foundational technique in data analysis and machine learning, particularly useful when dealing with datasets that have a large number of features. It provides a systematic way to reduce complexity while attempting to retain the most significant information present in the data. As you continue your work, you'll find PCA applied in various contexts, from preprocessing data for supervised learning algorithms to exploratory data analysis and visualization.