Standard Gaussian Process (GP) methods offer a powerful, non-parametric Bayesian approach for tasks like regression and classification, providing not just predictions but also principled uncertainty estimates. However, their practical application is often hindered by computational demands, especially as dataset sizes grow.
Recall from the section on the GP regression formulation that the core computations involve the N×N covariance matrix K_NN, constructed by evaluating the kernel function between all pairs of the N training data points. The key operations are factorizing or inverting K_NN (with the observation-noise variance added to its diagonal) for training and prediction, which costs O(N³) time; computing its log-determinant for the marginal likelihood; and storing the dense matrix itself, which requires O(N²) memory.
The O(N³) complexity for training makes standard GPs computationally infeasible for datasets where N grows beyond a few thousand data points. Modern datasets frequently contain hundreds of thousands or millions of samples, necessitating approximation techniques to make GPs practical at scale.
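To make the scaling concrete, here is a minimal NumPy sketch of the dominant training computation: the log marginal likelihood evaluated via a Cholesky factorization of the dense N×N matrix. The kernel choice, helper names, and toy data are illustrative, not tied to any particular library.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel evaluated between all pairs of rows of A and B."""
    sq_dists = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return variance * np.exp(-0.5 * sq_dists / lengthscale**2)

def exact_gp_log_marginal_likelihood(X, y, noise_var=0.1):
    N = X.shape[0]
    K = rbf_kernel(X, X) + noise_var * np.eye(N)   # dense N x N matrix: O(N^2) memory
    L, lower = cho_factor(K, lower=True)           # Cholesky factorization: O(N^3) time
    alpha = cho_solve((L, lower), y)               # solves (K_NN + noise) alpha = y
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))           # -0.5 * log det(K), via the Cholesky diagonal
            - 0.5 * N * np.log(2 * np.pi))

# Toy data: fine at N = 1,000, but the cubic cost makes N = 100,000 impractical.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(1000, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(1000)
print(exact_gp_log_marginal_likelihood(X, y))
```

Doubling N roughly quadruples the memory footprint and increases the factorization time by about a factor of eight, which is why a few thousand points is the practical ceiling for exact inference.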
The main computational bottleneck is the manipulation of the dense N×N covariance matrix K_NN. Approximation methods aim to circumvent this bottleneck, primarily by reducing the effective number of points the model depends on or by introducing structural assumptions into the covariance matrix. The most successful and widely used family of techniques relies on sparse approximations.
Sparse methods introduce a smaller set of M inducing points (M ≪ N), denoted Z = {z_1, ..., z_M}, which act as representative points or a summary of the full dataset. The core idea is that the function's behavior across the entire input space can be adequately captured by its values u = f(Z) at these M locations. The inference and prediction process is then reformulated to depend computationally on M rather than N.
Instead of conditioning directly on the full dataset y, sparse GP methods condition the GP f on the function values u at the inducing locations Z. The approximation hinges on how the relationship between u and y is handled. The inducing locations Z themselves can be chosen in various ways: as a subset of the training inputs X, placed on a grid, selected randomly, or, most effectively, optimized as part of the model training process.
Let's examine some common sparse approximation techniques.
The most straightforward baseline, often called the Subset of Data (SoD) approach, is to simply select a random subset of M data points from the original N points and perform exact GP inference using only this subset.
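A minimal sketch of this baseline, reusing an exact GP routine such as the one sketched earlier (all names and data here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 100_000, 500
X = rng.uniform(-3, 3, size=(N, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(N)

# Keep a random subset of M points and discard the rest of the data.
subset = rng.choice(N, size=M, replace=False)
X_sub, y_sub = X[subset], y[subset]

# Exact GP training and prediction now run on M points: O(M^3) instead of O(N^3),
# e.g. exact_gp_log_marginal_likelihood(X_sub, y_sub) from the previous sketch.
```

The obvious drawback is that the discarded N − M points contribute nothing, so this mainly serves as a sanity-check baseline for the more careful approximations below.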
The Subset of Regressors (SoR) method uses M inducing points Z (not necessarily a subset of X) and bases predictions on these points alone. The predictive mean is approximated as E[f_* ∣ y] ≈ K_*M K_MM⁻¹ u_M, where u_M represents the GP values at Z. Various ways exist to estimate u_M from the training data y, often involving further approximations themselves.
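The following NumPy sketch illustrates this structure for one common choice of u_M, the posterior mean of u under the SoR model; the kernel helper, jitter, and toy data are illustrative assumptions rather than a reference implementation.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return variance * np.exp(-0.5 * sq / lengthscale**2)

def sor_predict_mean(X_test, X, y, Z, noise_var=0.1):
    """SoR predictive mean: K_*M K_MM^{-1} u_M, with u_M estimated from the data."""
    K_MM = rbf_kernel(Z, Z) + 1e-6 * np.eye(Z.shape[0])  # jitter for stability
    K_MN = rbf_kernel(Z, X)
    K_sM = rbf_kernel(X_test, Z)
    # One standard estimate of u_M: the SoR posterior mean,
    # u_M = K_MM (noise_var * K_MM + K_MN K_NM)^{-1} K_MN y.
    A = noise_var * K_MM + K_MN @ K_MN.T                 # M x M system, O(N M^2) to build
    u_M = K_MM @ np.linalg.solve(A, K_MN @ y)
    # The prediction depends only on the M inducing values.
    return K_sM @ np.linalg.solve(K_MM, u_M)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(2000, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(2000)
Z = np.linspace(-3, 3, 20)[:, None]                      # M = 20 inducing locations
X_test = np.linspace(-3, 3, 5)[:, None]
print(sor_predict_mean(X_test, X, y, Z))
```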
The Fully Independent Training Conditional (FITC) method introduces a specific structural assumption: given the inducing variables u, the training outputs y_i are conditionally independent. This approximates the true covariance matrix K_NN with a sum of a low-rank term and a diagonal matrix, K_NN ≈ Q_NN + diag(Λ), where Q_NN = K_NM K_MM⁻¹ K_MN and the diagonal term Λ = diag(K_NN − Q_NN) restores the exact prior variances on the diagonal.
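A small NumPy sketch of this structure, with illustrative kernel and data (the full N×N approximation is materialized here only to show its form; a practical FITC implementation works with the low-rank and diagonal pieces directly):

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return variance * np.exp(-0.5 * sq / lengthscale**2)

def fitc_covariance(X, Z):
    """Low-rank-plus-diagonal FITC approximation to the prior covariance K_NN."""
    N, M = X.shape[0], Z.shape[0]
    K_MM = rbf_kernel(Z, Z) + 1e-6 * np.eye(M)       # jitter for numerical stability
    K_NM = rbf_kernel(X, Z)
    Q_NN = K_NM @ np.linalg.solve(K_MM, K_NM.T)      # low-rank Nystrom term, rank <= M
    K_NN_diag = np.full(N, 1.0)                      # exact prior variances (RBF variance = 1)
    Lam = K_NN_diag - np.diag(Q_NN)                  # diagonal correction: exact on the diagonal
    return Q_NN + np.diag(Lam)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
Z = np.linspace(-3, 3, 15)[:, None]                  # M = 15 inducing locations
K_approx = fitc_covariance(X, Z)
print(K_approx.shape, np.allclose(np.diag(K_approx), 1.0))
```

Because the approximate covariance is a rank-M update of a diagonal matrix, it can be inverted in O(NM²) time via the matrix inversion lemma, which is where the computational savings come from.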
Modern sparse GP methods are predominantly based on variational inference (VI). Instead of directly approximating the covariance matrix, these methods approximate the posterior distribution p(f∣y). They introduce a variational distribution q(u) over the inducing variables u=f(Z), where q(u) is typically chosen to be a multivariate Gaussian N(m,S). The goal is to minimize the KL divergence between the approximate posterior q(f)=∫p(f∣u)q(u)du and the true posterior p(f∣y).
This minimization is achieved by maximizing the Evidence Lower Bound (ELBO) on the log marginal likelihood log p(y):

L(θ, Z, m, S) = E_{q(f)}[log p(y ∣ f)] − KL[q(u) ∣∣ p(u)]

Here, p(u) = N(0, K_MM) is the GP prior evaluated at the inducing points Z, and θ represents the kernel hyperparameters. A significant advantage of this variational framework is that the inducing point locations Z can be treated as variational parameters and optimized alongside the kernel hyperparameters θ and the parameters of q(u) (i.e., m and S) by maximizing the ELBO using gradient-based methods. This allows the model to learn the optimal placement of inducing points to best summarize the data.
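To make the bound concrete, here is a minimal NumPy sketch of this ELBO for the Gaussian-likelihood case, where the expectation term is available in closed form. The kernel, the parameterization of S through a Cholesky-style factor, and the toy data are illustrative assumptions; a production implementation would use more careful linear algebra and automatic differentiation.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return variance * np.exp(-0.5 * sq / lengthscale**2)

def svgp_elbo(X, y, Z, m, L_S, noise_var=0.1):
    """ELBO for a Gaussian likelihood, with q(u) = N(m, S) and S = L_S L_S^T."""
    N, M = X.shape[0], Z.shape[0]
    S = L_S @ L_S.T
    K_MM = rbf_kernel(Z, Z) + 1e-6 * np.eye(M)
    K_NM = rbf_kernel(X, Z)
    K_MM_inv = np.linalg.inv(K_MM)
    A = K_NM @ K_MM_inv                               # N x M projection matrix

    # Marginals of q(f) = N(A m, K_NN - A K_MN + A S A^T); only the diagonal is needed.
    mu = A @ m
    var = 1.0 - np.einsum('nm,nm->n', A, K_NM) + np.einsum('nm,mk,nk->n', A, S, A)

    # Expected log-likelihood under q(f), closed form for a Gaussian likelihood.
    exp_loglik = np.sum(-0.5 * np.log(2 * np.pi * noise_var)
                        - 0.5 * ((y - mu) ** 2 + var) / noise_var)

    # KL divergence between the Gaussians q(u) = N(m, S) and p(u) = N(0, K_MM).
    kl = 0.5 * (np.trace(K_MM_inv @ S) + m @ K_MM_inv @ m - M
                + np.linalg.slogdet(K_MM)[1] - np.linalg.slogdet(S)[1])
    return exp_loglik - kl

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(2000, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(2000)
Z = np.linspace(-3, 3, 20)[:, None]
m = np.zeros(20)
L_S = np.eye(20)                                      # q(u) initialized to N(0, I)
print(svgp_elbo(X, y, Z, m, L_S))
```

In practice this quantity is maximized with respect to m, L_S, Z, and the kernel hyperparameters, which is exactly what libraries such as GPflow and GPyTorch automate.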
This variational free energy (VFE) framework naturally leads to stochastic optimization, making it highly scalable. The first term in the ELBO, the expected log-likelihood, can be written as a sum over the N data points: ∑_{i=1}^{N} E_{q(f_i)}[log p(y_i ∣ f_i)]. This structure is amenable to the techniques of Stochastic Variational Inference (SVI) covered in Chapter 3.
Instead of computing the expectation over the full dataset at each step, we can use mini-batches B⊂{1,...,N} to compute noisy but unbiased estimates of the ELBO's gradient:
∇L ≈ (N / ∣B∣) ∑_{i∈B} ∇E_{q(f_i)}[log p(y_i ∣ f_i)] − ∇KL[q(u) ∣∣ p(u)]

These stochastic gradients are then used to update the parameters (θ, Z, m, S) using optimizers like Adam or RMSprop.
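As a concrete sketch, the following trains a sparse variational GP with mini-batches using GPflow's SVGP model (assuming GPflow 2.x and a compatible TensorFlow 2.x installation; the toy data, batch size, and step count are placeholders). Setting num_data tells GPflow to apply the N/∣B∣ rescaling of the likelihood term shown above.

```python
import numpy as np
import tensorflow as tf
import gpflow

# Toy data standing in for a large dataset; GPflow expects 2-D arrays for inputs and targets.
rng = np.random.default_rng(0)
N, M = 100_000, 200
X = rng.uniform(-3, 3, size=(N, 1))
Y = np.sin(X) + 0.1 * rng.standard_normal((N, 1))
Z_init = np.linspace(-3, 3, M)[:, None]               # initial inducing locations

model = gpflow.models.SVGP(
    kernel=gpflow.kernels.SquaredExponential(),
    likelihood=gpflow.likelihoods.Gaussian(),
    inducing_variable=gpflow.inducing_variables.InducingPoints(Z_init),
    num_data=N,                                        # rescales the minibatch likelihood term
)

dataset = tf.data.Dataset.from_tensor_slices((X, Y)).repeat().shuffle(10_000).batch(256)
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)

@tf.function
def train_step(batch):
    with tf.GradientTape() as tape:
        loss = -model.elbo(batch)                      # noisy but unbiased estimate of -ELBO
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

for step, batch in enumerate(dataset.take(2_000)):
    loss = train_step(batch)
    if step % 500 == 0:
        print(step, float(loss))
```

Because the inducing locations Z, the kernel hyperparameters, and the variational parameters (m, S) are all trainable variables of the model, the same Adam updates move all of them.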
The number of inducing points, M, is a critical hyperparameter in sparse GP methods. It directly controls the balance between computational efficiency and the fidelity of the approximation to the full GP.
In practice, M is often determined by the available computational budget (time and memory) and tuned by monitoring performance on a validation set. Common choices range from M = 100 to M = 1000 or slightly more, depending on the dataset size and complexity. Visualizing the learned inducing point locations Z can also provide insight into whether they adequately cover the regions where the training data lies.
Figure: Comparison between an exact GP fit and a sparse approximation (SVGP) using M = 5 optimized inducing points (purple crosses) on a 1D regression task. The sparse GP effectively approximates the mean and uncertainty of the exact GP at a fraction of the computational cost by learning the optimal positions for its inducing points.
Approximation methods, especially sparse variational techniques like SVGP, are indispensable for applying Gaussian Processes to large-scale problems.
While exact GPs are preferred for smaller datasets where computation is not limiting, sparse approximations empower practitioners to use the flexible, non-parametric, and uncertainty-aware nature of Gaussian Processes on the large datasets common in contemporary machine learning. The choice among sparse methods typically involves considering the dataset size, desired accuracy, and available computational resources, with SVGP being a strong default for very large datasets.