Okay, let's transition from the abstract concept of Gaussian Processes as priors over functions to the concrete mechanics of performing regression. The core idea is elegant: if we assume our underlying function comes from a GP prior, and our observations are noisy versions of this function, then predicting the function's value at new points becomes a standard Bayesian inference problem involving multivariate Gaussians.
The Setup: Observations and the Latent Function
Recall that a Gaussian Process defines a distribution over functions. We usually specify it with a mean function $m(x)$ and a covariance function (kernel) $k(x, x')$. For simplicity, we often start by assuming a zero mean function, $m(x) = 0$.
In a regression setting, we have a training dataset consisting of $N$ input points $X = \{x_1, \dots, x_N\}$ and their corresponding observed targets $\mathbf{y} = [y_1, \dots, y_N]^T$. The fundamental assumption in GP regression is that each observed value $y_i$ is a noisy realization of an underlying latent function $f$ evaluated at the input $x_i$. Specifically, we assume:
$$y_i = f(x_i) + \epsilon_i$$
where $f \sim \mathcal{GP}(m(x), k(x, x'))$ and the noise terms $\epsilon_i$ are typically assumed to be independent and identically distributed Gaussian with zero mean and variance $\sigma_n^2$:
$$\epsilon_i \sim \mathcal{N}(0, \sigma_n^2)$$
Let $\mathbf{f} = [f(x_1), \dots, f(x_N)]^T$ be the vector of latent function values at the training inputs. According to the GP prior, this vector follows a multivariate Gaussian distribution:
$$\mathbf{f} \sim \mathcal{N}(\mathbf{m}, K)$$
where $\mathbf{m}$ is the vector of prior means $[m(x_1), \dots, m(x_N)]^T$ and $K$ is the $N \times N$ covariance matrix (or Gram matrix) evaluated at all pairs of training inputs: $K_{ij} = k(x_i, x_j)$.
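As a concrete illustration, here is a minimal NumPy sketch that builds the Gram matrix $K$ for a small 1D training set. The squared-exponential kernel and its unit hyperparameters are assumptions made purely for illustration; any valid kernel could be substituted.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, signal_var=1.0):
    """Squared-exponential kernel: k(a, b) = signal_var * exp(-||a - b||^2 / (2 * lengthscale^2))."""
    # A: (N, D), B: (M, D)  ->  (N, M) matrix of pairwise covariances
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return signal_var * np.exp(-0.5 * np.maximum(sq_dists, 0.0) / lengthscale**2)

# Small 1D training set (illustrative values)
X = np.array([[-1.5], [-0.5], [0.3], [1.2]])   # N x 1 matrix of inputs
K = rbf_kernel(X, X)                            # N x N Gram matrix, K_ij = k(x_i, x_j)
```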
Combining the GP prior on $\mathbf{f}$ and the Gaussian noise model, the marginal distribution of the observed targets $\mathbf{y}$ is also Gaussian:
$$\mathbf{y} \mid X \sim \mathcal{N}(\mathbf{m}, K + \sigma_n^2 I)$$
Here, $I$ is the $N \times N$ identity matrix, and the noise variance $\sigma_n^2$ is added to the diagonal elements of the covariance matrix $K$, reflecting the independent noise added to each observation. This distribution $p(\mathbf{y} \mid X)$ is often called the marginal likelihood or evidence.
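Continuing the sketch, adding $\sigma_n^2$ to the diagonal gives the covariance of the marginal $p(\mathbf{y} \mid X)$, and sampling from this Gaussian is one way to simulate noisy observations consistent with the model (the noise level and the zero mean are again illustrative assumptions):

```python
sigma_n = 0.1                                   # assumed observation-noise standard deviation
Ky = K + sigma_n**2 * np.eye(len(X))            # covariance of y | X, i.e. K + sigma_n^2 I

# Draw illustrative noisy targets from y | X ~ N(0, K + sigma_n^2 I)  (zero mean function assumed)
rng = np.random.default_rng(0)
y = rng.multivariate_normal(mean=np.zeros(len(X)), cov=Ky)
```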
Predicting New Points
Our goal is to predict the function value $f_*$ at a new test input point $x_*$. More generally, we may want predictions at $M$ test points $X_* = \{x_{*1}, \dots, x_{*M}\}$. Let $\mathbf{f}_* = [f(x_{*1}), \dots, f(x_{*M})]^T$ be the vector of latent function values at these test points.
The defining property of a GP is that any finite collection of function values follows a multivariate Gaussian distribution. Therefore, the joint distribution of the observed training targets $\mathbf{y}$ and the latent function values $\mathbf{f}_*$ at the test points, conditioned on the inputs $X$ and $X_*$, is also Gaussian:
$$\begin{pmatrix} \mathbf{y} \\ \mathbf{f}_* \end{pmatrix} \sim \mathcal{N}\!\left( \begin{pmatrix} \mathbf{m} \\ \mathbf{m}_* \end{pmatrix}, \begin{pmatrix} K(X, X) + \sigma_n^2 I & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) \end{pmatrix} \right)$$
Let's break down the covariance matrix components:
- $K(X, X)$: The $N \times N$ matrix of covariances between all pairs of training inputs (we previously called this $K$).
- $K(X, X_*)$: The $N \times M$ matrix of covariances between training inputs and test inputs, $K(X, X_*)_{ij} = k(x_i, x_{*j})$.
- $K(X_*, X)$: The $M \times N$ matrix of covariances between test inputs and training inputs. Note that $K(X_*, X) = K(X, X_*)^T$.
- $K(X_*, X_*)$: The $M \times M$ matrix of covariances between all pairs of test inputs, $K(X_*, X_*)_{ij} = k(x_{*i}, x_{*j})$.
And the mean vectors:
- $\mathbf{m}$: The $N \times 1$ vector of prior means at the training inputs.
- $\mathbf{m}_*$: The $M \times 1$ vector of prior means at the test inputs.
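In code, these blocks are simply further evaluations of the same kernel. Continuing the earlier sketch, with an illustrative grid of test inputs:

```python
X_star = np.linspace(-2.0, 2.0, 50).reshape(-1, 1)   # M x 1 test inputs on a grid

K_xs = rbf_kernel(X, X_star)        # K(X, X*):   N x M
K_sx = K_xs.T                       # K(X*, X):   M x N, the transpose of K(X, X*)
K_ss = rbf_kernel(X_star, X_star)   # K(X*, X*):  M x M
```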
The Predictive Distribution
We are interested in the posterior predictive distribution $p(\mathbf{f}_* \mid X, \mathbf{y}, X_*)$, which tells us the distribution of function values at the test points given the observed training data. Since the joint distribution is Gaussian, this conditional distribution is also Gaussian.
Using standard results for conditional Gaussian distributions, we can derive the mean and covariance of this predictive distribution. Assuming a zero prior mean function ($m(x) = 0$, and thus $\mathbf{m} = \mathbf{0}$ and $\mathbf{m}_* = \mathbf{0}$) for simplicity first:
The posterior predictive mean is:
$$\mathbb{E}[\mathbf{f}_* \mid X, \mathbf{y}, X_*] = K(X_*, X)\,[K(X, X) + \sigma_n^2 I]^{-1}\,\mathbf{y}$$
The posterior predictive covariance is:
$$\mathrm{Cov}[\mathbf{f}_* \mid X, \mathbf{y}, X_*] = K(X_*, X_*) - K(X_*, X)\,[K(X, X) + \sigma_n^2 I]^{-1}\,K(X, X_*)$$
If we use a non-zero prior mean function $m(x)$, the equations become slightly more general:
Posterior Predictive Mean:
$$\mathbb{E}[\mathbf{f}_* \mid X, \mathbf{y}, X_*] = \mathbf{m}_* + K(X_*, X)\,[K(X, X) + \sigma_n^2 I]^{-1}\,(\mathbf{y} - \mathbf{m})$$
Posterior Predictive Covariance:
$$\mathrm{Cov}[\mathbf{f}_* \mid X, \mathbf{y}, X_*] = K(X_*, X_*) - K(X_*, X)\,[K(X, X) + \sigma_n^2 I]^{-1}\,K(X, X_*)$$
Notice that the posterior covariance does not depend on the observed targets $\mathbf{y}$, only on the inputs $X$ and $X_*$, the kernel $k$, and the noise variance $\sigma_n^2$.
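These equations translate almost line-for-line into code. The sketch below continues the earlier snippets, assumes the zero-mean case, and uses np.linalg.solve rather than forming the inverse explicitly:

```python
# Posterior predictive mean: K(X*, X) [K(X, X) + sigma_n^2 I]^{-1} y
alpha = np.linalg.solve(Ky, y)      # solves (K + sigma_n^2 I) alpha = y
mean_star = K_sx @ alpha            # length-M vector of predictive means

# Posterior predictive covariance:
#   K(X*, X*) - K(X*, X) [K(X, X) + sigma_n^2 I]^{-1} K(X, X*)
v = np.linalg.solve(Ky, K_xs)       # (K + sigma_n^2 I)^{-1} K(X, X*),  N x M
cov_star = K_ss - K_sx @ v          # M x M predictive covariance
var_star = np.diag(cov_star)        # per-point predictive variances
```

For a non-zero prior mean, one would solve against $(\mathbf{y} - \mathbf{m})$ instead of $\mathbf{y}$ and add $\mathbf{m}_*$ to mean_star; the covariance computation is unchanged.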
Interpretation of the Predictive Equations
- Predictive Mean: The predictive mean $\mathbb{E}[\mathbf{f}_* \mid \dots]$ starts with the prior mean $\mathbf{m}_*$ and adds a correction term. This correction term is a linear combination of the centered observed targets $(\mathbf{y} - \mathbf{m})$. The weights of this combination depend on the covariances between the test points $X_*$ and the training points $X$, modulated by the inverse of the noisy covariance matrix of the training data. Intuitively, training points that are "closer" (as defined by the kernel $k$) to a test point $x_*$ have a stronger influence on its predicted value.
- Predictive Covariance: The predictive covariance $\mathrm{Cov}[\mathbf{f}_* \mid \dots]$ starts with the prior covariance $K(X_*, X_*)$ and subtracts a positive semi-definite matrix. This subtraction represents the reduction in uncertainty about $\mathbf{f}_*$ gained from observing the training data $\mathbf{y}$. The reduction is larger when the training points carry more information about the test points, i.e. when the entries of $K(X_*, X)$ are large because test and training points are strongly correlated under the kernel. The diagonal elements of this covariance matrix, $\mathrm{Var}[f_{*i} \mid \dots]$, give the predictive variance for each individual test point $x_{*i}$. This variance is typically smaller near the training data points and larger further away, reflecting increased uncertainty where we lack data.
The ability to obtain not just point predictions (the mean) but also a full measure of uncertainty (the variance/covariance) is a significant advantage of GP regression.
Figure: Example of 1D GP regression. The blue points are training data, the red line is the posterior predictive mean, and the shaded red area represents ±2 standard deviations from the mean, visualizing the model's uncertainty. Uncertainty is lower near the training points and increases further away.
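A plot like the one described can be produced directly from the quantities computed in the earlier sketches (the matplotlib details are illustrative):

```python
import matplotlib.pyplot as plt

std_star = np.sqrt(var_star)
plt.plot(X.ravel(), y, "o", color="blue", label="training data")
plt.plot(X_star.ravel(), mean_star, color="red", label="posterior mean")
plt.fill_between(X_star.ravel(),
                 mean_star - 2 * std_star,
                 mean_star + 2 * std_star,
                 color="red", alpha=0.2, label="mean ± 2 std. dev.")
plt.legend()
plt.show()
```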
Computational Considerations
The primary computational bottleneck in standard GP regression is the calculation of the inverse term $[K(X, X) + \sigma_n^2 I]^{-1}$. This involves inverting an $N \times N$ matrix, which typically requires $O(N^3)$ computational time and $O(N^2)$ memory to store the matrix $K$.
- Training/Inference: Computing the inverse (or, in practice, a Cholesky factorization) takes $O(N^3)$.
- Prediction: Once the inverse (or its Cholesky decomposition) is computed and stored, predicting the mean at $M$ test points takes $O(NM)$ time, computing the per-point predictive variances takes $O(N^2 M)$, and forming the full $M \times M$ predictive covariance costs an additional $O(NM^2)$, all after the initial $O(N^3)$ investment (see the sketch after this list).
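In practice the inverse is never formed explicitly: a Cholesky factorization of $K + \sigma_n^2 I$ is computed once at $O(N^3)$ cost and reused for every prediction. A sketch using scipy.linalg, continuing the variable names above, might look like this:

```python
from scipy.linalg import cho_factor, cho_solve

# One-time O(N^3) cost: factorize K + sigma_n^2 I
chol = cho_factor(Ky, lower=True)

# Reused for any number of test points
alpha = cho_solve(chol, y)                             # O(N^2)
mean_star = K_sx @ alpha                               # O(N M) for the predictive means

V = cho_solve(chol, K_xs)                              # O(N^2 M)
var_star = np.diag(K_ss) - np.sum(K_xs * V, axis=0)    # O(N M) for per-point variances
```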
This $O(N^3)$ complexity limits the applicability of exact GP regression to datasets with typically fewer than a few thousand data points. We will discuss methods to scale GPs to larger datasets later in this chapter.
Now that we have the core formulation for regression, the next steps involve understanding how to choose the kernel function $k(x, x')$ and how to set its parameters (hyperparameters) along with the noise variance $\sigma_n^2$.