Okay, let's transition from the abstract concept of Gaussian Processes as priors over functions to the concrete mechanics of performing regression. The core idea is elegant: if we assume our underlying function comes from a GP prior, and our observations are noisy versions of this function, then predicting the function's value at new points becomes a standard Bayesian inference problem involving multivariate Gaussians.
Recall that a Gaussian Process defines a distribution over functions. We usually specify it with a mean function $m(x)$ and a covariance function (kernel) $k(x, x')$. For simplicity, we often start by assuming a zero mean function, $m(x) = 0$.
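To make this concrete, here is a minimal sketch of one common choice, the squared-exponential (RBF) kernel, together with a zero mean function. The helper names (`rbf_kernel`, `zero_mean`) and the hyperparameters (`length_scale`, `signal_var`) are illustrative choices for this sketch, not fixed by the text.

```python
import numpy as np

def rbf_kernel(X1, X2, length_scale=1.0, signal_var=1.0):
    """Squared-exponential (RBF) covariance: k(x, x') = s^2 exp(-||x - x'||^2 / (2 l^2))."""
    # X1: (N, D), X2: (M, D) arrays of input points; returns an (N, M) covariance matrix.
    sq_dists = (np.sum(X1**2, axis=1)[:, None]
                + np.sum(X2**2, axis=1)[None, :]
                - 2 * X1 @ X2.T)
    return signal_var * np.exp(-0.5 * sq_dists / length_scale**2)

def zero_mean(X):
    """Zero mean function m(x) = 0 evaluated at every row of X."""
    return np.zeros(X.shape[0])
```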
In a regression setting, we have a training dataset consisting of $N$ input points $X = \{x_1, \dots, x_N\}$ and their corresponding observed target values $\mathbf{y} = [y_1, \dots, y_N]^T$. The fundamental assumption in GP regression is that these observed values $y_i$ are noisy realizations of an underlying latent function $f$ evaluated at the inputs $x_i$. Specifically, we assume:
$$y_i = f(x_i) + \epsilon_i$$
where $f \sim \mathcal{GP}(m(x), k(x, x'))$ and the noise term $\epsilon_i$ is typically assumed to be independent and identically distributed Gaussian noise with zero mean and variance $\sigma_n^2$:
$$\epsilon_i \sim \mathcal{N}(0, \sigma_n^2)$$
Let $\mathbf{f} = [f(x_1), \dots, f(x_N)]^T$ be the vector of latent function values at the training inputs. According to the GP prior, this vector follows a multivariate Gaussian distribution:
$$\mathbf{f} \sim \mathcal{N}(\mathbf{m}, K)$$
where $\mathbf{m}$ is the vector of prior means $[m(x_1), \dots, m(x_N)]^T$ and $K$ is the $N \times N$ covariance matrix (or Gram matrix) evaluated at all pairs of training inputs: $K_{ij} = k(x_i, x_j)$.
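Continuing the sketch above, the Gram matrix can be assembled by evaluating the kernel at all pairs of training inputs; the toy inputs and the small diagonal jitter below are illustrative assumptions, not part of the formulation.

```python
# Toy 1D training inputs as a column of shape (N, 1).
X_train = np.linspace(-3, 3, 7).reshape(-1, 1)
K = rbf_kernel(X_train, X_train)          # Gram matrix, K_ij = k(x_i, x_j), shape (N, N)

# Draw a few latent vectors f ~ N(0, K) from the zero-mean prior.
# The small jitter keeps the covariance numerically positive definite.
jitter = 1e-8 * np.eye(len(X_train))
prior_samples = np.random.multivariate_normal(
    mean=zero_mean(X_train), cov=K + jitter, size=3
)
```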
Combining the GP prior on $\mathbf{f}$ and the Gaussian noise model, the marginal distribution of the observed targets $\mathbf{y}$ also becomes Gaussian:
$$\mathbf{y} \mid X \sim \mathcal{N}(\mathbf{m}, K + \sigma_n^2 I)$$
Here, $I$ is the $N \times N$ identity matrix, and the noise variance $\sigma_n^2$ is added to the diagonal elements of the covariance matrix $K$, reflecting the independent noise added to each observation. This distribution $p(\mathbf{y} \mid X)$ is often called the marginal likelihood or evidence.
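Because $p(\mathbf{y} \mid X)$ is an ordinary multivariate Gaussian, its log density can be evaluated directly. A short sketch, with an assumed noise level and toy targets purely for illustration:

```python
from scipy.stats import multivariate_normal

sigma_n = 0.1                              # assumed noise standard deviation (illustrative)
y_train = np.sin(X_train).ravel()          # toy targets, just for illustration

# p(y | X) = N(y; m, K + sigma_n^2 I): the marginal likelihood (evidence).
cov_y = K + sigma_n**2 * np.eye(len(X_train))
log_evidence = multivariate_normal(mean=zero_mean(X_train), cov=cov_y).logpdf(y_train)
```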
Our goal is to predict the function value $f_*$ at a new test input point $x_*$. We might also want to predict at multiple test points $X_* = \{x_{*1}, \dots, x_{*M}\}$. Let $\mathbf{f}_* = [f(x_{*1}), \dots, f(x_{*M})]^T$ be the vector of latent function values at these test points.
The core property of GPs is that any finite set of function values drawn from a GP follows a multivariate Gaussian distribution. Therefore, the joint distribution of the observed training targets $\mathbf{y}$ and the latent function values $\mathbf{f}_*$ at the test points, conditioned on the inputs $X$ and $X_*$, is also Gaussian:
$$\begin{pmatrix} \mathbf{y} \\ \mathbf{f}_* \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} \mathbf{m} \\ \mathbf{m}_* \end{pmatrix}, \begin{pmatrix} K(X, X) + \sigma_n^2 I & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) \end{pmatrix} \right)$$
Let's break down the covariance matrix components:

- $K(X, X)$ is the $N \times N$ covariance matrix between the training inputs (the Gram matrix $K$ from before).
- $K(X, X_*)$ is the $N \times M$ cross-covariance matrix between the training and test inputs.
- $K(X_*, X) = K(X, X_*)^T$ is the corresponding $M \times N$ cross-covariance matrix.
- $K(X_*, X_*)$ is the $M \times M$ covariance matrix between the test inputs.

And the mean vectors:

- $\mathbf{m} = [m(x_1), \dots, m(x_N)]^T$ contains the prior means at the training inputs.
- $\mathbf{m}_* = [m(x_{*1}), \dots, m(x_{*M})]^T$ contains the prior means at the test inputs.
We are interested in the posterior predictive distribution $p(\mathbf{f}_* \mid X, \mathbf{y}, X_*)$, which tells us the distribution of function values at the test points given the observed training data. Since the joint distribution is Gaussian, this conditional distribution is also Gaussian.
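For reference, the standard Gaussian conditioning identity being applied is the following: if two jointly Gaussian vectors satisfy

$$\begin{pmatrix} \mathbf{a} \\ \mathbf{b} \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} \boldsymbol{\mu}_a \\ \boldsymbol{\mu}_b \end{pmatrix}, \begin{pmatrix} A & C \\ C^T & B \end{pmatrix} \right),$$

then

$$\mathbf{b} \mid \mathbf{a} \sim \mathcal{N}\left( \boldsymbol{\mu}_b + C^T A^{-1} (\mathbf{a} - \boldsymbol{\mu}_a),\; B - C^T A^{-1} C \right).$$

In our case, $\mathbf{a}$ plays the role of $\mathbf{y}$ and $\mathbf{b}$ the role of $\mathbf{f}_*$.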
Applying this standard result for conditional Gaussian distributions, we can derive the mean and covariance of the predictive distribution. Assuming a zero prior mean function ($m(x) = 0$, and thus $\mathbf{m} = \mathbf{0}$ and $\mathbf{m}_* = \mathbf{0}$) for simplicity first:
The posterior predictive mean is:
$$\mathbb{E}[\mathbf{f}_* \mid X, \mathbf{y}, X_*] = K(X_*, X)\,[K(X, X) + \sigma_n^2 I]^{-1}\,\mathbf{y}$$
The posterior predictive covariance is:
$$\mathrm{Cov}[\mathbf{f}_* \mid X, \mathbf{y}, X_*] = K(X_*, X_*) - K(X_*, X)\,[K(X, X) + \sigma_n^2 I]^{-1}\,K(X, X_*)$$
If we use a non-zero prior mean function $m(x)$, the equations become slightly more general:
Posterior Predictive Mean:
$$\mathbb{E}[\mathbf{f}_* \mid X, \mathbf{y}, X_*] = \mathbf{m}_* + K(X_*, X)\,[K(X, X) + \sigma_n^2 I]^{-1}\,(\mathbf{y} - \mathbf{m})$$
Posterior Predictive Covariance:
$$\mathrm{Cov}[\mathbf{f}_* \mid X, \mathbf{y}, X_*] = K(X_*, X_*) - K(X_*, X)\,[K(X, X) + \sigma_n^2 I]^{-1}\,K(X, X_*)$$
Notice that the posterior covariance does not depend on the observed targets $\mathbf{y}$, only on the inputs $X$ and $X_*$, the kernel $k$, and the noise variance $\sigma_n^2$.
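A minimal sketch of these predictive equations for the zero-mean case, reusing the `rbf_kernel` helper and the illustrative data from the earlier snippets; the linear systems are solved directly rather than forming the inverse explicitly. The function name `gp_predict` is an assumption of this sketch.

```python
def gp_predict(X_train, y_train, X_test, sigma_n, length_scale=1.0, signal_var=1.0):
    """Zero-mean GP regression: posterior predictive mean and covariance at X_test."""
    K    = rbf_kernel(X_train, X_train, length_scale, signal_var)   # K(X, X)
    K_s  = rbf_kernel(X_train, X_test,  length_scale, signal_var)   # K(X, X*)
    K_ss = rbf_kernel(X_test,  X_test,  length_scale, signal_var)   # K(X*, X*)

    A = K + sigma_n**2 * np.eye(len(X_train))                       # K(X, X) + sigma_n^2 I
    # Solve A alpha = y and A V = K(X, X*) instead of computing A^{-1} explicitly.
    alpha = np.linalg.solve(A, y_train)
    V = np.linalg.solve(A, K_s)

    mean = K_s.T @ alpha                 # K(X*, X) [K(X, X) + sigma_n^2 I]^{-1} y
    cov  = K_ss - K_s.T @ V              # K(X*, X*) - K(X*, X) [...]^{-1} K(X, X*)
    return mean, cov

# Example usage with the toy data defined above:
X_test = np.linspace(-4, 4, 100).reshape(-1, 1)
mu_star, cov_star = gp_predict(X_train, y_train, X_test, sigma_n)
std_star = np.sqrt(np.diag(cov_star))    # pointwise predictive standard deviation
```

The diagonal of the returned covariance gives the pointwise predictive variance, which is what is plotted as an uncertainty band in figures like the one below.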
The ability to obtain not just point predictions (the mean) but also a full measure of uncertainty (the variance/covariance) is a significant advantage of GP regression.
{"data": [{"x": [-3, -2, -1, 0, 1, 2, 3], "y": [-0.5, 0, 0.8, 1, 0.7, 0.2, -0.3], "mode": "markers", "type": "scatter", "name": "Training Data", "marker": {"color": "#1c7ed6", "size": 8}}, {"x": [-4, -3.5, -3, -2.5, -2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4], "y": [-0.8, -0.65, -0.5, 0.1, 0.0, 0.5, 0.8, 0.95, 1.0, 0.85, 0.7, 0.4, 0.2, 0.0, -0.3, -0.4, -0.6], "mode": "lines", "type": "scatter", "name": "Predictive Mean", "line": {"color": "#f03e3e", "width": 2}}, {"x": [-4, -3.5, -3, -2.5, -2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4, 3.5, 3, 2.5, 2, 1.5, 1, 0.5, 0, -0.5, -1, -1.5, -2, -2.5, -3, -3.5, -4], "y": [-0.8+0.8, -0.65+0.5, -0.5+0.2, 0.1+0.1, 0.0+0.05, 0.5+0.05, 0.8+0.05, 0.95+0.05, 1.0+0.05, 0.85+0.05, 0.7+0.05, 0.4+0.05, 0.2+0.05, 0.0+0.1, -0.3+0.2, -0.4+0.5, -0.6+0.8, -0.6-0.8, -0.4-0.5, -0.3-0.2, 0.0-0.1, 0.2-0.05, 0.4-0.05, 0.7-0.05, 0.85-0.05, 1.0-0.05, 0.95-0.05, 0.8-0.05, 0.5-0.05, 0.0-0.05, 0.1-0.1, -0.5-0.2, -0.65-0.5, -0.8-0.8], "mode": "lines", "type": "scatter", "name": "Predictive Uncertainty (Mean +/- 2*StdDev)", "fill": "toself", "fillcolor": "rgba(250, 82, 82, 0.2)", "line": {"color": "rgba(255, 255, 255, 0)"}}], "layout": {"title": "Gaussian Process Regression Example", "xaxis": {"title": "Input (x)"}, "yaxis": {"title": "Output (y)"}, "legend": {"yanchor": "top", "y": 0.99, "xanchor": "left", "x": 0.01}, "margin": {"l": 40, "r": 10, "t": 40, "b": 40}}}
Example of 1D GP regression. The blue points are training data. The red line is the posterior predictive mean. The shaded red area represents +/- two standard deviations from the mean, visualizing the model's uncertainty. Uncertainty is lower near the training points and increases further away.
The primary computational bottleneck in standard GP regression is the calculation of the inverse term $[K(X, X) + \sigma_n^2 I]^{-1}$. This involves inverting an $N \times N$ matrix, which typically requires $\mathcal{O}(N^3)$ computational time and $\mathcal{O}(N^2)$ memory to store the matrix $K$.
This $\mathcal{O}(N^3)$ complexity limits the applicability of exact GP regression to datasets with typically fewer than a few thousand data points. We will discuss methods to scale GPs to larger datasets later in this chapter.
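In practice the inverse is rarely formed explicitly: a Cholesky factorization of $K(X, X) + \sigma_n^2 I$ is the usual, more numerically stable route, with the same $\mathcal{O}(N^3)$ cost for the factorization and $\mathcal{O}(N^2)$ per subsequent solve. A sketch of the same prediction using SciPy; the function name `gp_predict_chol` is again an illustrative assumption.

```python
from scipy.linalg import cho_factor, cho_solve

def gp_predict_chol(X_train, y_train, X_test, sigma_n, length_scale=1.0, signal_var=1.0):
    """Same predictive equations as gp_predict, but via a Cholesky factorization."""
    K    = rbf_kernel(X_train, X_train, length_scale, signal_var)
    K_s  = rbf_kernel(X_train, X_test,  length_scale, signal_var)
    K_ss = rbf_kernel(X_test,  X_test,  length_scale, signal_var)

    # O(N^3) factorization of K(X, X) + sigma_n^2 I; each subsequent solve is O(N^2).
    L = cho_factor(K + sigma_n**2 * np.eye(len(X_train)), lower=True)
    alpha = cho_solve(L, y_train)
    V = cho_solve(L, K_s)

    return K_s.T @ alpha, K_ss - K_s.T @ V
```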
Now that we have the core formulation for regression, the next steps involve understanding how to choose the kernel function $k(x, x')$ and how to set its parameters (hyperparameters) along with the noise variance $\sigma_n^2$.