While Gaussian Processes (GPs) provide an elegant framework for regression tasks with built-in uncertainty quantification, applying them to classification requires navigating a significant hurdle: the likelihood function is no longer Gaussian. In binary classification, we typically use a Bernoulli likelihood (labels taking values 0 or 1), and for multi-class problems, a categorical likelihood. These likelihoods break the convenient conjugacy between Gaussian priors and Gaussian likelihoods that makes GP regression analytically tractable.
The standard way to tackle GP classification is by introducing a latent function, f(x), which follows a Gaussian Process prior:
$$ f(x) \sim \mathcal{GP}\big(m(x), k(x, x')\big) $$

Here, m(x) is the mean function (often set to zero) and k(x,x′) is the covariance (kernel) function, just as in GP regression.
Instead of this latent function directly representing the output, it's passed through a link function (or inverse link function in GLM terminology, often called a "squashing" function here) to produce probabilities for the class labels. For binary classification (y∈{0,1}), a common choice is the logistic sigmoid function, σ(z)=1/(1+e−z):
$$ p(y = 1 \mid f(x)) = \sigma(f(x)) $$

Alternatively, the probit link function (the cumulative distribution function of a standard normal distribution) can be used.
The full model involves the GP prior over the latent function f and the non-Gaussian likelihood for the observed labels y given the latent function values $\mathbf{f} = [f(x_1), \ldots, f(x_N)]^T$ at the training points X:
$$ p(\mathbf{y} \mid \mathbf{f}) = \prod_{i=1}^{N} p(y_i \mid f(x_i)) $$

For the logistic sigmoid link, this likelihood is a product of Bernoulli probabilities: $p(y_i \mid f(x_i)) = \sigma(f(x_i))^{y_i} \, \big(1 - \sigma(f(x_i))\big)^{1 - y_i}$.
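As a concrete illustration, here is a minimal NumPy sketch of evaluating this Bernoulli log-likelihood for a vector of latent values. The function names are illustrative, and the terms are written with `np.logaddexp` for numerical stability:

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def bernoulli_log_likelihood(y, f):
    """log p(y | f) = sum_i [ y_i * log sigma(f_i) + (1 - y_i) * log(1 - sigma(f_i)) ].

    Uses log sigma(f) = -logaddexp(0, -f) and log(1 - sigma(f)) = -logaddexp(0, f)
    to avoid overflow for large |f|.
    """
    y = np.asarray(y, dtype=float)   # labels in {0, 1}
    f = np.asarray(f, dtype=float)   # latent function values at the training inputs
    return np.sum(-y * np.logaddexp(0.0, -f) - (1.0 - y) * np.logaddexp(0.0, f))
```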
The core difficulty arises when we try to compute the posterior distribution over the latent function, p(f∣y,X). According to Bayes' theorem:
$$ p(\mathbf{f} \mid \mathbf{y}, X) = \frac{p(\mathbf{y} \mid \mathbf{f}) \, p(\mathbf{f} \mid X)}{p(\mathbf{y} \mid X)} $$

Here, p(f∣X) is the GP prior (a multivariate Gaussian), but p(y∣f) is non-Gaussian (e.g., a product of Bernoullis). The product of a Gaussian and a non-Gaussian term generally does not result in a Gaussian distribution. This means the posterior p(f∣y,X) is non-Gaussian and intractable to compute analytically.
Furthermore, the marginal likelihood (or evidence), p(y∣X), which is essential for model comparison and hyperparameter optimization, also becomes intractable:
$$ p(\mathbf{y} \mid X) = \int p(\mathbf{y} \mid \mathbf{f}) \, p(\mathbf{f} \mid X) \, d\mathbf{f} $$

This integral involves integrating a non-Gaussian likelihood against a high-dimensional Gaussian prior, which usually lacks a closed-form solution.
To perform inference and make predictions, we need to approximate the intractable posterior p(f∣y,X). The most common techniques are the Laplace Approximation and Expectation Propagation (EP).
The Laplace approximation finds the mode of the posterior distribution and then fits a Gaussian distribution centered at that mode. The covariance of this Gaussian is determined by the curvature of the log posterior at the mode.
Find the Mode: We maximize the log posterior probability with respect to f:
$$ \log p(\mathbf{f} \mid \mathbf{y}, X) = \log p(\mathbf{y} \mid \mathbf{f}) + \log p(\mathbf{f} \mid X) - \log p(\mathbf{y} \mid X) $$

Since log p(y∣X) is constant with respect to f, we maximize $\Psi(\mathbf{f}) = \log p(\mathbf{y} \mid \mathbf{f}) + \log p(\mathbf{f} \mid X)$. This is equivalent to finding the Maximum A Posteriori (MAP) estimate of f, denoted $\mathbf{f}_{\text{MAP}}$. Because log p(f∣X) is quadratic in f (from the Gaussian prior) and log p(y∣f) is typically concave (e.g., for logistic/probit likelihoods), this optimization problem is often convex and can be solved efficiently using methods like Newton-Raphson.
Fit Gaussian: We approximate the posterior with a Gaussian $q(\mathbf{f} \mid \mathbf{y}, X) = \mathcal{N}(\mathbf{f} \mid \boldsymbol{\mu}_{LA}, \Sigma_{LA})$, where the mean is the mode, $\boldsymbol{\mu}_{LA} = \mathbf{f}_{\text{MAP}}$, and the covariance is the inverse of the negative Hessian of the log posterior at the mode, $\Sigma_{LA} = (K^{-1} + W)^{-1}$. Here K is the kernel matrix over the training inputs and $W = -\nabla\nabla \log p(\mathbf{y} \mid \mathbf{f})\big|_{\mathbf{f} = \mathbf{f}_{\text{MAP}}}$, which is diagonal because the likelihood factorizes over data points.
The Laplace approximation provides a Gaussian approximation q(f) to the true posterior p(f∣y,X), making subsequent calculations tractable. However, its accuracy depends on how well the true posterior is represented by a Gaussian centered at its mode. If the posterior is significantly skewed or multi-modal, the approximation might be poor.
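To make the two steps concrete, here is a minimal NumPy sketch of the Laplace approximation for the logistic likelihood, assuming the kernel matrix K over the training inputs has already been computed. It uses a plain Newton iteration and explicit inverses for readability; a production implementation would use the numerically stabler Cholesky-based formulation found in standard references (e.g., Rasmussen and Williams' GPML book). Function and variable names are illustrative:

```python
import numpy as np

def laplace_approximation(K, y, n_iter=20, jitter=1e-8):
    """Gaussian (Laplace) approximation to p(f | y, X) for binary GP classification
    with a logistic likelihood.

    K : (N, N) kernel matrix at the training inputs
    y : (N,) labels in {0, 1}
    Returns the mode f_map, the Laplace covariance (K^{-1} + W)^{-1}, and W.
    """
    N = K.shape[0]
    K = K + jitter * np.eye(N)              # small jitter for numerical stability
    K_inv = np.linalg.inv(K)
    f = np.zeros(N)                         # start the Newton search at the prior mean
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-f))       # sigma(f_i)
        grad_loglik = y - pi                # gradient of log p(y | f)
        W = np.diag(pi * (1.0 - pi))        # negative Hessian of log p(y | f), diagonal
        # Newton step on Psi(f): f <- f + (K^{-1} + W)^{-1} (grad_loglik - K^{-1} f)
        f = f + np.linalg.solve(K_inv + W, grad_loglik - K_inv @ f)
    pi = 1.0 / (1.0 + np.exp(-f))
    W = np.diag(pi * (1.0 - pi))
    Sigma_la = np.linalg.inv(K_inv + W)     # Laplace covariance at the mode
    return f, Sigma_la, W
```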
Expectation Propagation (EP) offers an alternative, often more accurate, approach. It approximates the posterior p(f∣y,X) with a distribution q(f) from a chosen family (typically Gaussian) by minimizing the KL divergence KL(p∣∣q) locally, matching moments for one likelihood term at a time rather than fitting the posterior in a single global step.
EP approximates the full posterior by iteratively refining an approximation based on incorporating the likelihood terms one by one. It tries to match certain moments (usually the mean and variance) of the approximate marginal posterior distributions with those of hypothetical distributions obtained by combining the prior with individual likelihood terms.
EP generally provides a more flexible and often more accurate approximation than Laplace, particularly when the true posterior deviates significantly from a simple Gaussian shape around the mode. However, EP is computationally more intensive, its convergence is not always guaranteed, and the moment-matching integrals can be challenging.
Once we have a Gaussian approximation to the posterior over the latent values at the training points, $q(\mathbf{f} \mid \mathbf{y}, X) = \mathcal{N}(\boldsymbol{\mu}_{\text{approx}}, \Sigma_{\text{approx}})$ (where "approx" denotes LA or EP), we can derive the approximate predictive distribution for the latent function value f∗ at a new test point x∗.
Similar to GP regression, the joint distribution of f and f∗ under the prior is Gaussian. Using the properties of conditioning Gaussian distributions, the approximate predictive distribution q(f∗∣y,X) is also Gaussian:
$$ q(f_* \mid \mathbf{y}, X) = \mathcal{N}(f_* \mid \mu_*, \sigma_*^2) $$

where μ∗ and σ∗² depend on the approximate posterior mean $\boldsymbol{\mu}_{\text{approx}}$, covariance $\Sigma_{\text{approx}}$, the kernel evaluations involving x∗, and the training data X.
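Concretely, standard Gaussian conditioning gives $\mu_* = \mathbf{k}_*^T K^{-1} \boldsymbol{\mu}_{\text{approx}}$ and $\sigma_*^2 = k(x_*, x_*) - \mathbf{k}_*^T K^{-1} \mathbf{k}_* + \mathbf{k}_*^T K^{-1} \Sigma_{\text{approx}} K^{-1} \mathbf{k}_*$, where $\mathbf{k}_*$ collects the kernel evaluations between x∗ and the training inputs. A small sketch, assuming the kernel quantities and the approximate posterior moments (e.g., from the Laplace sketch above) are already available; names are illustrative:

```python
import numpy as np

def latent_predictive(k_star, k_star_star, K, mu_approx, Sigma_approx, jitter=1e-8):
    """Mean and variance of the approximate latent predictive q(f_* | y, X).

    k_star       : (N,) kernel evaluations k(x_*, x_i) against the training inputs
    k_star_star  : scalar k(x_*, x_*)
    K            : (N, N) kernel matrix at the training inputs
    mu_approx    : mean of the Gaussian posterior approximation (Laplace or EP)
    Sigma_approx : covariance of the Gaussian posterior approximation
    """
    N = K.shape[0]
    K_inv = np.linalg.inv(K + jitter * np.eye(N))
    a = K_inv @ k_star                      # K^{-1} k_*
    mu_star = a @ mu_approx                 # k_*^T K^{-1} mu_approx
    # prior conditional variance plus the extra variance from posterior uncertainty in f
    var_star = k_star_star - k_star @ a + a @ Sigma_approx @ a
    return mu_star, var_star
```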
To get the final predictive probability for the class label y∗=1, we need to average the likelihood σ(f∗) over this approximate predictive distribution:
$$ p(y_* = 1 \mid x_*, \mathbf{y}, X) \approx \int \sigma(f_*) \, q(f_* \mid \mathbf{y}, X) \, df_* $$

This integral is still often intractable (e.g., integrating the logistic sigmoid against a Gaussian). Common ways to compute it include:

- One-dimensional numerical quadrature (e.g., Gauss-Hermite), which is cheap and accurate since f∗ is scalar; see the sketch below.
- Monte Carlo sampling from $q(f_* \mid \mathbf{y}, X)$.
- Analytical results: with a probit link the integral has an exact closed form, and for the logistic sigmoid a scaled-probit approximation gives a simple closed-form estimate.
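As an example of the quadrature option, here is a short Gauss-Hermite sketch, assuming the latent predictive moments `mu_star` and `var_star` have already been computed (e.g., by the previous sketch):

```python
import numpy as np

def predict_proba_class1(mu_star, var_star, n_points=32):
    """Approximate the integral of sigma(f_*) against N(f_* | mu_star, var_star)
    with Gauss-Hermite quadrature, via the change of variables f_* = mu + sqrt(2)*sigma*x."""
    x, w = np.polynomial.hermite.hermgauss(n_points)   # nodes/weights for weight exp(-x^2)
    f = mu_star + np.sqrt(2.0 * var_star) * x
    return np.sum(w / (1.0 + np.exp(-f))) / np.sqrt(np.pi)
```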
The resulting predictive probability incorporates the uncertainty captured by the approximate posterior distribution of f∗.
Example visualization of a GP classification result in 1D. The dashed line shows the mean of the approximate posterior latent function f(x). The shaded region represents the uncertainty (e.g., +/- 2 standard deviations) in f(x). Data points from two classes are shown. The probability of a point belonging to Class 1 is estimated by applying the sigmoid function to f(x). The decision boundary occurs where the mean latent function crosses zero (or where the sigmoid output is 0.5). Uncertainty in f(x) translates to uncertainty in the classification, especially near the decision boundary.
Just like in GP regression, the kernel hyperparameters (e.g., lengthscale, variance) and any parameters of the likelihood function need to be determined. This is typically done by maximizing an approximation to the log marginal likelihood logp(y∣X,θ), where θ denotes the hyperparameters. Both the Laplace approximation and EP provide estimates of this quantity as byproducts of their approximation process. Gradient-based optimization methods are commonly used to find the hyperparameters θ that maximize this approximate marginal likelihood.
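In practice this whole pipeline is usually handled by a library. As one illustration, scikit-learn's `GaussianProcessClassifier` uses a Laplace approximation internally and, by default, maximizes the approximate log marginal likelihood over the kernel hyperparameters during `fit()`. A minimal usage sketch on synthetic data (the data-generating function here is arbitrary):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

# Synthetic 1D binary classification data.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = (np.sin(X[:, 0]) + 0.3 * rng.standard_normal(100) > 0).astype(int)

# The kernel hyperparameters (variance, lengthscale) are optimized during fit()
# by maximizing the Laplace-approximated log marginal likelihood.
kernel = ConstantKernel(1.0) * RBF(length_scale=1.0)
gpc = GaussianProcessClassifier(kernel=kernel, n_restarts_optimizer=3).fit(X, y)

print("Optimized kernel:", gpc.kernel_)
print("Approx. log marginal likelihood:", gpc.log_marginal_likelihood(gpc.kernel_.theta))

X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
print("p(y=1) at test points:", gpc.predict_proba(X_test)[:, 1])
```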
In summary, GP classification extends the core ideas of GPs to scenarios with non-Gaussian likelihoods by introducing a latent function and employing approximation techniques like Laplace or EP to handle the resulting analytical intractability. These methods allow us to obtain approximate posterior distributions and predictive probabilities, retaining the Bayesian approach's ability to quantify uncertainty even in classification settings. The choice between Laplace and EP often involves a trade-off between computational cost and the accuracy of the posterior approximation.