In many machine learning models you've encountered, like linear regression or standard neural networks, we define a model structure with a fixed set of parameters (weights, biases). Bayesian approaches typically involve placing prior distributions over these parameters. For instance, in Bayesian linear regression, we might assume $y = \mathbf{w}^T \mathbf{x} + \epsilon$ and place a Gaussian prior on the weight vector $\mathbf{w}$. Our uncertainty is about the values of these parameters.
Gaussian Processes (GPs) offer a different perspective. Instead of defining priors over parameters of a specific functional form, GPs allow us to define a prior distribution directly over functions themselves. Think about it: what if we could model the underlying function $f(x)$ generating our data as a random variable drawn from some distribution? This is precisely what GPs enable.
You're familiar with multivariate Gaussian distributions, which describe the probability distribution of a finite-dimensional vector $\mathbf{f} = [f_1, f_2, \dots, f_n]^T$. A multivariate Gaussian is characterized by a mean vector $\boldsymbol{\mu}$ and a covariance matrix $\Sigma$:

$$\mathbf{f} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)$$

The covariance matrix $\Sigma$, where $\Sigma_{ij} = \mathrm{Cov}(f_i, f_j)$, captures the relationships between the different elements of the vector $\mathbf{f}$.
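As a quick refresher, here is a minimal NumPy sketch of this idea. The mean vector and covariance matrix below are illustrative values chosen for this example, not anything prescribed by the text:

```python
import numpy as np

# Illustrative 3-dimensional Gaussian: a mean vector and a valid covariance matrix.
mu = np.array([0.0, 1.0, -0.5])
Sigma = np.array([
    [1.0, 0.8, 0.3],
    [0.8, 1.0, 0.5],
    [0.3, 0.5, 1.0],
])

# Each row of `samples` is one draw f = [f_1, f_2, f_3] from N(mu, Sigma).
rng = np.random.default_rng(seed=0)
samples = rng.multivariate_normal(mean=mu, cov=Sigma, size=5)
print(samples.shape)  # (5, 3)
```

The off-diagonal covariance entries control how strongly the components move together. A GP extends exactly this idea to function values at arbitrarily many input points.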
Now, imagine we want to define a distribution over functions $f(x)$, where $x$ can be any point in some input domain (potentially continuous and infinite). A function can be thought of as an infinitely long vector, where each element corresponds to the function's value $f(x)$ at a specific input $x$. How can we extend the idea of a multivariate Gaussian to this infinite-dimensional setting?
This leads us to the definition of a Gaussian Process.
A Gaussian Process is formally defined as a collection of random variables, any finite subset of which has a joint Gaussian distribution.
Let $f(x)$ represent the value of the random function at input $x$. A Gaussian Process prior on $f(x)$ is specified by two functions:

- A mean function $m(x) = \mathbb{E}[f(x)]$, giving the expected function value at each input.
- A covariance function (or kernel) $k(x, x') = \mathbb{E}[(f(x) - m(x))(f(x') - m(x'))]$, giving the covariance between the function values at any pair of inputs.
We denote a function $f$ drawn from a GP as:

$$f(x) \sim \mathcal{GP}(m(x), k(x, x'))$$
The core implication of the definition is this: if we choose any finite set of input points $\mathbf{x} = [x_1, x_2, \dots, x_n]^T$, the corresponding vector of function values $\mathbf{f} = [f(x_1), f(x_2), \dots, f(x_n)]^T$ will be drawn from a multivariate Gaussian distribution:

$$\mathbf{f} \sim \mathcal{N}(\boldsymbol{\mu}, K)$$

where:

- the mean vector has entries $\mu_i = m(x_i)$, and
- the covariance matrix has entries $K_{ij} = k(x_i, x_j)$.
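To make the definition concrete, here is the $n = 2$ case written out explicitly:

$$\begin{bmatrix} f(x_1) \\ f(x_2) \end{bmatrix} \sim \mathcal{N}\!\left( \begin{bmatrix} m(x_1) \\ m(x_2) \end{bmatrix}, \begin{bmatrix} k(x_1, x_1) & k(x_1, x_2) \\ k(x_2, x_1) & k(x_2, x_2) \end{bmatrix} \right)$$

The same form holds for every choice of finite input set, and these finite-dimensional Gaussians are mutually consistent, which is what makes the "distribution over functions" well defined.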
The mean function $m(x)$ represents our prior belief about the average shape of the function before observing any data. In practice, it's common to assume a zero mean function, $m(x) = 0$, especially after standardizing the data. This simplifies calculations and places the emphasis on the covariance function to capture the function's structure.
The covariance function $k(x, x')$ is where the real modeling power lies. It defines the properties of the functions drawn from the GP prior by dictating how similar the function values are expected to be at different input points.
The choice of kernel encodes our assumptions about the function's characteristics, such as smoothness, periodicity, or stationarity. We will examine different kernel functions in detail in the next section.
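As a small preview before that section, here is a sketch of the widely used RBF (squared exponential) kernel with an illustrative unit length scale. Notice how the covariance it assigns decays as inputs move apart, which is what produces smooth sample functions:

```python
import numpy as np

def rbf_kernel(x, x_prime, length_scale=1.0):
    """RBF (squared exponential) kernel; the length scale here is an illustrative choice."""
    return np.exp(-0.5 * (x - x_prime) ** 2 / length_scale ** 2)

# Covariance between function values at nearby vs. distant inputs.
print(rbf_kernel(0.0, 0.1))  # ~0.995: nearby inputs -> strongly correlated values
print(rbf_kernel(0.0, 3.0))  # ~0.011: distant inputs -> nearly independent values
```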
To get a feel for what this "distribution over functions" looks like, we can draw sample functions from a GP prior. We do this by:

1. Choosing a finite grid of test inputs $x_1, \dots, x_n$ covering the region of interest.
2. Evaluating the mean vector $\mu_i = m(x_i)$ and covariance matrix $K_{ij} = k(x_i, x_j)$ on that grid.
3. Drawing samples $\mathbf{f} \sim \mathcal{N}(\boldsymbol{\mu}, K)$ from the resulting multivariate Gaussian and plotting each sample against the inputs.

A minimal sketch of this procedure appears below.
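Here is one way to implement these steps with NumPy and Matplotlib, assuming a zero mean function and an RBF kernel with unit length scale (both are illustrative choices consistent with the plot described next):

```python
import numpy as np
import matplotlib.pyplot as plt

def rbf_kernel(x1, x2, length_scale=1.0):
    """RBF kernel evaluated between all pairs of points in x1 and x2."""
    sq_dists = (x1[:, None] - x2[None, :]) ** 2
    return np.exp(-0.5 * sq_dists / length_scale ** 2)

# Step 1: a grid of test inputs.
x = np.linspace(-5.0, 5.0, 200)

# Step 2: mean vector (zero) and covariance matrix from the kernel.
mu = np.zeros_like(x)
K = rbf_kernel(x, x)

# Step 3: draw several functions from N(mu, K).
# A small jitter on the diagonal keeps the covariance numerically positive definite.
rng = np.random.default_rng(seed=1)
samples = rng.multivariate_normal(mu, K + 1e-8 * np.eye(len(x)), size=5)

for f in samples:
    plt.plot(x, f)
plt.title("Samples from a GP prior (zero mean, RBF kernel)")
plt.xlabel("x")
plt.ylabel("f(x)")
plt.show()
```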
Each line represents a single function drawn from the specified GP prior distribution. Notice how the functions exhibit smoothness, a property encoded by the chosen RBF kernel.
In essence, the GP defines a probability distribution over an infinite-dimensional function space. This prior captures our beliefs about the function before we see any data points. When we combine this prior with observed data (using Bayes' theorem, as we'll see in GP regression), we obtain a posterior distribution, which is also a Gaussian Process, but one that is updated to reflect the information from the data. This posterior GP allows us to make predictions about the function's value at new points, complete with uncertainty estimates.