Most machine learning models you've encountered so far likely fall under the umbrella of parametric models. Think of linear regression, logistic regression, or even standard deep neural networks. These models assume a specific functional form with a fixed, finite number of parameters (like weights β or network weights W). We use the data to estimate the values of these parameters, typically through optimization (like maximum likelihood estimation) or Bayesian inference (finding the posterior p(θ∣D)). The core assumption is that once these parameters are determined, they capture everything we need to know about the relationship between inputs and outputs, constrained by the chosen model structure.
However, what happens when the underlying process generating the data is more complex than our fixed parametric form allows? Forcing data into a rigid model structure can lead to underfitting or biased conclusions. We might not know the "right" complexity beforehand. This is where non-parametric methods come into play.
The term "non-parametric" can be slightly misleading. It usually doesn't mean having no parameters at all. Instead, it refers to models in which the number of effective parameters is not fixed a priori but can grow and adapt as more data becomes available. These models offer greater flexibility, allowing the model's complexity itself to be inferred from the data.
In the Bayesian framework, this translates to defining priors not just over a finite set of parameters, but over more complex, potentially infinite-dimensional objects. Instead of a prior p(θ) where θ∈Rd, we might consider priors over functions, partitions, or measures. This is the domain of Bayesian non-parametric modeling.
Gaussian Processes, the focus of this chapter, are a cornerstone of Bayesian non-parametrics, particularly for regression and classification tasks. Instead of assuming a specific parametric function like f(x)=β0+β1x1+...+βpxp and placing priors on the βi coefficients, a Gaussian Process defines a prior directly over the space of possible functions f(x). It assumes that the function values f(x1),f(x2),...,f(xn) at any finite set of input points x1,...,xn follow a multivariate Gaussian distribution.
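To make that finite-dimensional definition concrete, here is a minimal NumPy sketch; the squared exponential (RBF) kernel, its lengthscale, and the input grid are illustrative assumptions rather than choices made in this chapter. It builds a covariance matrix from the kernel and draws joint function values (f(x1), ..., f(xn)) from the corresponding zero-mean multivariate Gaussian, so each draw is one plausible function under the GP prior.

```python
# Minimal sketch: function values at a finite set of inputs under a GP prior
# follow a multivariate Gaussian. Kernel choice and lengthscale are illustrative.
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0, variance=1.0):
    """Squared exponential covariance between two sets of 1-D inputs."""
    sq_dists = (x1[:, None] - x2[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dists / lengthscale**2)

rng = np.random.default_rng(0)
x = np.linspace(-5, 5, 100)           # a finite set of input points x1, ..., xn
K = rbf_kernel(x, x)                  # covariance matrix built from the kernel
K += 1e-8 * np.eye(len(x))            # small jitter for numerical stability

# Each sample is one draw of (f(x1), ..., f(xn)) ~ N(0, K),
# i.e. one plausible function under the GP prior evaluated at x.
prior_samples = rng.multivariate_normal(np.zeros(len(x)), K, size=3)
```

Changing the kernel (or its hyperparameters) changes the covariance matrix K, and with it the kinds of functions the prior favors; that is the sense in which the kernel shapes the prior over functions.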
This function-centric view provides remarkable flexibility. The model complexity is not limited by a fixed number of parameters; it adapts based on the observed data and the properties encoded in the prior (specifically, the covariance or kernel function, which we'll discuss shortly). This approach naturally incorporates uncertainty quantification: because we infer a distribution over functions, we get predictions that come with associated confidence intervals, reflecting areas where the model is more or less certain.
Consider a simple regression scenario. A parametric model like linear regression might struggle if the true relationship is non-linear. A Bayesian non-parametric model like a GP can capture complex patterns without requiring us to specify the exact form of the non-linearity beforehand.
A conceptual comparison of a simple linear parametric fit with a more flexible non-parametric fit: the non-parametric approach can better capture the data's trend and provides uncertainty bands that typically widen in regions with less data.
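As a rough illustration of that comparison (not a figure from this chapter, and not the derivation we develop later), the sketch below fits both an ordinary least squares line and a GP regression model to a handful of noisy observations; the kernel, noise level, and data-generating function are all assumptions made for the example. The GP's posterior standard deviation grows outside the range of the training inputs, which is exactly the widening uncertainty band described above.

```python
# Rough sketch of GP regression predictions with uncertainty, compared to a
# linear least squares fit. Hyperparameters and data are illustrative only.
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0, variance=1.0):
    return variance * np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / lengthscale**2)

rng = np.random.default_rng(1)
x_train = rng.uniform(-4, 4, size=12)                  # sparse, unevenly spaced inputs
y_train = np.sin(x_train) + 0.1 * rng.standard_normal(12)
x_test = np.linspace(-6, 6, 200)

noise_var = 0.1**2
K = rbf_kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
K_s = rbf_kernel(x_train, x_test)                      # train/test covariances
K_ss = rbf_kernel(x_test, x_test)

# Standard Gaussian conditioning gives the posterior mean and covariance of f(x_test).
L = np.linalg.cholesky(K)
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
mean = K_s.T @ alpha
v = np.linalg.solve(L, K_s)
cov = K_ss - v.T @ v
std = np.sqrt(np.clip(np.diag(cov), 0.0, None))

# 95% credible band; it widens outside [-4, 4], where there is no training data.
upper, lower = mean + 1.96 * std, mean - 1.96 * std

# For contrast, an ordinary least squares line fitted to the same data.
slope, intercept = np.polyfit(x_train, y_train, 1)
linear_fit = slope * x_test + intercept
```

The linear fit commits to a single straight line, while the GP adapts its shape to the observations and reports where it is uncertain; the later sections of this chapter derive the conditioning step used here in full.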
This chapter will guide you through the mathematical foundations and practical implementation of Gaussian Processes. We'll start by formally defining GPs as priors over functions, explore how covariance functions shape these priors, and derive the equations for making predictions in regression settings. You'll learn how to handle hyperparameters and extend the framework to classification, along with techniques to make GPs computationally feasible for larger datasets.