Okay, we've established that Gaussian Processes (GPs) provide a flexible, non-parametric way to model functions, placing a prior directly over the function space. We saw how the covariance function, or kernel, encodes our assumptions about the function's properties, like smoothness or periodicity. We also formulated GP regression, allowing us to predict $f_*$ at new points $x_*$ along with uncertainty estimates $\mathbb{E}[f_* \mid y]$ and $\mathrm{Var}[f_* \mid y]$.
However, the kernel itself, like the popular Radial Basis Function (RBF) kernel, $k(x, x') = \sigma_f^2 \exp\left(-\frac{\|x - x'\|^2}{2l^2}\right)$, has parameters: the signal variance $\sigma_f^2$ and the lengthscale $l$. Similarly, the Gaussian likelihood often assumed in GP regression, $p(y \mid f) = \mathcal{N}(y \mid f, \sigma_n^2)$, introduces a noise variance parameter $\sigma_n^2$. These parameters, collectively denoted as $\theta = \{\sigma_f^2, l, \sigma_n^2\}$, are called hyperparameters. How do we choose optimal values for them?
Unlike typical supervised learning where we might use cross-validation to tune hyperparameters, the Bayesian nature of GPs offers a more integrated approach: optimizing the marginal likelihood (also known as the evidence).
The core idea is to find the hyperparameter values θ that maximize the probability of observing the actual data y given the inputs X and the hyperparameters θ. This probability is obtained by integrating out (or marginalizing) the unknown function values f at the training points:
$$p(y \mid X, \theta) = \int p(y \mid f, \sigma_n^2)\, p(f \mid X, \theta_{\text{kernel}})\, df$$

Here, $\theta_{\text{kernel}}$ represents the kernel hyperparameters (like $l$ and $\sigma_f^2$), and we explicitly show the noise variance $\sigma_n^2$ from the likelihood term.
Because both the prior $p(f \mid X, \theta_{\text{kernel}})$ and the likelihood $p(y \mid f, \sigma_n^2)$ are Gaussian (assuming a Gaussian likelihood for regression), this integral can be solved analytically. The result is itself a Gaussian distribution for $y$:
$$y \mid X, \theta \sim \mathcal{N}(0, K_y) = \mathcal{N}(0, K_f + \sigma_n^2 I)$$

where $K_f$ is the $N \times N$ covariance matrix evaluated at the training inputs $X$ using the kernel parameters, $K_{ij} = k(x_i, x_j \mid \theta_{\text{kernel}})$, and $I$ is the identity matrix. $K_y = K_f + \sigma_n^2 I$ is the covariance matrix of the observed data points $y$, incorporating both function variability (from $K_f$) and observation noise (from $\sigma_n^2 I$).
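To make this concrete, here is a minimal NumPy sketch (the function and variable names are illustrative, not from any particular library) that builds $K_y$ for an RBF kernel on a few 1D inputs:

import numpy as np

def rbf_kernel(X1, X2, sigma_f=1.0, lengthscale=1.0):
    # k(x, x') = sigma_f^2 * exp(-||x - x'||^2 / (2 * lengthscale^2))
    sqdist = np.sum(X1**2, axis=1)[:, None] + np.sum(X2**2, axis=1)[None, :] - 2.0 * X1 @ X2.T
    return sigma_f**2 * np.exp(-0.5 * sqdist / lengthscale**2)

# Toy 1D training inputs (illustrative)
X_train = np.linspace(0.0, 1.0, 5).reshape(-1, 1)

# Hyperparameters theta = {sigma_f^2, l, sigma_n^2}, illustrative values
sigma_f, lengthscale, sigma_n2 = 1.0, 0.5, 0.1

K_f = rbf_kernel(X_train, X_train, sigma_f, lengthscale)   # N x N kernel matrix
K_y = K_f + sigma_n2 * np.eye(len(X_train))                # add observation noise on the diagonal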
For numerical stability and convenience, we typically work with the log marginal likelihood:
$$\log p(y \mid X, \theta) = -\frac{1}{2} y^T K_y^{-1} y - \frac{1}{2} \log |K_y| - \frac{N}{2} \log(2\pi)$$

Let's break down the important terms:
Data Fit Term: $-\frac{1}{2} y^T K_y^{-1} y$. This term measures how well the model explains the observed data $y$. It rewards hyperparameter settings where the observed outputs $y$ are probable under the Gaussian distribution defined by the covariance $K_y$. If $y$ is unlikely under this distribution (a poor fit), this term becomes a large negative number (a penalty).
Complexity Penalty Term: $-\frac{1}{2} \log |K_y|$. This term involves the logarithm of the determinant of the covariance matrix $K_y$. The determinant $|K_y|$ measures the "volume" spanned by the distribution. Hyperparameter settings that let the prior generate a wide range of possible datasets, such as a large signal variance $\sigma_f^2$, a large noise variance $\sigma_n^2$, or short lengthscales that leave the outputs nearly uncorrelated, produce a larger determinant and thus a larger penalty from this term. This term acts as an automatic "Occam's Razor," penalizing models that are overly flexible or that assume too much noise.
Normalization Constant: $-\frac{N}{2} \log(2\pi)$. This term is constant with respect to the hyperparameters $\theta$ and can be ignored during optimization.
Maximizing the log marginal likelihood forces a trade-off: finding hyperparameters that allow the model to fit the data well (term 1) without becoming unnecessarily complex (term 2).
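Translating the formula directly into code can make these terms concrete. The following is a minimal NumPy sketch (the function name is illustrative); it accepts the observation vector y and the covariance matrix K_y, such as the one built in the sketch above, and uses a Cholesky factorization instead of an explicit matrix inverse:

import numpy as np

def log_marginal_likelihood(y, K_y):
    # log p(y | X, theta) for a zero-mean GP with observation covariance K_y
    N = len(y)
    L = np.linalg.cholesky(K_y)                            # K_y = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))    # alpha = K_y^{-1} y
    data_fit = -0.5 * y @ alpha                            # -1/2 y^T K_y^{-1} y
    complexity = -np.sum(np.log(np.diag(L)))               # -1/2 log|K_y|, since log|K_y| = 2 sum log L_ii
    constant = -0.5 * N * np.log(2.0 * np.pi)              # -N/2 log(2 pi)
    return data_fit + complexity + constant

Evaluating this function for different hyperparameter settings (each giving a different K_y) lets you compare the data fit and complexity trade-off directly.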
The goal is to find the hyperparameters $\theta^*$ that maximize the log marginal likelihood:
$$\theta^* = \arg\max_{\theta} \log p(y \mid X, \theta)$$

Since the log marginal likelihood is generally differentiable with respect to the hyperparameters $\theta$ (including kernel parameters like $l$ and $\sigma_f^2$ and the noise variance $\sigma_n^2$), we can use standard gradient-based optimization algorithms. Common choices include conjugate gradient methods and quasi-Newton methods such as L-BFGS-B, which is the default optimizer in scikit-learn's GaussianProcessRegressor.
To use these methods, we need the partial derivatives of the log marginal likelihood with respect to each hyperparameter $\theta_j \in \theta$. The derivative has a relatively clean analytical form:
$$\frac{\partial}{\partial \theta_j} \log p(y \mid X, \theta) = \frac{1}{2} y^T K_y^{-1} \frac{\partial K_y}{\partial \theta_j} K_y^{-1} y - \frac{1}{2} \mathrm{Tr}\!\left(K_y^{-1} \frac{\partial K_y}{\partial \theta_j}\right)$$

Here, $\frac{\partial K_y}{\partial \theta_j}$ is the matrix containing the element-wise partial derivatives of the kernel function (plus the noise term if $\theta_j = \sigma_n^2$) with respect to the hyperparameter $\theta_j$. These derivatives are usually straightforward to compute for standard kernels. For instance, if $K_y = K_f + \sigma_n^2 I$ and $\theta_j$ is a kernel parameter like the lengthscale $l$, then $\frac{\partial K_y}{\partial \theta_j} = \frac{\partial K_f}{\partial l}$. If $\theta_j = \sigma_n^2$, then $\frac{\partial K_y}{\partial \theta_j} = I$.
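As a small worked example, here is a self-contained sketch (toy data and helper names are illustrative) of the analytic gradient with respect to the noise variance, where $\frac{\partial K_y}{\partial \sigma_n^2} = I$, checked against a central finite difference of the log marginal likelihood:

import numpy as np

def lml(y, K_y):
    # Log marginal likelihood of a zero-mean GP (same formula as the earlier sketch)
    L = np.linalg.cholesky(K_y)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return -0.5 * y @ alpha - np.sum(np.log(np.diag(L))) - 0.5 * len(y) * np.log(2 * np.pi)

def lml_grad_noise(y, K_f, sigma_n2):
    # Analytic derivative of the LML w.r.t. sigma_n^2, using dK_y/dsigma_n^2 = I
    K_y = K_f + sigma_n2 * np.eye(len(y))
    K_inv = np.linalg.inv(K_y)        # explicit inverse is fine for tiny N; prefer Cholesky solves in practice
    alpha = K_inv @ y
    dK = np.eye(len(y))               # dK_y / dsigma_n^2
    return 0.5 * alpha @ dK @ alpha - 0.5 * np.trace(K_inv @ dK)

# Tiny check on toy data (illustrative values)
rng = np.random.default_rng(0)
X = np.linspace(0, 1, 6).reshape(-1, 1)
K_f = 1.0 * np.exp(-0.5 * (X - X.T) ** 2 / 0.5 ** 2)   # RBF kernel matrix with sigma_f=1, l=0.5
y = rng.normal(size=6)
s2, eps = 0.1, 1e-6
analytic = lml_grad_noise(y, K_f, s2)
numeric = (lml(y, K_f + (s2 + eps) * np.eye(6)) - lml(y, K_f + (s2 - eps) * np.eye(6))) / (2 * eps)
print(analytic, numeric)   # the two values should agree to several decimal places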
Modern GP libraries (like GPy, GPflow, scikit-learn's GaussianProcessRegressor, GPyTorch) automatically compute these gradients and perform the optimization for you. You typically just need to define the kernel structure and provide the data.
# Conceptual example using scikit-learn's GaussianProcessRegressor
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel as C, WhiteKernel

# Toy 1D training data: a noisy sine curve (illustrative)
rng = np.random.default_rng(42)
X_train = np.sort(rng.uniform(0, 5, size=(25, 1)), axis=0)
y_train = np.sin(X_train).ravel() + 0.1 * rng.normal(size=25)

# Define the kernel with initial hyperparameters (these will be optimized)
# Kernel: k(x, x') = C(constant_value) * RBF(length_scale) + WhiteKernel(noise_level)
# The optimizer works on log-transformed parameters internally for numerical stability
kernel = C(1.0) * RBF(length_scale=1.0) + WhiteKernel(noise_level=1.0)

# Create the GP model
# n_restarts_optimizer helps find a better optimum by restarting from different points
gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=10, random_state=42)

# Fit the model: this step performs the log marginal likelihood maximization
gp.fit(X_train, y_train)

# After fitting, the optimized hyperparameters are stored
print(f"Optimized kernel: {gp.kernel_}")
print(f"Log-Marginal-Likelihood: {gp.log_marginal_likelihood(gp.kernel_.theta):.2f}")

# Example Output (Illustrative):
# Optimized kernel: 1.5**2 * RBF(length_scale=0.8) + WhiteKernel(noise_level=0.1)
# Log-Marginal-Likelihood: -15.23
While powerful, optimizing the marginal likelihood isn't without challenges. The LML surface is generally non-convex, so gradient-based optimizers can converge to a local optimum rather than the global one; restarting the optimization from several different initial hyperparameter values helps (this is the role of n_restarts_optimizer in the example above).

Let's visualize how the log marginal likelihood (LML) might vary for a simple 1D regression problem with an RBF kernel, considering only the lengthscale $l$ and signal variance $\sigma_f^2$ hyperparameters (keeping the noise $\sigma_n^2$ fixed for simplicity).
A contour plot showing the Log Marginal Likelihood (LML) values for different combinations of RBF kernel lengthscale and signal variance. The optimization algorithm searches for the peak (highest LML value) on this surface. The potentially complex shape highlights why multiple restarts might be needed to find the global optimum.
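A surface like this can be produced by sweeping a grid of lengthscale and signal-variance values and evaluating the LML at each point. Below is a minimal, self-contained sketch (toy data and variable names are illustrative, not taken from the figure above) that reuses the Cholesky-based LML formula from earlier:

import numpy as np
import matplotlib.pyplot as plt

def lml(y, K_y):
    # Log marginal likelihood of a zero-mean GP with observation covariance K_y
    L = np.linalg.cholesky(K_y)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return -0.5 * y @ alpha - np.sum(np.log(np.diag(L))) - 0.5 * len(y) * np.log(2 * np.pi)

# Toy 1D data (illustrative)
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(30, 1)), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.normal(size=30)
sigma_n2 = 0.01                      # noise variance held fixed

# Evaluate the LML on a grid of (lengthscale, signal variance)
lengthscales = np.logspace(-1, 1, 40)
signal_vars = np.logspace(-1, 1, 40)
sqdist = (X - X.T) ** 2
LML = np.zeros((len(signal_vars), len(lengthscales)))
for i, sf2 in enumerate(signal_vars):
    for j, l in enumerate(lengthscales):
        K_y = sf2 * np.exp(-0.5 * sqdist / l**2) + sigma_n2 * np.eye(len(y))
        LML[i, j] = lml(y, K_y)

# Contour plot of the LML surface; the optimizer searches for its peak
plt.contourf(lengthscales, signal_vars, LML, levels=30)
plt.xscale("log")
plt.yscale("log")
plt.xlabel("lengthscale l")
plt.ylabel("signal variance sigma_f^2")
plt.colorbar(label="log marginal likelihood")
plt.show()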
By finding the hyperparameters that maximize the probability of the observed data, marginal likelihood optimization provides a principled and often effective method for tuning GP models, directly embedding model selection within the Bayesian inference framework. This contrasts with methods like cross-validation, offering a more data-efficient approach, particularly when data is scarce. Once the optimal hyperparameters $\theta^*$ are found, they are plugged back into the GP predictive equations for making predictions on new data.