Covariance Functions (Kernels): Properties and Selection
As introduced previously, a Gaussian Process is completely specified by its mean function m(x) and its covariance function k(x,x′). While the mean function defines the average expected value of the function at any point x, it's the covariance function, often called the kernel, that truly defines the shape and structure of the functions we can model. The kernel encodes our assumptions about the function we are trying to learn, such as its smoothness, periodicity, or stationarity.
The Role of the Covariance Function (Kernel)
The kernel k(x,x′) quantifies the relationship between the function's values at two different input points, x and x′. Specifically, it defines the covariance between the random variables f(x) and f(x′):

$$k(x, x') = \mathrm{Cov}\big(f(x), f(x')\big) = \mathbb{E}\big[(f(x) - m(x))(f(x') - m(x'))\big]$$
If points x and x′ are considered "similar" by the kernel (i.e., k(x,x′) is large), then we expect their corresponding output values f(x) and f(x′) to be close. Conversely, if the kernel deems them dissimilar (k(x,x′) is small or zero), their output values are less correlated.
For any finite set of $N$ input points $X = \{x_1, \dots, x_N\}$, the kernel function defines the $N \times N$ covariance matrix $K$, where each element $K_{ij} = k(x_i, x_j)$. This matrix is central to GP calculations.
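As a concrete illustration, here is a minimal NumPy sketch (the helper names are illustrative, not from any particular library) of how a kernel function turns a set of input points into the covariance matrix $K$, using the squared exponential kernel introduced later in this section as the example kernel:

```python
import numpy as np

def example_kernel(x, x_prime, lengthscale=1.0, signal_var=1.0):
    # Squared exponential kernel (defined later in this section),
    # used here purely as a concrete example of k(x, x').
    return signal_var * np.exp(-0.5 * (x - x_prime) ** 2 / lengthscale ** 2)

def covariance_matrix(xs, kernel):
    # K_ij = k(x_i, x_j) for every pair of input points.
    n = len(xs)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel(xs[i], xs[j])
    return K

xs = np.array([0.0, 0.5, 1.0, 2.0])
K = covariance_matrix(xs, example_kernel)
# K is 4 x 4, symmetric, with K[i, i] equal to the signal variance.
```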
Essential Properties of Kernels
For a function k(⋅,⋅) to be a valid covariance function, it must satisfy two main properties:
Symmetry: The covariance between f(x) and f(x′) must be the same as between f(x′) and f(x).
$$k(x, x') = k(x', x)$$
Positive Semi-Definite (PSD): For any finite set of points $\{x_1, \dots, x_N\}$ and any real-valued vector $\mathbf{a} = [a_1, \dots, a_N]^\top$, the resulting covariance matrix $K$ must satisfy:
$$\mathbf{a}^\top K \mathbf{a} \ge 0$$
This ensures that the variance of any linear combination of the function values, $\mathrm{Var}\!\left(\sum_i a_i f(x_i)\right) = \mathbf{a}^\top K \mathbf{a}$, is non-negative, a fundamental requirement for any valid covariance matrix. Mercer's theorem provides a deeper connection, stating that any continuous, symmetric, PSD kernel can be thought of as an inner product in some (potentially infinite-dimensional) feature space.
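Both properties are easy to check numerically. The following sketch (plain NumPy, with illustrative values) verifies symmetry and positive semi-definiteness for a small kernel matrix:

```python
import numpy as np

# Build a small kernel matrix (squared exponential, defined later in this section).
xs = np.linspace(0.0, 3.0, 5)
K = np.exp(-0.5 * (xs[:, None] - xs[None, :]) ** 2)

# Symmetry: K equals its transpose.
print(np.allclose(K, K.T))                          # True

# Positive semi-definiteness: all eigenvalues >= 0 (up to round-off).
print(np.all(np.linalg.eigvalsh(K) >= -1e-10))      # True

# Equivalently, a^T K a >= 0 for any vector a (the variance of sum_i a_i f(x_i)).
a = np.random.default_rng(1).standard_normal(len(xs))
print(a @ K @ a >= -1e-10)                          # True
```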
Interpreting Kernel Hyperparameters
Most useful kernels have parameters, often called hyperparameters, that control their behavior. Understanding these is essential for effective GP modeling. Two common hyperparameters are:
Lengthscale ($l$): This parameter controls how quickly the correlation between function values decays as the distance between input points increases. A small lengthscale means correlation drops off quickly, leading to functions that vary rapidly (wiggly). A large lengthscale implies that points further apart remain correlated, resulting in smoother functions.
Output Variance / Signal Variance ($\sigma_f^2$): This parameter scales the overall variance of the function values. It controls the typical amplitude or vertical range of the functions drawn from the GP prior. $k(x, x) = \sigma_f^2$ often represents the prior variance at any single point $x$ (assuming a zero mean function).
The effect of the lengthscale is illustrated below, showing samples drawn from GPs with a Squared Exponential kernel but different lengthscales.
Function samples drawn from a zero-mean GP prior with a Squared Exponential kernel ($\sigma_f^2 = 1$). The purple line uses a shorter lengthscale ($l = 0.5$), resulting in a rapidly varying function. The blue line uses a longer lengthscale ($l = 2.0$), producing a much smoother function.
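Figures like this can be reproduced by sampling from the multivariate Gaussian defined by the kernel matrix. Below is a minimal sketch (assuming NumPy; the function name is illustrative) that draws one function sample per lengthscale:

```python
import numpy as np

def se_cov(xs, lengthscale, signal_var=1.0):
    # K_ij = sigma_f^2 * exp(-(x_i - x_j)^2 / (2 l^2))
    d = xs[:, None] - xs[None, :]
    return signal_var * np.exp(-0.5 * d ** 2 / lengthscale ** 2)

xs = np.linspace(0.0, 10.0, 200)
rng = np.random.default_rng(0)

samples = {}
for lengthscale in (0.5, 2.0):
    K = se_cov(xs, lengthscale)
    # Small diagonal jitter keeps the covariance numerically positive definite.
    samples[lengthscale] = rng.multivariate_normal(
        np.zeros(len(xs)), K + 1e-8 * np.eye(len(xs)))
# samples[0.5] varies rapidly; samples[2.0] is much smoother.
```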
A Gallery of Common Kernels
Choosing the right kernel is often guided by prior knowledge about the function's expected characteristics. Here are some widely used kernels:
Squared Exponential (SE) Kernel / Radial Basis Function (RBF) Kernel
This is arguably the most common kernel, often a good starting point.
$$k_{\text{SE}}(x, x') = \sigma_f^2 \exp\!\left( -\frac{\lVert x - x' \rVert^2}{2l^2} \right)$$
Properties: It produces very smooth functions (infinitely differentiable). It's stationary, meaning its value depends only on the distance $\lVert x - x' \rVert$ between the points, not their absolute location.
ARD: For multi-dimensional inputs $x \in \mathbb{R}^D$, you can use Automatic Relevance Determination (ARD) by assigning a separate lengthscale $l_d$ to each dimension:
$$k_{\text{SE-ARD}}(x, x') = \sigma_f^2 \exp\!\left( -\frac{1}{2} \sum_{d=1}^{D} \frac{(x_d - x_d')^2}{l_d^2} \right)$$
After optimization, dimensions with large $l_d$ values are effectively ignored, indicating low relevance for the prediction.
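A minimal sketch of the ARD variant (NumPy, illustrative names) shows how a very large lengthscale effectively switches a dimension off:

```python
import numpy as np

def se_ard(x, x_prime, lengthscales, signal_var=1.0):
    # k(x, x') = sigma_f^2 * exp(-0.5 * sum_d (x_d - x'_d)^2 / l_d^2)
    x, x_prime, lengthscales = map(np.asarray, (x, x_prime, lengthscales))
    scaled = (x - x_prime) / lengthscales
    return signal_var * np.exp(-0.5 * np.sum(scaled ** 2))

# A huge lengthscale in the second dimension effectively switches it off:
# moving along that dimension barely changes the covariance.
print(se_ard([0.0, 0.0], [0.0, 5.0], lengthscales=[1.0, 1e6]))  # ~1.0
print(se_ard([0.0, 0.0], [0.0, 5.0], lengthscales=[1.0, 1.0]))  # ~3.7e-6
```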
Matérn Kernels
The Matérn family offers more flexibility in controlling the smoothness of the function.
$$k_{\text{Matérn}}(x, x') = \sigma_f^2 \, \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \frac{\sqrt{2\nu}\, r}{l} \right)^{\!\nu} K_\nu\!\left( \frac{\sqrt{2\nu}\, r}{l} \right)$$
where $r = \lVert x - x' \rVert$, $\Gamma(\cdot)$ is the gamma function, and $K_\nu(\cdot)$ is the modified Bessel function of the second kind.
Properties: The parameter ν>0 controls the differentiability of the function samples. Functions drawn from a GP with a Matérn kernel are k-times differentiable if ν>k. Common choices are ν=3/2 (once differentiable) and ν=5/2 (twice differentiable), which are often more realistic for physical processes than the infinite differentiability of the SE kernel. As ν→∞, the Matérn kernel converges to the SE kernel. The case ν=1/2 yields the Exponential Kernel, which produces continuous but not differentiable functions (rough paths, similar to an Ornstein-Uhlenbeck process). Matérn kernels are stationary.
Hyperparameters: Lengthscale $l$, output variance $\sigma_f^2$, smoothness parameter $\nu$. Often $\nu$ is fixed (e.g., to 3/2 or 5/2) rather than optimized.
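For the common fixed choices $\nu = 3/2$ and $\nu = 5/2$, the Bessel-function form simplifies to well-known closed-form expressions, sketched below (NumPy, illustrative names):

```python
import numpy as np

def matern32(r, lengthscale=1.0, signal_var=1.0):
    # nu = 3/2: k(r) = sigma_f^2 * (1 + sqrt(3) r / l) * exp(-sqrt(3) r / l)
    z = np.sqrt(3.0) * r / lengthscale
    return signal_var * (1.0 + z) * np.exp(-z)

def matern52(r, lengthscale=1.0, signal_var=1.0):
    # nu = 5/2: k(r) = sigma_f^2 * (1 + sqrt(5) r / l + 5 r^2 / (3 l^2)) * exp(-sqrt(5) r / l)
    z = np.sqrt(5.0) * r / lengthscale
    return signal_var * (1.0 + z + z ** 2 / 3.0) * np.exp(-z)

r = np.abs(1.5 - 0.2)            # r = ||x - x'|| for scalar inputs
print(matern32(r), matern52(r))  # the smoother 5/2 kernel keeps slightly more correlation here
```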
Periodic Kernel
Ideal for modeling functions that exhibit repetitive patterns.
$$k_{\text{Per}}(x, x') = \sigma_f^2 \exp\!\left( -\frac{2\sin^2\!\big(\pi \lVert x - x' \rVert / p\big)}{l^2} \right)$$
Properties: Models functions that repeat with a period $p$. The lengthscale $l$ here controls the smoothness within a period. It's useful for seasonal data or signals with known frequencies. Stationary.
Hyperparameters: Lengthscale $l$, output variance $\sigma_f^2$, period $p$.
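A direct translation of the formula (NumPy sketch, illustrative names) highlights the defining property that points exactly one period apart are perfectly correlated:

```python
import numpy as np

def periodic_kernel(x, x_prime, lengthscale, period, signal_var=1.0):
    # k(x, x') = sigma_f^2 * exp(-2 * sin^2(pi * |x - x'| / p) / l^2)
    s = np.sin(np.pi * np.abs(x - x_prime) / period)
    return signal_var * np.exp(-2.0 * s ** 2 / lengthscale ** 2)

# Inputs exactly one period apart are perfectly correlated, regardless of distance.
print(periodic_kernel(0.0, 3.0, lengthscale=1.0, period=3.0))   # 1.0 (up to round-off)
print(periodic_kernel(0.0, 1.5, lengthscale=1.0, period=3.0))   # ~0.135, half a period apart
```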
Linear Kernel
Useful for modeling functions with an underlying linear trend.
$$k_{\text{Lin}}(x, x') = \sigma_b^2 + \sigma_v^2 (x - c)(x' - c)$$
Properties: Generates functions that are linear trends. It's non-stationary because the covariance depends on the absolute locations $x$ and $x'$, not just their difference. The parameter $c$ represents an offset. $\sigma_b^2$ is a constant offset variance, and $\sigma_v^2$ controls the variance of the slope.
Hyperparameters: Offset variance $\sigma_b^2$, slope variance $\sigma_v^2$, offset $c$.
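The linear kernel is equally simple to write down; the sketch below (illustrative names and values) shows its non-stationarity, since the covariance of two points depends on where they sit relative to $c$ rather than on their separation:

```python
def linear_kernel(x, x_prime, offset_var=0.1, slope_var=1.0, c=0.0):
    # k(x, x') = sigma_b^2 + sigma_v^2 * (x - c) * (x' - c)
    return offset_var + slope_var * (x - c) * (x_prime - c)

# The covariance depends on where the points sit relative to c, not on their
# separation -- the hallmark of a non-stationary kernel.
print(linear_kernel(1.0, 2.0))    # 2.1
print(linear_kernel(3.0, 4.0))    # 12.1 (same separation, different covariance)
```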
Below is a comparison of samples from different kernel types.
Function samples drawn from zero-mean GP priors ($\sigma_f^2 = 1$) using different kernels. The Squared Exponential (SE) produces the smoothest sample. The Matérn 3/2 sample is less smooth (once differentiable). The Periodic sample clearly shows a repeating pattern.
Kernel Engineering: Combining Kernels
The real power of kernels often comes from combining basic kernels to capture more complex structures in the data. If $k_1$ and $k_2$ are valid kernels, then the following combinations are also valid kernels:
Addition: $k_{\text{sum}}(x, x') = k_1(x, x') + k_2(x, x')$
This corresponds to modeling the function $f(x)$ as the sum of two independent Gaussian Processes, $f(x) = f_1(x) + f_2(x)$, where $f_1 \sim \mathcal{GP}(0, k_1)$ and $f_2 \sim \mathcal{GP}(0, k_2)$. This is extremely useful for modeling additive effects, like a long-term trend plus a seasonal component ($k_{\text{Lin}} + k_{\text{Per}}$).
Multiplication: $k_{\text{prod}}(x, x') = k_1(x, x') \times k_2(x, x')$
Multiplication can create interesting interactions. For example, multiplying a smooth kernel (like SE) with a periodic kernel ($k_{\text{SE}} \times k_{\text{Per}}$) can model a periodic function whose amplitude changes smoothly over time. Multiplying a stationary kernel with a non-stationary one (such as the Linear kernel) is also a common way to introduce non-stationary effects.
By carefully combining kernels, you can encode sophisticated prior assumptions about the function's structure directly into the model.
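As a sketch of how such combinations look in code (plain Python/NumPy, with illustrative names and hyperparameter values), the two compositions mentioned above can be written as:

```python
import numpy as np

def se(x, xp, lengthscale=5.0):
    return np.exp(-0.5 * (x - xp) ** 2 / lengthscale ** 2)

def per(x, xp, lengthscale=1.0, period=1.0):
    return np.exp(-2.0 * np.sin(np.pi * np.abs(x - xp) / period) ** 2 / lengthscale ** 2)

def lin(x, xp, offset_var=0.1, slope_var=0.5, c=0.0):
    return offset_var + slope_var * (x - c) * (xp - c)

# Sum: additive structure, e.g. a long-term linear trend plus a seasonal component.
def trend_plus_seasonal(x, xp):
    return lin(x, xp) + per(x, xp)

# Product: a periodic pattern whose amplitude varies smoothly over long timescales.
def locally_periodic(x, xp):
    return se(x, xp) * per(x, xp)

print(trend_plus_seasonal(0.5, 2.5), locally_periodic(0.5, 2.5))
```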
Strategies for Kernel Selection
Choosing the appropriate kernel is a critical step in GP modeling. Here's a general approach:
Leverage Domain Knowledge: What do you know about the underlying process generating the data? Is it expected to be smooth (SE, Matérn)? Does it have cycles (Periodic)? Is there a baseline trend (Linear)? Start with kernels that reflect these assumptions.
Visual Inspection: Plotting the data can often reveal patterns like trends, seasonality, or abrupt changes that suggest suitable kernel structures or combinations.
Start Simple: Begin with standard kernels like SE or Matérn (e.g., Matérn 3/2 or 5/2). These are often flexible enough for many applications.
Consider Combinations: If simple kernels don't capture the structure, think about additive or multiplicative combinations (e.g., Trend + Seasonality, Smoothly varying amplitude).
Hyperparameter Optimization: Regardless of the initial choice, the kernel's hyperparameters ($l$, $\sigma_f^2$, $p$, etc.) must be tuned to the data. This is typically done by maximizing the marginal likelihood, which we will discuss in the next section.
Model Comparison: If you have several candidate kernels or kernel combinations, use model comparison techniques (like comparing marginal likelihoods or using cross-validation) to select the one that best explains the data.
The selection of the kernel fundamentally shapes the GP model. It defines the space of functions the GP considers plausible before seeing the data. A well-chosen kernel, combined with optimized hyperparameters, allows the GP to effectively learn from the data and make accurate predictions with meaningful uncertainty estimates. Next, we'll look into how to learn these crucial hyperparameters from data.