Okay, let's put the theory into practice. Having understood the definition of Gaussian Processes, the role of kernels, and the formulations for regression and classification, we'll now work through implementing these models. We'll use Python and the popular scikit-learn library, which provides convenient implementations for standard GP models. While other libraries like GPy or GPflow offer more advanced features and flexibility, scikit-learn is excellent for getting started with practical applications.
Make sure you have scikit-learn, numpy, and matplotlib (or another plotting library like plotly) installed.
pip install scikit-learn numpy matplotlib plotly
Let's start with a common use case: non-linear regression. We want to fit a function to noisy data points and obtain not just point predictions but also a measure of uncertainty about those predictions.
First, we need some data. We'll create a synthetic dataset based on a known non-linear function, adding some Gaussian noise. This allows us to visually compare our GP fit to the ground truth.
import numpy as np
# Define the true function
def true_function(X):
    return np.sin(X * 1.5 * np.pi).ravel()
# Generate training data
rng = np.random.RandomState(1)
X_train = rng.rand(30) * 2 - 1 # 30 points between -1 and 1
y_train = true_function(X_train) + rng.randn(30) * 0.2 # Add noise
# Generate test points for prediction
X_test = np.linspace(-1.5, 1.5, 100).reshape(-1, 1)
# Reshape training data for scikit-learn
X_train = X_train.reshape(-1, 1)
The choice of kernel is significant because it encodes our prior assumptions about the function (e.g., smoothness, periodicity). A common default choice is the Radial Basis Function (RBF) kernel, also known as the squared exponential kernel, which assumes the function is smooth. We often combine it with a WhiteKernel to account for noise in the observations.
The RBF kernel has a length-scale parameter l controlling the smoothness (how quickly correlations decay with distance), and the WhiteKernel has a noise-level parameter σ_n².
In scikit-learn, kernels are combined with the + and * operators. Here we multiply the RBF by a ConstantKernel to control the overall signal variance (σ_f²) and add the WhiteKernel for the observation noise.
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel as C, WhiteKernel
# Define the kernel: ConstantKernel * RBF + WhiteKernel
# C() controls the amplitude (variance)
# RBF() provides the smoothness assumption (length_scale)
# WhiteKernel() accounts for noise
kernel = C(1.0, (1e-3, 1e3)) * RBF(1.0, (1e-2, 1e2)) + WhiteKernel(0.1, (1e-10, 1e1))
# Instantiate the GP model
gp_regressor = GaussianProcessRegressor(
    kernel=kernel,
    n_restarts_optimizer=10,  # Run the optimizer from multiple starting points
    random_state=42
)
We set bounds on the hyperparameters to guide the optimization process. n_restarts_optimizer helps avoid local optima when fitting the hyperparameters.
Fitting the GP involves optimizing the kernel hyperparameters (like the RBF length scale, variance, and the noise level) by maximizing the log marginal likelihood of the training data.
# Fit the GP to the training data
gp_regressor.fit(X_train, y_train)
# Print the optimized kernel parameters
print(f"Optimized Kernel: {gp_regressor.kernel_}")
# Example Output: Optimized Kernel: 1**2 * RBF(length_scale=0.406) + WhiteKernel(noise_level=0.0416)
The fitting process found hyperparameters that best explain the observed data according to the model structure defined by the kernel.
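You can also inspect the objective that the optimizer maximized. The snippet below is a quick check using scikit-learn's log_marginal_likelihood_value_ attribute and log_marginal_likelihood() method; the printed value will depend on the data and the optimizer run.
# Log marginal likelihood at the optimized hyperparameters
print(f"Log marginal likelihood: {gp_regressor.log_marginal_likelihood_value_:.3f}")
# It can also be evaluated at other hyperparameter values (given in log-transformed space)
print(gp_regressor.log_marginal_likelihood(gp_regressor.kernel_.theta))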
Now we can use the fitted model to make predictions on our test points. Importantly, the GP provides both the mean prediction and the standard deviation (which represents the uncertainty).
# Predict on test points
y_pred_mean, y_pred_std = gp_regressor.predict(X_test, return_std=True)
Let's visualize the GP fit, including the training data, the true function (if known), the GP mean prediction, and the confidence interval (typically mean ± 1.96 * standard deviation for a 95% confidence interval).
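The plotting code is not part of the model itself; a minimal sketch with matplotlib (the figure size, colors, and styling below are just illustrative choices) could look like this:
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 5))
# True underlying function (known here because the data is synthetic)
plt.plot(X_test, true_function(X_test), "--", color="gray", label="True function")
# Noisy training observations
plt.scatter(X_train, y_train, color="red", zorder=3, label="Training data")
# GP mean prediction
plt.plot(X_test, y_pred_mean, color="blue", label="GP mean")
# 95% confidence interval: mean +/- 1.96 * standard deviation
plt.fill_between(X_test.ravel(),
                 y_pred_mean - 1.96 * y_pred_std,
                 y_pred_mean + 1.96 * y_pred_std,
                 color="blue", alpha=0.2, label="95% confidence interval")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()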
GP regression fit to noisy sinusoidal data. The blue line is the GP mean prediction, the red dots are training data, the gray dashed line is the true underlying function, and the shaded blue area represents the 95% confidence interval.
Observe how the confidence interval (uncertainty) is smaller near the training data points and grows larger in regions where data is sparse. This is a fundamental property and advantage of GPs: they naturally quantify their own prediction uncertainty.
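To make this concrete, you can compare the average predictive standard deviation inside and outside the training range using the arrays computed above (the exact numbers depend on the random seed):
# Average predictive uncertainty inside vs. outside the training range [-1, 1]
inside = np.abs(X_test.ravel()) <= 1.0
print(f"Mean std inside [-1, 1]:  {y_pred_std[inside].mean():.3f}")
print(f"Mean std outside [-1, 1]: {y_pred_std[~inside].mean():.3f}")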
Now, let's tackle a binary classification problem. The challenge here is that the output is discrete (e.g., 0 or 1), not continuous, so the likelihood function is non-Gaussian (typically Bernoulli). This prevents exact Bayesian inference. scikit-learn's GaussianProcessClassifier uses the Laplace approximation internally to handle this.
We'll create a simple 1D dataset where the class depends non-linearly on the input feature.
# Generate 1D classification data
rng = np.random.RandomState(3)
X_class = rng.rand(80, 1) * 4 - 2 # 80 points between -2 and 2
# Class probability depends non-linearly on X
p_class1 = 1 / (1 + np.exp(-np.sin(X_class.ravel() * 2)))
y_class = (rng.rand(80) < p_class1).astype(int) # Assign class 0 or 1 based on probability
# Test points for visualization
X_test_class = np.linspace(-2.5, 2.5, 100).reshape(-1, 1)
Similar to regression, we need a kernel. An RBF kernel is often a reasonable starting point for classification tasks as well, assuming some smoothness in the underlying latent function that determines class probabilities.
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF, ConstantKernel as C
# Define kernel (RBF often works well)
kernel_class = C(1.0, (1e-2, 1e2)) * RBF(1.0, (1e-1, 1e1))
# Instantiate the GP classifier
# Laplace approximation is used implicitly
gp_classifier = GaussianProcessClassifier(
    kernel=kernel_class,
    n_restarts_optimizer=10,
    random_state=42
)
Fitting involves optimizing hyperparameters based on the approximate marginal likelihood (derived from the Laplace approximation).
# Fit the GP classifier
gp_classifier.fit(X_class, y_class)
# Print optimized kernel
print(f"Optimized Kernel (Classification): {gp_classifier.kernel_}")
# Example Output: Optimized Kernel (Classification): 2.91**2 * RBF(length_scale=1.04)
For classification, we're often interested in the probability of belonging to each class.
# Predict class probabilities on test points
y_prob_class = gp_classifier.predict_proba(X_test_class)
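If you need hard labels rather than probabilities, the standard estimator methods predict and score are also available. The check below uses the training data, since this example has no separate labeled test set.
# Hard class labels (most probable class per test point)
y_pred_class = gp_classifier.predict(X_test_class)
# Mean accuracy on the training data
print(f"Training accuracy: {gp_classifier.score(X_class, y_class):.2f}")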
We can plot the training data points and the predicted probability of belonging to class 1.
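As before, a minimal matplotlib sketch might look like the following (markers and colors are chosen to match the figure description below):
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 4))
# Training points plotted at their class label (0 or 1)
plt.scatter(X_class[y_class == 0].ravel(), y_class[y_class == 0],
            marker="o", color="red", label="Class 0")
plt.scatter(X_class[y_class == 1].ravel(), y_class[y_class == 1],
            marker="x", color="teal", label="Class 1")
# Predicted probability of class 1 over the test grid
plt.plot(X_test_class, y_prob_class[:, 1], color="blue", label="P(class 1)")
plt.axhline(0.5, color="gray", linestyle="--", linewidth=1)  # decision threshold
plt.xlabel("x")
plt.ylabel("Class label / probability of class 1")
plt.legend()
plt.show()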
GP classification results. Dots represent training data points (red circles for class 0, teal crosses for class 1). The blue line shows the predicted probability of belonging to class 1.
The GP classifier provides a smooth probability transition between classes. Regions where the probability is close to 0.5 indicate higher uncertainty about the classification.
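One way to make that uncertainty explicit is to flag test points whose predicted probability falls close to 0.5 (the 0.1 band below is an arbitrary illustrative threshold):
# Count test points where the classifier is uncertain (P(class 1) within 0.5 +/- 0.1)
uncertain = np.abs(y_prob_class[:, 1] - 0.5) < 0.1
print(f"{uncertain.sum()} of {len(X_test_class)} test points lie in the uncertain band")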
Before wrapping up, a few practical points are worth noting. The kernel encodes your assumptions about the function, so choose it to match them: if you expect periodic behavior, try an ExpSineSquared kernel; if you expect linear trends, add a DotProduct kernel. Combining kernels (+ for addition, * for multiplication) allows building complex priors. Hyperparameter optimization helps tune the chosen kernel structure, but the initial structure selection relies on domain knowledge or exploration.
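For example, a prior combining a periodic component with a linear trend and observation noise could be written as follows (a hypothetical kernel definition, not tuned for any particular dataset):
from sklearn.gaussian_process.kernels import ExpSineSquared, DotProduct, WhiteKernel

# Hypothetical composite prior: periodic component + linear trend + observation noise
composite_kernel = (
    ExpSineSquared(length_scale=1.0, periodicity=3.0)
    + DotProduct(sigma_0=1.0)
    + WhiteKernel(noise_level=0.1)
)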
Exact GP inference also becomes expensive as datasets grow (training cost scales cubically with the number of points), which is where sparse and variational approximations come in. Libraries such as GPflow (based on TensorFlow) or GPyTorch (based on PyTorch) are specifically designed with these scalable approximations in mind.
While scikit-learn automates hyperparameter fitting, understanding that it maximizes the (log) marginal likelihood is important. The n_restarts_optimizer parameter is valuable because the optimization landscape can have multiple local optima.
This practical session demonstrated how to implement GP regression and classification using scikit-learn. You've seen how to define kernels, fit models, make predictions with uncertainty, and visualize the results. Remember that while these examples used simple kernels and datasets, the power of GPs lies in their flexibility for more complex problems, especially when combined with thoughtful kernel engineering and, for larger datasets, appropriate approximation techniques.