Okay, let's put the theory into practice. Having understood the definition of Gaussian Processes, the role of kernels, and the formulations for regression and classification, we'll now work through implementing these models. We'll use Python and the popular scikit-learn library, which provides convenient implementations for standard GP models. While other libraries like GPy or GPflow offer more advanced features and flexibility, scikit-learn is excellent for getting started with practical applications.
Make sure you have scikit-learn, numpy, and matplotlib (or another plotting library like plotly) installed.
pip install scikit-learn numpy matplotlib plotly
Let's start with a common use case: non-linear regression. We want to fit a function to noisy data points and obtain not just point predictions but also a measure of uncertainty about those predictions.
First, we need some data. We'll create a synthetic dataset based on a known non-linear function, adding some Gaussian noise. This allows us to visually compare our GP fit to the ground truth.
import numpy as np
# Define the true function
def true_function(X):
    return np.sin(X * 1.5 * np.pi).ravel()
# Generate training data
rng = np.random.RandomState(1)
X_train = rng.rand(30) * 2 - 1 # 30 points between -1 and 1
y_train = true_function(X_train) + rng.randn(30) * 0.2 # Add noise
# Generate test points for prediction
X_test = np.linspace(-1.5, 1.5, 100).reshape(-1, 1)
# Reshape training data for scikit-learn
X_train = X_train.reshape(-1, 1)
The choice of kernel is significant because it encodes our prior assumptions about the function (e.g., smoothness, periodicity). A common default choice is the Radial Basis Function (RBF) kernel, also known as the squared exponential kernel, which assumes the function is smooth. We often combine it with a WhiteKernel to account for noise in the observations.
The RBF kernel has a length-scale parameter l controlling the smoothness (how quickly correlations decay with distance), and the WhiteKernel has a noise-level parameter σ_n².
In scikit-learn, kernels are combined with the + and * operators. Here we multiply the RBF by a ConstantKernel to control the overall signal variance (σ_f²) and add the WhiteKernel for the observation noise.
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel as C, WhiteKernel
# Define the kernel: ConstantKernel * RBF + WhiteKernel
# C() controls the amplitude (variance)
# RBF() provides the smoothness assumption (length_scale)
# WhiteKernel() accounts for noise
kernel = C(1.0, (1e-3, 1e3)) * RBF(1.0, (1e-2, 1e2)) + WhiteKernel(0.1, (1e-10, 1e1))
# Instantiate the GP model
gp_regressor = GaussianProcessRegressor(
    kernel=kernel,
    n_restarts_optimizer=10,  # Run the optimizer from multiple starting points
    random_state=42
)
We set bounds on the hyperparameters to guide the optimization process. n_restarts_optimizer helps avoid local optima when fitting the hyperparameters.
Fitting the GP involves optimizing the kernel hyperparameters (like the RBF length scale, variance, and the noise level) by maximizing the log marginal likelihood of the training data.
# Fit the GP to the training data
gp_regressor.fit(X_train, y_train)
# Print the optimized kernel parameters
print(f"Optimized Kernel: {gp_regressor.kernel_}")
# Example Output: Optimized Kernel: 1**2 * RBF(length_scale=0.406) + WhiteKernel(noise_level=0.0416)
The fitting process found hyperparameters that best explain the observed data according to the model structure defined by the kernel.
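You can also inspect the objective that the optimizer maximized. The snippet below is a quick check using scikit-learn's log_marginal_likelihood_value_ attribute and log_marginal_likelihood() method; the printed value will depend on the data and the optimizer run.
# Log marginal likelihood at the optimized hyperparameters
print(f"Log marginal likelihood: {gp_regressor.log_marginal_likelihood_value_:.3f}")
# It can also be evaluated at other hyperparameter values (given in log-transformed space)
print(gp_regressor.log_marginal_likelihood(gp_regressor.kernel_.theta))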
Now we can use the fitted model to make predictions on our test points. Importantly, the GP provides both the mean prediction and the standard deviation (which represents the uncertainty).
# Predict on test points
y_pred_mean, y_pred_std = gp_regressor.predict(X_test, return_std=True)
Let's visualize the GP fit, including the training data, the true function (if known), the GP mean prediction, and the confidence interval (typically mean ± 1.96 * standard deviation for a 95% confidence interval).
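The plotting code is not part of the model itself; a minimal sketch with matplotlib (the figure size, colors, and styling below are just illustrative choices) could look like this:
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 5))
# True underlying function (known here because the data is synthetic)
plt.plot(X_test, true_function(X_test), "--", color="gray", label="True function")
# Noisy training observations
plt.scatter(X_train, y_train, color="red", zorder=3, label="Training data")
# GP mean prediction
plt.plot(X_test, y_pred_mean, color="blue", label="GP mean")
# 95% confidence interval: mean +/- 1.96 * standard deviation
plt.fill_between(X_test.ravel(),
                 y_pred_mean - 1.96 * y_pred_std,
                 y_pred_mean + 1.96 * y_pred_std,
                 color="blue", alpha=0.2, label="95% confidence interval")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()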
GP regression fit to noisy sinusoidal data. The blue line is the GP mean prediction, the red dots are training data, the gray dashed line is the true underlying function, and the shaded blue area represents the 95% confidence interval.
Observe how the confidence interval (uncertainty) is smaller near the training data points and grows larger in regions where data is sparse. This is a fundamental property and advantage of GPs: they naturally quantify their own prediction uncertainty.
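To make this concrete, you can compare the average predictive standard deviation inside and outside the training range using the arrays computed above (the exact numbers depend on the random seed):
# Average predictive uncertainty inside vs. outside the training range [-1, 1]
inside = np.abs(X_test.ravel()) <= 1.0
print(f"Mean std inside [-1, 1]:  {y_pred_std[inside].mean():.3f}")
print(f"Mean std outside [-1, 1]: {y_pred_std[~inside].mean():.3f}")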
Now, let's tackle a binary classification problem. The challenge here is that the output is discrete (e.g., 0 or 1), not continuous, so the likelihood function is non-Gaussian (typically Bernoulli). This prevents exact Bayesian inference. scikit-learn's GaussianProcessClassifier uses the Laplace approximation internally to handle this.
We'll create a simple 1D dataset where the class depends non-linearly on the input feature.
# Generate 1D classification data
rng = np.random.RandomState(3)
X_class = rng.rand(80, 1) * 4 - 2 # 80 points between -2 and 2
# Class probability depends non-linearly on X
p_class1 = 1 / (1 + np.exp(-np.sin(X_class.ravel() * 2)))
y_class = (rng.rand(80) < p_class1).astype(int) # Assign class 0 or 1 based on probability
# Test points for visualization
X_test_class = np.linspace(-2.5, 2.5, 100).reshape(-1, 1)
Similar to regression, we need a kernel. An RBF kernel is often a reasonable starting point for classification tasks as well, assuming some smoothness in the underlying latent function that determines class probabilities.
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF, ConstantKernel as C
# Define kernel (RBF often works well)
kernel_class = C(1.0, (1e-2, 1e2)) * RBF(1.0, (1e-1, 1e1))
# Instantiate the GP classifier
# Laplace approximation is used implicitly
gp_classifier = GaussianProcessClassifier(
    kernel=kernel_class,
    n_restarts_optimizer=10,
    random_state=42
)
Fitting involves optimizing hyperparameters based on the approximate marginal likelihood (derived from the Laplace approximation).
# Fit the GP classifier
gp_classifier.fit(X_class, y_class)
# Print optimized kernel
print(f"Optimized Kernel (Classification): {gp_classifier.kernel_}")
# Example Output: Optimized Kernel (Classification): 2.91**2 * RBF(length_scale=1.04)
For classification, we're often interested in the probability of belonging to each class.
# Predict class probabilities on test points
y_prob_class = gp_classifier.predict_proba(X_test_class)
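If you need hard labels rather than probabilities, the standard estimator methods predict and score are also available. The check below uses the training data, since this example has no separate labeled test set.
# Hard class labels (most probable class per test point)
y_pred_class = gp_classifier.predict(X_test_class)
# Mean accuracy on the training data
print(f"Training accuracy: {gp_classifier.score(X_class, y_class):.2f}")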
We can plot the training data points and the predicted probability of belonging to class 1.
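As before, a minimal matplotlib sketch might look like the following (markers and colors are chosen to match the figure description below):
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 4))
# Training points plotted at their class label (0 or 1)
plt.scatter(X_class[y_class == 0].ravel(), y_class[y_class == 0],
            marker="o", color="red", label="Class 0")
plt.scatter(X_class[y_class == 1].ravel(), y_class[y_class == 1],
            marker="x", color="teal", label="Class 1")
# Predicted probability of class 1 over the test grid
plt.plot(X_test_class, y_prob_class[:, 1], color="blue", label="P(class 1)")
plt.axhline(0.5, color="gray", linestyle="--", linewidth=1)  # decision threshold
plt.xlabel("x")
plt.ylabel("Class label / probability of class 1")
plt.legend()
plt.show()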
GP classification results. Dots represent training data points (red circles for class 0, teal crosses for class 1). The blue line shows the predicted probability of belonging to class 1.
The GP classifier provides a smooth probability transition between classes. Regions where the probability is close to 0.5 indicate higher uncertainty about the classification.
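One way to make that uncertainty explicit is to flag test points whose predicted probability falls close to 0.5 (the 0.1 band below is an arbitrary illustrative threshold):
# Count test points where the classifier is uncertain (P(class 1) within 0.5 +/- 0.1)
uncertain = np.abs(y_prob_class[:, 1] - 0.5) < 0.1
print(f"{uncertain.sum()} of {len(X_test_class)} test points lie in the uncertain band")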
Before wrapping up, a few practical points are worth noting. The kernel encodes your assumptions about the function, so choose it to match them: if you expect periodic behavior, try an ExpSineSquared kernel; if you expect linear trends, add a DotProduct kernel. Combining kernels (+ for addition, * for multiplication) allows building complex priors. Hyperparameter optimization helps tune the chosen kernel structure, but the initial structure selection relies on domain knowledge or exploration.
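For example, a prior combining a periodic component with a linear trend and observation noise could be written as follows (a hypothetical kernel definition, not tuned for any particular dataset):
from sklearn.gaussian_process.kernels import ExpSineSquared, DotProduct, WhiteKernel

# Hypothetical composite prior: periodic component + linear trend + observation noise
composite_kernel = (
    ExpSineSquared(length_scale=1.0, periodicity=3.0)
    + DotProduct(sigma_0=1.0)
    + WhiteKernel(noise_level=0.1)
)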
Exact GP inference also becomes expensive as datasets grow (training cost scales cubically with the number of points), which is where sparse and variational approximations come in. Libraries such as GPflow (based on TensorFlow) or GPyTorch (based on PyTorch) are specifically designed with these scalable approximations in mind.
While scikit-learn automates hyperparameter fitting, understanding that it maximizes the (log) marginal likelihood is important. The n_restarts_optimizer parameter is valuable because the optimization landscape can have multiple local optima.
This practical session demonstrated how to implement GP regression and classification using scikit-learn. You've seen how to define kernels, fit models, make predictions with uncertainty, and visualize the results. Remember that while these examples used simple kernels and datasets, the power of GPs lies in their flexibility for more complex problems, especially when combined with thoughtful kernel engineering and, for larger datasets, appropriate approximation techniques.