While previous sections dealt with optimizing model parameters directly, often using gradients, another significant optimization challenge in machine learning is tuning hyperparameters. Think of learning rates, regularization strengths, or the number of layers in a neural network. Evaluating a single set of hyperparameters can be very expensive, involving training and validating a potentially complex model. Furthermore, the relationship between hyperparameters and model performance is typically a "black box", meaning we don't have an explicit analytical form, let alone gradients. The cost of brute-force approaches like grid search grows exponentially with the number of hyperparameters, and random search, while often better, can still be inefficient.
This is where Bayesian Optimization (BO) provides a more sample-efficient approach. It's particularly well-suited for optimizing expensive-to-evaluate, black-box functions – exactly the situation we face with hyperparameter tuning.
Instead of blindly evaluating hyperparameter combinations, Bayesian Optimization builds a probabilistic model of the objective function (e.g., validation accuracy as a function of hyperparameters). This model, often called a surrogate model, attempts to capture our current understanding of how different hyperparameter settings affect performance based on the points evaluated so far. Crucially, this model also quantifies its uncertainty about the function's behavior in unexplored regions of the hyperparameter space.
BO then uses an acquisition function, derived from the surrogate model, to decide which set of hyperparameters to evaluate next. The acquisition function balances two competing goals:

Exploitation: sampling where the surrogate model predicts high performance, to refine promising regions.

Exploration: sampling where the surrogate model is most uncertain, to learn more about unexplored regions of the search space.
By intelligently balancing exploration and exploitation, BO aims to find the optimal hyperparameters using significantly fewer evaluations than grid search or random search.
Let's break down the two main components:
The most common choice for the surrogate model is a Gaussian Process (GP). A GP is a powerful, non-parametric model that defines a distribution over functions. Instead of fitting a single function, a GP defines a prior belief about the function's properties (like smoothness) through its mean function and covariance function (also called the kernel).
Given a set of observed data points (X, y), where X represents the evaluated hyperparameter sets and y represents the corresponding performance scores, the GP can be updated using Bayes' theorem to form a posterior distribution. This posterior provides:

A mean prediction μ(x): the model's best estimate of the objective at any point x in the hyperparameter space.

An uncertainty estimate σ(x): how confident the model is in that prediction at x.
The uncertainty is typically lower near observed data points and higher in regions far from any observations. The choice of kernel (e.g., RBF, Matérn) encodes assumptions about the function being modeled and influences the GP's behavior.
A Gaussian Process fitted to four observed hyperparameter evaluations. The blue line shows the mean prediction of the objective function, and the shaded blue area indicates the uncertainty (e.g., 95% confidence interval). Uncertainty is lower near the observed points (black markers).
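As a concrete illustration of the surrogate, here is a minimal sketch of fitting a GP to a handful of evaluations with scikit-learn (assumed available); the single learning-rate hyperparameter and the accuracy scores are made-up values for demonstration.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# Four hypothetical evaluations: learning rate -> validation accuracy
X_observed = np.array([[0.001], [0.01], [0.05], [0.2]])
y_observed = np.array([0.71, 0.83, 0.79, 0.62])

# The Matern kernel encodes an assumption of moderate smoothness
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
gp.fit(X_observed, y_observed)

# Posterior mean and uncertainty over a grid of candidate learning rates
X_candidates = np.linspace(0.0005, 0.3, 200).reshape(-1, 1)
mu, sigma = gp.predict(X_candidates, return_std=True)
# sigma is small near the observed points and grows in unexplored regions
```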
The acquisition function uses the GP's predictions (μ(x)) and uncertainty (σ(x)) to quantify the "desirability" of evaluating any given point x next. Finding the point x that maximizes the acquisition function gives us the next candidate hyperparameters to evaluate on the actual expensive objective function. Optimizing the acquisition function itself is usually much cheaper than evaluating the objective function.
Common acquisition functions include:
Expected Improvement (EI): This is perhaps the most popular choice. It calculates the expected amount of improvement over the best performance f(x+) found so far. Let f(x) be the performance at point x. EI is defined as: EI(x) = E[max(0, f(x) − f(x+))]. Using the GP posterior f(x) ∼ N(μ(x), σ²(x)), this expectation can be computed analytically. EI naturally balances exploitation (points with high μ(x)) and exploration (points with high σ(x)). A small positive value ξ is sometimes added, EI(x) = E[max(0, f(x) − f(x+) − ξ)], to encourage more exploration (see the code sketch after this list).
Probability of Improvement (PI): This function calculates the probability that a point x will yield a performance better than the current best f(x+) (plus a small margin ξ): PI(x) = P(f(x) ≥ f(x+) + ξ) = Φ((μ(x) − f(x+) − ξ) / σ(x)), where Φ is the standard normal CDF. PI tends to be more exploitative than EI.
Upper Confidence Bound (UCB): This function directly balances the mean prediction and uncertainty using a trade-off parameter κ: UCB(x)=μ(x)+κσ(x) Higher values of κ favor exploration, while lower values favor exploitation. Theoretical guarantees often exist for UCB under certain conditions.
The choice of acquisition function can influence the search strategy and overall performance of BO.
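To make these formulas concrete, the sketch below computes EI, PI, and UCB from the GP's posterior mean μ(x) and standard deviation σ(x) using NumPy and SciPy; the function names and the default ξ and κ values are illustrative choices, not part of any particular library.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best, xi=0.01):
    """EI(x) = E[max(0, f(x) - f(x+) - xi)], analytic for a Gaussian posterior."""
    sigma = np.maximum(sigma, 1e-12)              # guard against division by zero
    z = (mu - f_best - xi) / sigma
    return (mu - f_best - xi) * norm.cdf(z) + sigma * norm.pdf(z)

def probability_of_improvement(mu, sigma, f_best, xi=0.01):
    """PI(x) = Phi((mu(x) - f(x+) - xi) / sigma(x))."""
    sigma = np.maximum(sigma, 1e-12)
    return norm.cdf((mu - f_best - xi) / sigma)

def upper_confidence_bound(mu, sigma, kappa=2.0):
    """UCB(x) = mu(x) + kappa * sigma(x); larger kappa favors exploration."""
    return mu + kappa * sigma

# Example: score the candidates from the GP sketch above and pick the best one
# acq = expected_improvement(mu, sigma, f_best=y_observed.max())
# x_next = X_candidates[np.argmax(acq)]
```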
The process unfolds iteratively:

1. Fit (or update) the GP surrogate using all hyperparameter sets evaluated so far and their scores.
2. Maximize the acquisition function over the search space to choose the next candidate x.
3. Evaluate the true objective at x (the expensive step: train and validate the model).
4. Add the new observation (x, f(x)) to the dataset and repeat until the evaluation budget is exhausted.
The final recommended hyperparameters are typically the set x that yielded the best observed performance f(x) during the search.
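Putting the pieces together, a minimal sketch of this loop (for maximization, over a fixed grid of cheap candidate points, using a UCB-style acquisition so the snippet stays self-contained) might look like the following; `objective`, the initial data, and the candidate grid are placeholders for your own training-and-validation setup.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def bayesian_optimization(objective, X_init, y_init, candidates, n_iter=20, kappa=2.0):
    """Maximize an expensive objective over `candidates` with a GP surrogate."""
    X, y = list(X_init), list(y_init)
    for _ in range(n_iter):
        # 1. Fit the surrogate to everything observed so far
        gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
        gp.fit(np.array(X), np.array(y))

        # 2. Maximize the acquisition function over the cheap candidate points
        mu, sigma = gp.predict(candidates, return_std=True)
        acq = mu + kappa * sigma  # UCB; swap in expected_improvement from above if preferred
        x_next = candidates[np.argmax(acq)]

        # 3. Run the expensive evaluation, then 4. record the result and repeat
        X.append(x_next)
        y.append(objective(x_next))

    best = int(np.argmax(y))
    return X[best], y[best]
```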
Advantages:

Sample efficiency: good hyperparameters are typically found with far fewer expensive evaluations than grid or random search.

Black-box friendly: no analytical form or gradients of the objective are required.

Principled use of uncertainty: the surrogate's uncertainty estimates guide the search toward informative evaluations.
Considerations:

Results depend on modeling choices such as the kernel and the acquisition function, as discussed above.

The surrogate adds overhead of its own; standard GPs scale poorly as the number of observations grows, though this is rarely limiting for typical hyperparameter budgets.

Plain BO is best suited to relatively low-dimensional, continuous search spaces; very high-dimensional, categorical, or conditional spaces usually need specialized extensions.
You don't need to implement BO from scratch. Several excellent libraries are available in Python:
Scikit-Optimize (skopt): Provides a user-friendly interface for BO (gp_minimize) and other optimization algorithms.

Integrating these libraries typically involves defining your objective function (which takes hyperparameters and returns a score to be maximized/minimized) and the search space for your hyperparameters. The library then handles the BO loop internally.
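For instance, a search over a learning rate and a layer count with gp_minimize might look roughly like the following; the search space and the train_and_validate helper are hypothetical stand-ins for your own model code.

```python
from skopt import gp_minimize
from skopt.space import Real, Integer

# Hypothetical search space: a log-uniform learning rate and a layer count
space = [
    Real(1e-5, 1e-1, prior="log-uniform", name="learning_rate"),
    Integer(1, 5, name="num_layers"),
]

def objective(params):
    learning_rate, num_layers = params
    # train_and_validate is a placeholder for your expensive evaluation;
    # gp_minimize minimizes, so return the negative validation accuracy.
    accuracy = train_and_validate(learning_rate, num_layers)
    return -accuracy

result = gp_minimize(objective, space, n_calls=30, random_state=0)
print("Best hyperparameters:", result.x, "Best accuracy:", -result.fun)
```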
Bayesian Optimization represents a significant step up from simpler hyperparameter search strategies, offering a powerful tool for efficiently tuning complex machine learning models when evaluations are costly.