While the optimization algorithms we've discussed so far rely extensively on gradient information, ∇f(x), and sometimes Hessian information, ∇²f(x), there are important scenarios in machine learning and related fields where this information is simply unavailable or impractical to obtain. This section provides an overview of Derivative-Free Optimization (DFO), also known as Black-Box Optimization (BBO), which tackles such problems using only function evaluations, f(x).
Imagine trying to tune the parameters of a complex simulation model, perhaps one that models climate change or financial markets. Each run of the simulation with a given parameter set yields an output (our objective function value), but calculating the gradient of this output with respect to the input parameters might be impossible or computationally prohibitive. Similarly, consider optimizing hyperparameters for a machine learning model where the objective is the validation accuracy after a full training run; the function relating hyperparameters to accuracy is expensive to evaluate, and its derivatives are not readily available.
These situations demand optimization techniques that explore the parameter space and make decisions based solely on the observed values of the objective function at various points.
The Core Idea: Learning from Function Values Alone
DFO methods operate under the constraint that they can query the objective function f at any point x to get the value f(x), but they cannot access ∇f(x) or ∇²f(x). The fundamental challenge is to intelligently choose the next point(s) to evaluate to make progress towards a minimum, using only the history of previously evaluated points and their corresponding function values.
Think of it like navigating a foggy landscape to find the lowest point. You only know your current altitude (f(x)) and the altitude at other spots you've visited. You must decide where to step next based on this limited information. Naturally, progress might be slower and less direct compared to having a gradient pointing downhill.
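To make the setting concrete, here is a minimal sketch (in Python, with illustrative names and parameter choices, not a method from the text) of perhaps the simplest possible DFO strategy: a random local search that only ever queries f(x) and keeps the best point seen.

```python
import numpy as np

def random_search(f, x0, n_iters=500, step=0.5, seed=0):
    """Propose Gaussian perturbations around the best point seen so far
    and keep any improvement. Only f(x) is ever queried -- no gradients."""
    rng = np.random.default_rng(seed)
    best_x = np.asarray(x0, dtype=float)
    best_f = f(best_x)
    for _ in range(n_iters):
        candidate = best_x + step * rng.standard_normal(best_x.shape)
        f_cand = f(candidate)          # the only information available
        if f_cand < best_f:            # greedy: accept any improvement
            best_x, best_f = candidate, f_cand
    return best_x, best_f

# Example: a simple quadratic standing in for an expensive black box.
x_opt, f_opt = random_search(lambda x: float(np.sum((x - 3.0) ** 2)),
                             x0=np.zeros(2))
```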
Categories of Derivative-Free Methods
DFO encompasses a diverse set of algorithms, often broadly categorized as follows:
1. Direct Search Methods
These methods maintain and update a set of points in the search space based on comparing their function values. They typically follow specific geometric rules for exploring the vicinity of the current best point.
- Nelder-Mead Method (Simplex Method): This classic heuristic algorithm maintains a simplex, which is a geometric shape defined by n+1 vertices in an n-dimensional space (e.g., a triangle in 2D, a tetrahedron in 3D). The algorithm iteratively modifies the simplex by replacing the vertex with the worst function value through operations like reflection, expansion, contraction, and shrinking, attempting to move the simplex towards lower function values. While widely used due to its simplicity, Nelder-Mead can sometimes stagnate or converge to non-stationary points, especially in higher dimensions. It lacks strong theoretical convergence guarantees for general functions. A short usage sketch follows this list.
- Pattern Search Methods (e.g., Generalized Pattern Search, GPS): These methods are more structured than Nelder-Mead. They explore points on a predefined mesh or pattern around the current best point. If a point with a lower function value is found during this "polling" step, the center of the pattern moves to that point (successful step). If no improvement is found, the mesh size is typically reduced (unsuccessful step), allowing for finer exploration later. Pattern search methods often have better theoretical convergence properties compared to Nelder-Mead, guaranteeing convergence to stationary points under certain conditions on the function and the polling strategy. A minimal compass-search sketch also appears after the list.
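In practice, Nelder-Mead is rarely implemented from scratch; SciPy's `scipy.optimize.minimize` exposes it via `method="Nelder-Mead"`. A brief usage sketch on the Rosenbrock function, a standard test problem chosen here purely for illustration:

```python
import numpy as np
from scipy.optimize import minimize

def rosenbrock(x):
    """Classic non-convex test function; derivatives exist but go unused."""
    return np.sum(100.0 * (x[1:] - x[:-1] ** 2) ** 2 + (1.0 - x[:-1]) ** 2)

result = minimize(
    rosenbrock,
    x0=np.array([-1.2, 1.0]),
    method="Nelder-Mead",          # simplex-based, derivative-free
    options={"xatol": 1e-8, "fatol": 1e-8, "maxiter": 2000},
)
print(result.x, result.fun)        # should land near [1, 1] with f ≈ 0
```

Pattern search is easiest to see in its simplest variant, compass (coordinate) search. The sketch below is a toy illustration under simplifying assumptions, not a full GPS implementation: it polls the 2n coordinate directions, moves on the first improvement, and halves the mesh otherwise.

```python
import numpy as np

def compass_search(f, x0, step=1.0, tol=1e-6, max_iters=10_000):
    """Toy pattern search: poll the 2n points x ± step * e_i.
    Move to an improving point; otherwise halve the mesh size."""
    x = np.asarray(x0, dtype=float)
    fx = f(x)
    n = x.size
    for _ in range(max_iters):
        improved = False
        for i in range(n):
            for sign in (+1.0, -1.0):
                trial = x.copy()
                trial[i] += sign * step
                f_trial = f(trial)
                if f_trial < fx:           # successful poll step
                    x, fx, improved = trial, f_trial, True
                    break
            if improved:
                break
        if not improved:
            step *= 0.5                    # unsuccessful step: refine mesh
            if step < tol:
                break
    return x, fx
```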
2. Model-Based Methods
Instead of directly manipulating a set of points geometrically, model-based methods build a local surrogate model (often quadratic) that approximates the true objective function f(x) around the current point or region. This model is differentiable and easy to optimize.
The general procedure, sketched in code after the following list, involves:
- Evaluate f(x) at a set of initial points.
- Fit a surrogate model (e.g., using interpolation or regression) to these points (xᵢ, f(xᵢ)).
- Find the minimum of the surrogate model, possibly within a "trust region" where the model is believed to be accurate.
- Evaluate the true function f(x) at the point suggested by the surrogate model minimum.
- Update the set of evaluated points and refine the surrogate model.
- Repeat.
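The following toy 1-D sketch (all names and the trust-region rule are illustrative assumptions, not a specific published method) shows one iteration of this loop, with a quadratic surrogate fitted by least squares and minimized inside a simple trust region:

```python
import numpy as np

def surrogate_step(f, xs, fs, radius=1.0):
    """One iteration of a toy 1-D model-based method: fit a quadratic
    surrogate to the points evaluated so far, minimize it within a
    trust region around the current best point, and evaluate f there."""
    xs, fs = np.asarray(xs, float), np.asarray(fs, float)
    a, b, c = np.polyfit(xs, fs, deg=2)         # quadratic surrogate
    x_best = xs[np.argmin(fs)]
    lo, hi = x_best - radius, x_best + radius   # trust-region bounds
    if a > 0:                                   # convex: clip the vertex
        x_new = float(np.clip(-b / (2 * a), lo, hi))
    else:                                       # concave: check both ends
        q = lambda x: a * x**2 + b * x + c
        x_new = lo if q(lo) < q(hi) else hi
    return x_new, f(x_new)                      # one new true evaluation

# Repeating this while appending (x_new, f_new) to (xs, fs) realizes the
# evaluate / fit / minimize / update loop described above.
```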
These methods aim to use the function evaluations more efficiently by building an understanding of the function's local shape. Techniques like Trust-Region Methods for DFO fall into this category. Bayesian Optimization, which we cover later, is a sophisticated probabilistic form of model-based optimization.
3. Stochastic and Heuristic Methods
This broad category includes methods that incorporate randomness in their search strategy. They are often inspired by natural phenomena and can be effective for global optimization or navigating very rugged, non-smooth landscapes.
- Simulated Annealing (SA): Inspired by the annealing process in metallurgy, SA starts exploring the space broadly and gradually focuses its search. It probabilistically accepts moves to points with higher function values, especially early on (high "temperature"), allowing it to escape local minima. As the process "cools," the probability of accepting worse moves decreases, making the search more greedy. A toy implementation follows this list.
- Evolutionary Algorithms (EAs), including Genetic Algorithms (GAs): These methods maintain a "population" of candidate solutions. In each generation, solutions are selected based on their "fitness" (function value). New solutions are created through "crossover" (combining parts of existing solutions) and "mutation" (random changes). Over time, the population tends to evolve towards better solutions. A minimal sketch of this recipe also appears after the list.
- Particle Swarm Optimization (PSO): This method simulates social behavior, like a flock of birds searching for food. A population of "particles" moves through the parameter space. Each particle's movement is influenced by its own best-found position and the best-found position discovered by the entire swarm. A corresponding sketch likewise follows.
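First, a toy simulated-annealing sketch (parameter values are illustrative assumptions) showing the probabilistic acceptance rule and a geometric cooling schedule:

```python
import numpy as np

def simulated_annealing(f, x0, n_iters=5000, temp0=1.0, cooling=0.999,
                        step=0.1, seed=0):
    """Toy SA: always accept improvements; accept worse moves with
    probability exp(-delta / T), where T decays geometrically."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    fx = f(x)
    best_x, best_f = x, fx
    T = temp0
    for _ in range(n_iters):
        candidate = x + step * rng.standard_normal(x.shape)
        f_cand = f(candidate)
        delta = f_cand - fx
        if delta < 0 or rng.random() < np.exp(-delta / T):
            x, fx = candidate, f_cand          # move (possibly uphill)
            if fx < best_f:
                best_x, best_f = x, fx         # track the best ever seen
        T *= cooling                           # geometric cooling schedule
    return best_x, best_f
```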
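Next, a minimal real-valued genetic algorithm, again under illustrative design choices (truncation selection keeping the best half, uniform crossover, Gaussian mutation):

```python
import numpy as np

def genetic_algorithm(f, dim, pop_size=40, n_gens=100, mut_scale=0.1, seed=0):
    """Toy real-valued GA: truncation selection, uniform crossover,
    Gaussian mutation. Lower f(x) means higher fitness."""
    rng = np.random.default_rng(seed)
    pop = rng.uniform(-5.0, 5.0, size=(pop_size, dim))   # initial population
    for _ in range(n_gens):
        fitness = np.array([f(ind) for ind in pop])
        parents = pop[np.argsort(fitness)[: pop_size // 2]]  # keep best half
        children = []
        for _ in range(pop_size - len(parents)):
            a, b = parents[rng.integers(len(parents), size=2)]
            mask = rng.random(dim) < 0.5                 # uniform crossover
            child = np.where(mask, a, b)
            child = child + mut_scale * rng.standard_normal(dim)  # mutation
            children.append(child)
        pop = np.vstack([parents, np.array(children)])
    fitness = np.array([f(ind) for ind in pop])
    return pop[np.argmin(fitness)], float(fitness.min())
```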
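Finally, a toy PSO sketch showing the standard velocity update combining inertia, attraction to each particle's own best position, and attraction to the swarm-wide best (the coefficient values w, c1, c2 are common illustrative defaults, not prescriptions):

```python
import numpy as np

def particle_swarm(f, dim, n_particles=30, n_iters=200,
                   w=0.7, c1=1.5, c2=1.5, seed=0):
    """Toy PSO: velocity = inertia + cognitive pull + social pull."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-5.0, 5.0, size=(n_particles, dim))
    v = np.zeros_like(x)
    p_best = x.copy()                          # each particle's best position
    p_best_f = np.array([f(xi) for xi in x])
    g_best = p_best[np.argmin(p_best_f)]       # swarm-wide best position
    for _ in range(n_iters):
        r1, r2 = rng.random((2, n_particles, dim))
        v = w * v + c1 * r1 * (p_best - x) + c2 * r2 * (g_best - x)
        x = x + v
        fx = np.array([f(xi) for xi in x])
        improved = fx < p_best_f
        p_best[improved], p_best_f[improved] = x[improved], fx[improved]
        g_best = p_best[np.argmin(p_best_f)]
    return g_best, float(p_best_f.min())
```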
Stochastic methods can be powerful but often lack formal convergence guarantees, may require many function evaluations, and their performance can be sensitive to the tuning of algorithm-specific parameters (e.g., population size, mutation rate, cooling schedule).
When to Use Derivative-Free Optimization
DFO methods are most appropriate when:
- Gradients are unavailable: The objective function comes from a black-box simulation, an external system, or a legacy code base where differentiation is impossible.
- Gradients are unreliable or noisy: The function evaluations themselves have significant noise (e.g., from Monte Carlo simulations or physical experiments), making finite-difference approximations of gradients inaccurate (a numerical illustration appears after this list).
- Gradient computation is prohibitively expensive: Even if theoretically possible, computing the gradient takes far more resources than evaluating the function itself.
- The objective function is non-smooth or discontinuous: While subgradient methods exist, DFO can sometimes handle functions with jumps or kinks more directly, although convergence guarantees may be weaker.
- Dealing with discrete or integer parameters: Some DFO methods can be adapted to handle non-continuous parameter spaces.
- Hyperparameter optimization: As mentioned, finding optimal hyperparameters for ML models is a common application where the objective function evaluation is costly and gradients are usually not available.
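To see why noise wrecks finite differences (the second bullet above): a forward difference divides evaluation noise of size σ by the step h, inflating it by a factor of 1/h. A quick numerical sketch with illustrative values of σ and h:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, h = 0.01, 1e-4          # evaluation noise level vs. difference step

def noisy_f(x):
    """True objective x**2, observed through additive Gaussian noise."""
    return x**2 + sigma * rng.standard_normal()

x = 1.0                        # true derivative at x = 1 is exactly 2
estimates = [(noisy_f(x + h) - noisy_f(x)) / h for _ in range(5)]
# Each estimate carries noise of order sigma / h = 100, so the values
# scatter over roughly [-300, 300] around the true derivative 2.
print(estimates)
```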
Limitations and Considerations
While valuable, DFO methods come with significant caveats:
- Scalability (Curse of Dimensionality): Performance degrades rapidly as the number of parameters (n) increases. Direct search methods become exponentially slow, and model-based methods require a large number of points (often scaling quadratically with n or worse) to build accurate models. DFO is typically feasible only for problems with a relatively small number of parameters (up to tens or maybe a few hundred, depending on the method and problem structure).
- Convergence Rate: When gradients are available and the function is reasonably well-behaved, gradient-based methods are almost always significantly faster and more efficient than DFO methods. DFO methods typically exhibit linear or sublinear convergence at best, often much slower in practice.
- Tuning: Many DFO algorithms, especially stochastic ones, have several internal parameters that need careful tuning for good performance.
- Local vs. Global Optimization: Many direct search and model-based methods are inherently local optimizers, similar to gradient descent. Stochastic methods may offer better global exploration capabilities but often at the cost of slower convergence near optima.
In summary, Derivative-Free Optimization provides a necessary set of tools for problems where gradient information cannot be exploited. Understanding the different approaches (direct search, model-based, and stochastic methods) and their respective strengths and weaknesses allows you to select an appropriate technique when faced with black-box optimization challenges, particularly in areas like simulation-based optimization and hyperparameter tuning. However, always be mindful of their limitations regarding scalability and convergence speed compared to gradient-based alternatives.