While much of machine learning optimization focuses on minimizing a loss function without explicit restrictions on the parameters (beyond perhaps implicit ones from regularization), many practical scenarios involve constraints. These are conditions that the solution must satisfy. Ignoring constraints can lead to solutions that are impractical, invalid, or fail to meet specific requirements of the problem.
This section introduces the fundamental concepts of constrained optimization, setting the stage for the techniques discussed later in this chapter.
Why Constraints?
Constraints arise naturally in various machine learning contexts:
- Resource Limitations: allocating a fixed budget across different features or models, or assigning tasks to workers with limited capacity.
- Physical or Logical Requirements: Ensuring probabilities sum to 1, keeping parameters within physically meaningful bounds (e.g., positive variances), or enforcing specific relationships between variables.
- Fairness and Bias Mitigation: Imposing constraints on model predictions to ensure parity across different demographic groups.
- Interpretability and Sparsity: While often handled via L1 regularization, sparsity can be framed as a constraint on the number of non-zero parameters (though this specific constraint is non-convex and computationally hard). Relatedly, LASSO regression minimizes squared error subject to an L1-norm constraint on the weights.
- Safety Guarantees: In reinforcement learning or control systems, ensuring actions stay within safe operating limits.
Mathematical Formulation
A general constrained optimization problem can be written in the following standard form:
$$
\begin{aligned}
\min_{x \in \mathbb{R}^n} \quad & f(x) \\
\text{subject to} \quad & g_i(x) \le 0, \quad i = 1, \dots, m, \\
& h_j(x) = 0, \quad j = 1, \dots, p.
\end{aligned}
$$
Let's break down the components:
- $x \in \mathbb{R}^n$: This is the vector of decision variables we want to optimize. In machine learning, $x$ typically represents the model parameters (weights, biases).
- $f: \mathbb{R}^n \to \mathbb{R}$: This is the objective function that we aim to minimize. Usually, this is the loss function (e.g., mean squared error, cross-entropy).
- $g_i(x) \le 0$: These are the $m$ inequality constraints. Each $g_i$ is a function $g_i: \mathbb{R}^n \to \mathbb{R}$. These constraints define conditions that must be met in the "less than or equal to" sense. Note that a constraint like $a(x) \ge b$ can be rewritten as $b - a(x) \le 0$ (see the sketch after this list).
- $h_j(x) = 0$: These are the $p$ equality constraints. Each $h_j$ is a function $h_j: \mathbb{R}^n \to \mathbb{R}$. These specify exact conditions that must hold.
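To make the conventions concrete, here is a minimal sketch in Python encoding one toy problem in this standard form. The objective and constraint functions are hypothetical choices for illustration, not taken from the text; the second constraint shows the rewrite of a "$\ge$" requirement into "$\le 0$" form.

```python
import numpy as np

# A hypothetical toy instance of the standard form, for illustration only.

def f(x):
    """Objective f(x) = ||x||^2 (an arbitrary smooth choice)."""
    return float(np.dot(x, x))

def g1(x):
    """Inequality constraint x1 + x2 <= 1, written as g1(x) = x1 + x2 - 1 <= 0."""
    return x[0] + x[1] - 1.0

def g2(x):
    """The requirement x1 >= 0 rewritten in '<= 0' form: -x1 <= 0."""
    return -x[0]

def h1(x):
    """Equality constraint x1 - x2 = 0."""
    return x[0] - x[1]
```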
The set of all points $x$ that satisfy all the inequality and equality constraints is called the feasible region, often denoted by $\mathcal{F}$:

$$
\mathcal{F} = \{ x \in \mathbb{R}^n \mid g_i(x) \le 0 \text{ for all } i = 1, \dots, m, \text{ and } h_j(x) = 0 \text{ for all } j = 1, \dots, p \}
$$

The goal of constrained optimization is to find a point $x^* \in \mathcal{F}$ such that $f(x^*) \le f(x)$ for all $x \in \mathcal{F}$.
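Checking membership in $\mathcal{F}$ is mechanical: evaluate every constraint function at $x$. A minimal sketch, assuming a small numerical tolerance for the equality constraints (the function and variable names here are illustrative):

```python
import numpy as np

def is_feasible(x, ineqs, eqs, tol=1e-8):
    """True if g(x) <= 0 for every g in ineqs and |h(x)| <= tol for every h in eqs."""
    return (all(g(x) <= tol for g in ineqs)
            and all(abs(h(x)) <= tol for h in eqs))

# Toy constraint set: x1 + x2 <= 1, x1 >= 0, x2 >= 0, each in g(x) <= 0 form.
ineqs = [lambda x: x[0] + x[1] - 1.0,
         lambda x: -x[0],
         lambda x: -x[1]]

print(is_feasible(np.array([0.5, 0.5]), ineqs, eqs=[]))  # True
print(is_feasible(np.array([2.0, 2.0]), ineqs, eqs=[]))  # False: violates x1 + x2 <= 1
```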
The Feasible Region
The nature of the feasible region is critically important. If the constraint functions $g_i$ and $h_j$ are simple (e.g., linear), the feasible region might be a geometrically simple shape like a polygon or polyhedron. If they are complex and non-linear, the feasible region can have a complicated structure.
Consider minimizing $f(x_1, x_2) = (x_1 - 2)^2 + (x_2 - 2)^2$ (whose unconstrained minimum is at $(2, 2)$) subject to $x_1 + x_2 \le 1$, $x_1 \ge 0$, and $x_2 \ge 0$. The feasible region is a triangle.
Figure: The feasible region (shaded gray) is defined by the constraints $x_1 \ge 0$, $x_2 \ge 0$, and $x_1 + x_2 \le 1$. The blue contours represent the objective function $f(x_1, x_2) = (x_1 - 2)^2 + (x_2 - 2)^2$. The unconstrained minimum at $(2, 2)$ (red 'x') lies outside the feasible region. The constrained minimum occurs at $(0.5, 0.5)$ (green star), on the boundary of the feasible region.
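This example is easy to verify numerically. The sketch below uses `scipy.optimize.minimize` with the SLSQP method, one standard solver for smooth constrained problems. Note that SciPy's convention for inequality constraints is $c(x) \ge 0$, the opposite sign of the $g_i(x) \le 0$ form above.

```python
import numpy as np
from scipy.optimize import minimize

def f(x):
    return (x[0] - 2)**2 + (x[1] - 2)**2

# SciPy expects inequality constraints as c(x) >= 0,
# so x1 + x2 <= 1 becomes 1 - x1 - x2 >= 0.
constraints = [{"type": "ineq", "fun": lambda x: 1.0 - x[0] - x[1]}]
bounds = [(0, None), (0, None)]  # x1 >= 0, x2 >= 0

res = minimize(f, x0=np.array([0.0, 0.0]), method="SLSQP",
               bounds=bounds, constraints=constraints)
print(res.x)            # approx. [0.5, 0.5]
print(2 * (res.x - 2))  # gradient of f at the solution: approx. [-3, -3], nonzero
```

The nonzero gradient at the solution previews the point made below: at a constrained minimum on the boundary, $\nabla f$ need not vanish.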
Challenges Introduced by Constraints
Constraints fundamentally change the optimization process:
- Optimality Conditions: In unconstrained optimization, a necessary condition for a local minimum of a differentiable function is a vanishing gradient, $\nabla f(x) = 0$. This condition is no longer necessary for constrained problems: as seen in the plot above, the constrained minimum $(0.5, 0.5)$ occurs where the gradient is clearly non-zero, because the optimal point may lie on the boundary of the feasible region.
- Maintaining Feasibility: Standard iterative methods like gradient descent might generate steps that take the parameters outside the feasible region. Algorithms must incorporate mechanisms to handle constraints, either by projecting iterates back onto the feasible set (as sketched below) or by modifying the search direction.
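As an illustration of the projection approach, here is a minimal sketch of projected gradient descent on the triangle example. The projection is exact for this particular feasible set under the following reasoning: clipping to the nonnegative orthant is the projection whenever the clipped point already satisfies the sum constraint, and otherwise the projection lies on the simplex $\{x \ge 0, \sum_i x_i = 1\}$, where the standard sort-based simplex projection applies. This is a sketch for this specific set, not a general-purpose routine.

```python
import numpy as np

def project_feasible(x):
    """Euclidean projection onto {x : x >= 0, x1 + ... + xn <= 1}."""
    y = np.maximum(x, 0.0)
    if y.sum() <= 1.0:
        return y                              # clipped point is already feasible
    # Otherwise the projection lies on the simplex sum(x) = 1, x >= 0:
    u = np.sort(x)[::-1]                      # coordinates sorted descending
    css = np.cumsum(u)
    ks = np.arange(1, x.size + 1)
    rho = np.nonzero(u * ks > css - 1.0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)    # shift that lands on the simplex
    return np.maximum(x - theta, 0.0)

def projected_gradient_descent(grad, x0, lr=0.1, steps=200):
    """Plain gradient step followed by projection back onto the feasible set."""
    x = project_feasible(np.asarray(x0, dtype=float))
    for _ in range(steps):
        x = project_feasible(x - lr * grad(x))
    return x

# Triangle example: f(x) = (x1 - 2)^2 + (x2 - 2)^2, so grad f(x) = 2(x - 2).
grad_f = lambda x: 2.0 * (x - 2.0)
print(projected_gradient_descent(grad_f, np.array([0.0, 0.0])))  # ~ [0.5, 0.5]
```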
Understanding this basic setup is essential before exploring methods designed to solve these problems, such as those based on Lagrange multipliers (covered next) or projection techniques.