While the previous section introduced the theoretical underpinnings of constrained optimization using Lagrange multipliers and KKT conditions, we now turn our attention to a practical and often intuitive algorithm: Projected Gradient Descent (PGD). This method extends the familiar gradient descent algorithm to handle problems where the solution must lie within a specific feasible set, C.
The core idea is remarkably simple: perform a standard gradient descent update, and if this step takes you outside the feasible set C, project the resulting point back onto the set. This approach is particularly effective when the projection onto C can be computed efficiently.
Recall the general form of a constrained optimization problem that PGD addresses: $$\min_x f(x) \quad \text{subject to} \quad x \in C$$ Here, $f(x)$ is the objective function we want to minimize (often a loss function in machine learning), and $C$ represents the feasible set defined by the constraints on the parameters $x$. PGD assumes that $C$ is a non-empty, closed, and convex set.
Central to PGD is the concept of the projection operator, denoted as $\Pi_C$. For any point $y$ in the parameter space, its projection onto the set $C$ is defined as the point in $C$ that is closest to $y$ in terms of Euclidean distance: $$\Pi_C(y) = \arg\min_{x \in C} \|x - y\|_2^2$$ Intuitively, $\Pi_C(y)$ finds the "best approximation" of $y$ within the allowed region $C$.
The feasibility and efficiency of PGD hinge entirely on how easily this projection can be computed. Fortunately, for many common constraint sets encountered in machine learning, the projection has a simple closed-form solution or can be computed efficiently.
Box Constraints: If the feasible set is defined by element-wise lower and upper bounds, $C = \{x \mid l_i \le x_i \le u_i \text{ for all } i\}$, the projection is straightforward. For a point $y$, the projection $x = \Pi_C(y)$ is computed by clipping each component: $$x_i = \max(l_i, \min(y_i, u_i))$$ A common special case is non-negativity constraints ($l_i = 0$, $u_i = \infty$), where $x_i = \max(0, y_i)$.
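The element-wise clipping rule above is a one-liner with NumPy. The function name `project_box` is an illustrative choice, not from the text:

```python
import numpy as np

def project_box(y, lo, hi):
    """Project y onto the box {x : lo <= x <= hi}, element-wise."""
    return np.minimum(np.maximum(y, lo), hi)

# Components below lo are raised, components above hi are lowered,
# and components already inside the box are left unchanged.
x = project_box(np.array([-1.5, 0.2, 3.0]), 0.0, 1.0)
```

`np.clip(y, lo, hi)` computes the same thing; the explicit min/max form mirrors the formula in the text.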
Euclidean Ball (L2 Norm Ball): If the feasible set is a ball of radius $R$ centered at the origin, $C = \{x \mid \|x\|_2 \le R\}$, the projection is: $$\Pi_C(y) = \begin{cases} y & \text{if } \|y\|_2 \le R \\ R \, \dfrac{y}{\|y\|_2} & \text{if } \|y\|_2 > R \end{cases}$$ If the point $y$ is already inside or on the boundary of the ball, it remains unchanged. If it's outside, it gets scaled down radially to lie on the boundary.
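The radial scaling rule translates directly into code. A minimal sketch (the name `project_l2_ball` is illustrative):

```python
import numpy as np

def project_l2_ball(y, R):
    """Project y onto the Euclidean ball {x : ||x||_2 <= R}."""
    norm = np.linalg.norm(y)
    if norm <= R:
        return y              # already feasible: leave unchanged
    return (R / norm) * y     # outside: rescale onto the boundary

# A point of norm 5 projected onto the unit ball lands on the boundary.
x = project_l2_ball(np.array([3.0, 4.0]), 1.0)
```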
Probability Simplex: If the feasible set requires parameters to be non-negative and sum to one, $C = \{x \mid \sum_i x_i = 1,\ x_i \ge 0\}$, the projection is more involved, but efficient algorithms exist, often based on sorting and finding an appropriate threshold.
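One common sort-and-threshold scheme can be sketched as follows: sort the entries in descending order, find the largest index whose running average still admits a positive shift, then subtract that threshold and clip at zero. The function name `project_simplex` is an illustrative choice:

```python
import numpy as np

def project_simplex(y):
    """Project y onto the probability simplex {x : sum(x) = 1, x >= 0}
    via sorting and thresholding."""
    n = len(y)
    u = np.sort(y)[::-1]                 # entries in descending order
    css = np.cumsum(u)                   # running sums of sorted entries
    # Largest index rho where the shifted entry stays positive:
    rho = np.nonzero(u + (1.0 - css) / (np.arange(n) + 1) > 0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1) # threshold to subtract
    return np.maximum(y - theta, 0.0)
```

For example, projecting `[2.0, 1.0]` yields `[1.0, 0.0]`: all mass shifts to the largest coordinate once the threshold is subtracted.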
If the projection onto C is computationally expensive (e.g., requires solving another optimization problem), then PGD might not be the most practical choice.
The PGD algorithm iteratively updates the parameters x using two steps:
Gradient Step: Compute an intermediate point $y_{k+1}$ by taking a step in the negative gradient direction, just like standard gradient descent: $$y_{k+1} = x_k - \eta_k \nabla f(x_k)$$ Here, $x_k$ is the current parameter vector, $\nabla f(x_k)$ is the gradient of the objective function at $x_k$, and $\eta_k$ is the learning rate at iteration $k$.
Projection Step: Project the intermediate point $y_{k+1}$ back onto the feasible set $C$ to obtain the next iterate: $$x_{k+1} = \Pi_C(y_{k+1})$$
This process is repeated until some convergence criterion is met (e.g., the change in x or f(x) is small, or a maximum number of iterations is reached).
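A single iteration of the two-step update can be sketched concretely. The objective $f(x) = \tfrac{1}{2}\|x - b\|^2$, the target `b`, and the step size are illustrative choices, not from the text:

```python
import numpy as np

# One PGD iteration for f(x) = 0.5 * ||x - b||^2 over the box [0, 1]^2.
b = np.array([1.5, -0.5])     # illustrative target outside the box
x_k = np.array([0.5, 0.5])    # current feasible iterate
eta = 1.0                     # illustrative step size

grad = x_k - b                       # gradient of f at x_k
y_next = x_k - eta * grad            # gradient step: lands outside C
x_next = np.clip(y_next, 0.0, 1.0)   # projection step: back onto C
```

Here the gradient step moves to `[1.5, -0.5]`, which violates both box bounds, and the projection clips it to `[1.0, 0.0]` on the boundary of $C$.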
Visualization of a single PGD step. The gray shaded area represents the feasible set C. Starting from xk, a gradient step leads to yk+1 outside C. The projection ΠC maps yk+1 to the nearest point xk+1 within C.
Choosing the learning rate $\eta_k$ follows similar principles as in unconstrained gradient descent. It can be fixed, follow a decay schedule (e.g., $\eta_k = \eta_0 / k$), or be determined by a line search method adapted for the projection step. A common strategy is backtracking line search, where $\eta$ is reduced until the projected step satisfies certain conditions (e.g., sufficient decrease in $f(x)$).
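A backtracking step for PGD can be sketched as follows, using one common sufficient-decrease test (accept when $f$ at the projected point lies below a quadratic model of the step). The function name and the specific acceptance condition are illustrative assumptions:

```python
import numpy as np

def pgd_backtracking_step(f, grad_f, project, x, eta0=1.0, beta=0.5, max_tries=30):
    """One PGD step: shrink eta until the projected point passes a
    sufficient-decrease test, then return it with the accepted eta."""
    g = grad_f(x)
    eta = eta0
    for _ in range(max_tries):
        x_new = project(x - eta * g)   # projected gradient step
        d = x_new - x                  # actual (projected) displacement
        # Accept if f(x_new) is below the local quadratic model:
        if f(x_new) <= f(x) + g @ d + (0.5 / eta) * (d @ d):
            return x_new, eta
        eta *= beta                    # otherwise shrink eta and retry
    return x_new, eta
```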
When $f$ is convex and its gradient is Lipschitz continuous, and the feasible set $C$ is closed and convex, PGD is guaranteed to converge to an optimal solution $x^* \in C$. The convergence rate is typically $O(1/k)$, similar to standard gradient descent. If $f$ is also strongly convex, linear convergence ($O(\rho^k)$ for some $\rho < 1$) can be achieved.
For non-convex functions f, PGD generally converges to a stationary point that satisfies first-order necessary conditions (related to KKT conditions), but global optimality is not guaranteed.
PGD finds use in various machine learning scenarios involving constraints, such as non-negative least squares and matrix factorization (non-negativity constraints), learning mixture weights or attention distributions (simplex constraints), and crafting adversarial examples within a norm ball around an input.
Implementing PGD requires two ingredients: a routine to evaluate the gradient $\nabla f(x)$, and an efficient implementation of the projection operator $\Pi_C$.
Pseudo-code for Projected Gradient Descent:
Initialize x_0 within C
Set learning rate strategy (e.g., fixed η or schedule)
k = 0
while not converged:
    # Compute gradient
    g_k = ∇f(x_k)
    # Gradient step
    y_{k+1} = x_k - η_k * g_k
    # Projection step
    x_{k+1} = Π_C(y_{k+1})
    # Update iteration count
    k = k + 1
Return x_k
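A runnable version of this loop is a minimal sketch in Python; the function name, the stopping test on the change in $x$, and the example objective are illustrative choices:

```python
import numpy as np

def projected_gradient_descent(grad_f, project, x0, eta=0.1, max_iter=500, tol=1e-8):
    """Minimal PGD loop: gradient step, then projection onto C,
    stopping when the iterate barely moves."""
    x = x0
    for _ in range(max_iter):
        y = x - eta * grad_f(x)   # gradient step (may leave C)
        x_new = project(y)        # projection step back onto C
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# Example: minimize f(x) = 0.5 * ||x - b||^2 over the box [0, 1]^2.
# Here b is an illustrative target outside the box.
b = np.array([1.5, -0.5])
x_star = projected_gradient_descent(
    grad_f=lambda x: x - b,
    project=lambda y: np.clip(y, 0.0, 1.0),
    x0=np.zeros(2),
)
```

For this separable quadratic the constrained minimizer is simply $b$ clipped into the box, `[1.0, 0.0]`, which the loop reaches well within the iteration budget.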
The primary limitation of PGD is its reliance on an efficient projection operator. If C is defined by complex, coupled constraints (e.g., general linear or non-linear equalities/inequalities), computing the projection can be as hard as solving the original problem. In such cases, other methods like augmented Lagrangian or interior-point methods might be more suitable.
Furthermore, PGD can sometimes exhibit slow convergence, especially if the unconstrained gradient steps frequently point far away from the feasible set, causing the projection step to significantly shorten the effective step length. This can lead to "zig-zagging" behavior along the boundary of C.
Despite these limitations, Projected Gradient Descent remains a valuable and widely used tool for optimization problems where parameters are subject to simple constraints, offering a natural extension of gradient descent that explicitly maintains feasibility throughout the optimization process.
© 2025 ApX Machine Learning