Projected Gradient Descent (PGD) is a practical and often intuitive algorithm for constrained optimization. This method extends the familiar gradient descent algorithm to handle problems where the solution must lie within a specific feasible set $\mathcal{C}$.
The core idea is remarkably simple: perform a standard gradient descent update, and if this step takes you outside the feasible set $\mathcal{C}$, project the resulting point back onto the set. This approach is particularly effective when the projection onto $\mathcal{C}$ can be computed efficiently.
Recall the general form of a constrained optimization problem that PGD addresses:
$$\min_{x \in \mathcal{C}} f(x)$$
Here, $f(x)$ is the objective function we want to minimize (often a loss function in machine learning), and $\mathcal{C}$ represents the feasible set defined by the constraints on the parameters $x$. PGD assumes that $\mathcal{C}$ is a non-empty, closed, and convex set.
Central to PGD is the concept of the projection operator, denoted as $\Pi_{\mathcal{C}}$. For any point $y$ in the parameter space, its projection onto the set $\mathcal{C}$ is defined as the point in $\mathcal{C}$ that is closest to $y$ in terms of Euclidean distance:
$$\Pi_{\mathcal{C}}(y) = \arg\min_{x \in \mathcal{C}} \|x - y\|_2$$
Intuitively, $\Pi_{\mathcal{C}}(y)$ finds the "best approximation" of $y$ within the allowed region $\mathcal{C}$.
The feasibility and efficiency of PGD hinge entirely on how easily this projection can be computed. Fortunately, for many common constraint sets encountered in machine learning, the projection has a simple closed-form solution or can be computed efficiently.
Box Constraints: If the feasible set is defined by element-wise lower and upper bounds, $\mathcal{C} = \{x : l_i \le x_i \le u_i \text{ for all } i\}$, the projection is straightforward. For a point $y$, the projection is computed by clipping each component:
$$[\Pi_{\mathcal{C}}(y)]_i = \min(\max(y_i, l_i), u_i)$$
A common special case is non-negativity constraints ($x_i \ge 0$), where $[\Pi_{\mathcal{C}}(y)]_i = \max(y_i, 0)$.
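In NumPy, this clipping projection is a one-liner. The sketch below (function name `project_box` is illustrative, not from the text) projects a point onto the box $[0, 1]^3$:

```python
import numpy as np

def project_box(y, lower, upper):
    """Project y onto the box {x : lower <= x <= upper} by element-wise clipping."""
    return np.clip(y, lower, upper)

# Components below the lower bound are raised to it, components above
# the upper bound are lowered to it, and interior components are unchanged.
y = np.array([-0.5, 0.3, 1.7])
x = project_box(y, 0.0, 1.0)   # components become 0.0, 0.3, 1.0
```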
Euclidean Ball (L2 Norm Ball): If the feasible set is a ball of radius $r$ centered at the origin, $\mathcal{C} = \{x : \|x\|_2 \le r\}$, the projection is:
$$\Pi_{\mathcal{C}}(y) = \begin{cases} y & \text{if } \|y\|_2 \le r \\ \dfrac{r}{\|y\|_2}\, y & \text{if } \|y\|_2 > r \end{cases}$$
If the point is already inside or on the boundary of the ball, it remains unchanged. If it is outside, it gets scaled down radially to lie on the boundary.
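A minimal sketch of this radial rescaling (the helper name `project_l2_ball` is illustrative):

```python
import numpy as np

def project_l2_ball(y, radius=1.0):
    """Project y onto the Euclidean ball of the given radius centered at 0."""
    norm = np.linalg.norm(y)
    if norm <= radius:
        return y                   # already feasible: leave unchanged
    return (radius / norm) * y     # rescale radially onto the boundary

y = np.array([3.0, 4.0])           # norm 5, outside the unit ball
x = project_l2_ball(y)             # rescaled to [0.6, 0.8], which has norm 1
```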
Probability Simplex: If the feasible set requires parameters to be non-negative and sum to one, $\mathcal{C} = \{x : x_i \ge 0, \sum_i x_i = 1\}$, the projection is more involved, but efficient algorithms exist, often based on sorting and finding an appropriate threshold.
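One such sort-and-threshold method can be sketched as follows: sort the components in decreasing order, find the largest prefix whose running average leaves a positive residual, and subtract the resulting threshold before clipping at zero. This runs in $O(n \log n)$ time (the function name `project_simplex` is illustrative):

```python
import numpy as np

def project_simplex(y):
    """Project y onto the probability simplex {x : x >= 0, sum(x) = 1}."""
    u = np.sort(y)[::-1]                     # components in decreasing order
    css = np.cumsum(u)
    # Largest index j (0-based) with u_j - (sum of first j+1 terms - 1)/(j+1) > 0
    rho = np.nonzero(u - (css - 1) / np.arange(1, len(y) + 1) > 0)[0][-1]
    theta = (css[rho] - 1) / (rho + 1)       # threshold to subtract
    return np.maximum(y - theta, 0.0)

x = project_simplex(np.array([0.5, 1.2, -0.3]))
# x is non-negative and its entries sum to 1
```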
If the projection onto $\mathcal{C}$ is computationally expensive (e.g., requires solving another optimization problem), then PGD might not be the most practical choice.
The PGD algorithm iteratively updates the parameters using two steps:
Gradient Step: Compute an intermediate point $y_{k+1}$ by taking a step in the negative gradient direction, just like standard gradient descent:
$$y_{k+1} = x_k - \eta_k \nabla f(x_k)$$
Here, $x_k$ is the current parameter vector, $\nabla f(x_k)$ is the gradient of the objective function at $x_k$, and $\eta_k$ is the learning rate at iteration $k$.
Projection Step: Project the intermediate point $y_{k+1}$ back onto the feasible set $\mathcal{C}$ to obtain the next iterate $x_{k+1}$:
$$x_{k+1} = \Pi_{\mathcal{C}}(y_{k+1})$$
This process is repeated until some convergence criterion is met (e.g., the change in $x_k$ or $f(x_k)$ is small, or a maximum number of iterations is reached).
Visualization of a single PGD step. The gray shaded area represents the feasible set $\mathcal{C}$. Starting from $x_k$, a gradient step leads to $y_{k+1}$ outside $\mathcal{C}$. The projection maps $y_{k+1}$ to $x_{k+1}$, the nearest point within $\mathcal{C}$.
Choosing the learning rate $\eta_k$ follows similar principles as in unconstrained gradient descent. It can be fixed, follow a decay schedule (e.g., $\eta_k = \eta_0 / \sqrt{k}$), or be determined by a line search method adapted for the projection step. A common strategy is backtracking line search, where $\eta_k$ is reduced until the projected step satisfies certain conditions (e.g., sufficient decrease in $f$).
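A minimal sketch of such a backtracking step, assuming $f$ is smooth and using the quadratic upper-bound test common in projected/proximal gradient methods (the function name and the specific condition are one standard variant, not prescribed by the text):

```python
import numpy as np

def backtracking_pgd_step(f, grad_f, project, x, eta0=1.0, beta=0.5, max_tries=30):
    """One PGD step; shrink eta until the projected point passes a
    sufficient-decrease test for the step from x to its projection."""
    g = grad_f(x)
    eta = eta0
    for _ in range(max_tries):
        x_new = project(x - eta * g)   # gradient step followed by projection
        d = x_new - x
        # Accept when f(x_new) is below the quadratic model at x_new
        if f(x_new) <= f(x) + g @ d + (0.5 / eta) * (d @ d):
            return x_new, eta
        eta *= beta                    # otherwise shrink the step and retry
    return x_new, eta
```

For example, minimizing $f(x) = \|x\|_2^2$ over the box $[1, 2]^2$ from the corner $(2, 2)$ accepts the projected point $(1, 1)$ after one halving of the step size.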
When $f$ is convex and its gradient is Lipschitz continuous, and the feasible set $\mathcal{C}$ is closed and convex, PGD is guaranteed to converge to an optimal solution $x^*$. The convergence rate is typically $\mathcal{O}(1/k)$, similar to standard gradient descent. If $f$ is also strongly convex, linear convergence ($\mathcal{O}(\rho^k)$ for some $\rho < 1$) can be achieved.
For non-convex functions $f$, PGD generally converges to a stationary point that satisfies first-order necessary conditions (related to KKT conditions), but global optimality is not guaranteed.
PGD finds use in various machine learning scenarios involving constraints, such as non-negative matrix factorization (non-negativity constraints on the factors), regression with coefficients restricted to a norm ball, and generating adversarial examples, where the perturbation must stay within a small norm ball around the input.
Implementing PGD requires a way to evaluate the gradient $\nabla f(x)$, an efficient routine for the projection $\Pi_{\mathcal{C}}$, and a learning rate strategy.
Pseudo-code for Projected Gradient Descent:
Initialize x_0 within C
Set learning rate strategy (e.g., fixed η or schedule)
k = 0
while not converged:
    # Compute gradient
    g_k = ∇f(x_k)
    # Gradient step
    y_{k+1} = x_k - η_k * g_k
    # Projection step
    x_{k+1} = Π_C(y_{k+1})
    # Update iteration count
    k = k + 1
Return x_k
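The pseudo-code above translates directly into a short NumPy routine. The sketch below (names such as `projected_gradient_descent` are illustrative) uses a fixed learning rate and stops when consecutive iterates stop moving; it is demonstrated on minimizing $\|x - c\|_2^2$ over the unit ball with $c$ outside the ball, whose known solution is $c / \|c\|_2$:

```python
import numpy as np

def projected_gradient_descent(grad_f, project, x0, eta=0.1, max_iters=500, tol=1e-8):
    """Minimize f over C: alternate a gradient step with a projection onto C."""
    x = project(np.asarray(x0, dtype=float))   # start from a feasible point
    for _ in range(max_iters):
        y = x - eta * grad_f(x)                # gradient step
        x_new = project(y)                     # projection step
        if np.linalg.norm(x_new - x) < tol:    # iterates have stopped moving
            return x_new
        x = x_new
    return x

# Minimize f(x) = ||x - c||^2 subject to ||x|| <= 1, with c outside the ball.
c = np.array([2.0, 2.0])
grad = lambda x: 2 * (x - c)
proj = lambda y: y if np.linalg.norm(y) <= 1 else y / np.linalg.norm(y)
x_star = projected_gradient_descent(grad, proj, x0=np.zeros(2))
# x_star is approximately c / ||c|| = [0.7071, 0.7071]
```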
The primary limitation of PGD is its reliance on an efficient projection operator. If $\mathcal{C}$ is defined by complex, coupled constraints (e.g., general linear or non-linear equalities/inequalities), computing the projection can be as hard as solving the original problem. In such cases, other methods like augmented Lagrangian or interior-point methods might be more suitable.
Furthermore, PGD can sometimes exhibit slow convergence, especially if the unconstrained gradient steps frequently point far away from the feasible set, causing the projection step to significantly shorten the effective step length. This can lead to "zig-zagging" behavior along the boundary of $\mathcal{C}$.
Despite these limitations, Projected Gradient Descent remains a valuable and widely used tool for optimization problems where parameters are subject to simple constraints, offering a natural extension of gradient descent that explicitly maintains feasibility throughout the optimization process.