While Trust Region Policy Optimization (TRPO) provides strong theoretical guarantees for monotonic policy improvement by constraining updates using the KL divergence, its practical implementation can be complex. TRPO often involves calculating second-order derivatives (the Fisher Information Matrix) and solving constrained optimization problems, typically using the conjugate gradient method. This complexity motivated the development of Proximal Policy Optimization (PPO), an algorithm that aims to achieve the data efficiency and reliable performance of TRPO using only first-order optimization, making it significantly simpler to implement and tune. PPO has become a go-to algorithm in many deep reinforcement learning applications due to its balance of performance, simplicity, and stability.
PPO belongs to the family of policy gradient methods and operates within the actor-critic framework. Like TRPO, it seeks to limit how much the policy changes during each update step to avoid performance collapse. Instead of imposing a strict KL divergence constraint, PPO typically uses a clipped surrogate objective or an adaptive KL penalty to discourage large policy updates.
The Surrogate Objective
Recall the standard policy gradient objective function, which aims to maximize the expected discounted return. We often work with a surrogate objective that involves the probability ratio between the new policy πθ and the old policy πθold used to collect the data:
$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$
The standard surrogate objective, sometimes called the CPI (Conservative Policy Iteration) objective, is:
$$L^{\text{CPI}}(\theta) = \hat{\mathbb{E}}_t\!\left[ r_t(\theta)\,\hat{A}_t \right]$$
Here, E^t denotes the empirical average over a batch of collected timesteps, and A^t is an estimate of the advantage function at timestep t (often computed using Generalized Advantage Estimation, GAE, as discussed previously). Maximizing LCPI(θ) directly can lead to excessively large policy updates, especially when rt(θ) deviates significantly from 1.
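To make the ratio concrete, the following minimal sketch (in PyTorch, with illustrative names) computes rt(θ) from stored log-probabilities and forms the unclipped CPI surrogate; subtracting in log space is the usual numerically stable way to obtain the ratio.

```python
import torch

def cpi_surrogate(new_log_probs, old_log_probs, advantages):
    """Unclipped (CPI) surrogate E_t[r_t(theta) * A_hat_t], averaged over a batch.

    All arguments are 1-D tensors over collected timesteps; old_log_probs were
    recorded under pi_theta_old during data collection and carry no gradient.
    """
    # r_t(theta) = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t), computed in log space.
    ratio = torch.exp(new_log_probs - old_log_probs.detach())
    return (ratio * advantages).mean()
```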
PPO Variant 1: Clipped Surrogate Objective
This is the most common variant of PPO. It modifies the surrogate objective by clipping the probability ratio rt(θ) to keep it within a small interval around 1. How the clipping affects the objective depends on the sign of the advantage estimate A^t.
The objective function is defined as:
$$L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t\!\left[ \min\!\big( r_t(\theta)\,\hat{A}_t,\ \text{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t \big) \right]$$
Let's break down the min term:
- First term: rt(θ)A^t is the standard CPI objective.
- Second term: clip(rt(θ),1−ϵ,1+ϵ)A^t. The clip function constrains the ratio rt(θ) to lie within the range [1−ϵ,1+ϵ]. The hyperparameter ϵ (epsilon) defines the clipping range (typically 0.1 or 0.2).
The min operation ensures that the final objective is a lower bound (a pessimistic estimate) of the unclipped objective.
- If A^t>0 (advantage is positive): The objective becomes min(rt(θ)A^t,(1+ϵ)A^t). If the policy update would push the probability ratio rt(θ) beyond 1+ϵ, the objective is capped at (1+ϵ)A^t and its gradient with respect to the ratio vanishes, so there is no benefit from increasing rt(θ) past 1+ϵ.
- If A^t<0 (advantage is negative): The objective becomes min(rt(θ)A^t,(1−ϵ)A^t). Since A^t is negative, this is equivalent to A^t max(rt(θ),1−ϵ): once rt(θ) drops below 1−ϵ, the contribution is frozen at (1−ϵ)A^t, so there is no additional benefit from decreasing the action's probability further. Increases in rt(θ), which make the objective worse, are left unclipped, preserving the pessimistic bound.
Essentially, the clipping removes the incentive for the policy to change too dramatically in a single update, indirectly limiting the KL divergence between the old and new policies.
[Figure: The effect of the clipping mechanism in PPO for positive and negative advantages A^t, showing how the contribution to the objective is capped when the probability ratio rt(θ) moves too far from 1.0.]
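A minimal sketch of the clipped surrogate as it typically appears in code (PyTorch here, names illustrative); the element-wise min reproduces the two cases discussed above, as the numeric examples in the comments show.

```python
import torch

def clipped_surrogate(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """PPO clipped surrogate L^CLIP, averaged over the batch (to be maximized)."""
    ratio = torch.exp(new_log_probs - old_log_probs.detach())
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Examples with eps = 0.2:
    #   A_hat = +2, ratio = 1.5  ->  min(3.0, 2.4)   = 2.4   (gain capped)
    #   A_hat = -2, ratio = 0.5  ->  min(-1.0, -1.6) = -1.6  (gain capped)
    #   A_hat = -2, ratio = 1.5  ->  min(-3.0, -2.4) = -3.0  (penalty not capped)
    return torch.min(unclipped, clipped).mean()
```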
PPO Variant 2: Adaptive KL Penalty
An alternative formulation of PPO uses a penalty on the KL divergence rather than a hard clip. The objective function is:
$$L^{\text{KLPEN}}(\theta) = \hat{\mathbb{E}}_t\!\left[ r_t(\theta)\,\hat{A}_t - \beta\, \mathrm{KL}\!\big[\pi_{\theta_{\text{old}}}(\cdot \mid s_t)\,\big\|\,\pi_\theta(\cdot \mid s_t)\big] \right]$$
Here, β is a coefficient that penalizes large deviations in the policy, measured by the KL divergence between the old and new policies. Instead of being fixed, β is typically adapted dynamically during training:
- Compute the actual KL divergence dKL=E^t[KL[πθold(⋅∣st)∣∣πθ(⋅∣st)]] after a policy update.
- If dKL is larger than a target value dtarg, increase β (to penalize KL divergence more heavily in the future).
- If dKL is smaller than dtarg, decrease β (to allow for larger updates).
While this approach is closer in spirit to the original TRPO constraint, the adaptive scheme for β adds complexity, and empirically, the clipped surrogate objective often performs just as well and is easier to implement and tune.
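For reference, one common heuristic for adapting β looks like the sketch below; the 1.5 and 2 factors follow the scheme in the original PPO paper, but the function and variable names are illustrative.

```python
def update_kl_coefficient(beta, kl_observed, kl_target, tolerance=1.5, scale=2.0):
    """Adapt the KL-penalty coefficient beta after a policy update.

    Increase beta when the measured KL overshoots the target, decrease it when
    the KL undershoots; otherwise leave it unchanged.
    """
    if kl_observed > tolerance * kl_target:
        beta *= scale
    elif kl_observed < kl_target / tolerance:
        beta /= scale
    return beta
```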
The Full PPO Algorithm
PPO is typically implemented using an actor-critic architecture. The final objective function optimized often combines the clipped policy surrogate loss (LCLIP) with a value function loss (LVF) and an entropy bonus (S) to encourage exploration:
$$L^{\text{PPO}}(\theta) = \hat{\mathbb{E}}_t\!\left[ L^{\text{CLIP}}(\theta) - c_1 L^{\text{VF}}(\theta) + c_2\, S[\pi_\theta](s_t) \right]$$
- LVF(θ)=(Vθ(st)−Vttarg)^2 is the squared-error loss for the value function (critic). Vθ(st) is the critic's estimate, and Vttarg is the target value, often the empirical return or the GAE-based return estimate (the advantage estimate added back to the old value prediction).
- S[πθ](st) is the entropy of the policy πθ at state st. Maximizing entropy encourages exploration.
- c1 and c2 are hyperparameters weighting the value loss and entropy bonus, respectively.
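Putting the three terms together, a sketch of the combined loss might look as follows (PyTorch, returning the negated objective so a standard minimizer such as Adam can be used; the coefficient defaults are common choices, not canonical values).

```python
import torch

def ppo_loss(new_log_probs, old_log_probs, advantages, values, value_targets,
             entropy, clip_eps=0.2, vf_coef=0.5, ent_coef=0.01):
    """Combined PPO objective L^CLIP - c1*L^VF + c2*S, returned as a loss to minimize."""
    ratio = torch.exp(new_log_probs - old_log_probs.detach())
    clipped_ratio = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    policy_objective = torch.min(ratio * advantages, clipped_ratio * advantages).mean()
    value_loss = (values - value_targets).pow(2).mean()   # L^VF
    entropy_bonus = entropy.mean()                         # S[pi_theta]
    # Negate so that gradient *descent* on this loss performs ascent on L^PPO.
    return -(policy_objective - vf_coef * value_loss + ent_coef * entropy_bonus)
```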
The overall algorithm proceeds as follows:
- Initialize: Initialize actor policy network πθ and critic value network Vϕ (parameters θ,ϕ). Often, some layers are shared.
- Loop: For a number of iterations:
a. Collect Data: Run the current policy πθold (where θold are the parameters before the update) in the environment for T timesteps (or N trajectories), collecting (st,at,rt+1,st+1).
b. Compute Advantages: Calculate advantage estimates A^t for all timesteps t, typically using GAE based on the collected data and the current value function Vϕ. Also compute the value targets Vttarg.
c. Optimize: Repeat for K epochs:
i. Sample mini-batches from the collected data.
ii. Compute the PPO objective LPPO using the current parameters θ,ϕ.
iii. Update the network parameters θ,ϕ using a gradient ascent optimizer (like Adam) on LPPO.
d. Set θold←θ.
The use of multiple epochs (K>1) of mini-batch updates on the same batch of data improves sample efficiency compared to methods that discard data after a single gradient step.
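As an illustration of step (c), the sketch below runs K epochs of mini-batch updates over one collected rollout, reusing the ppo_loss sketch above; the actor_critic interface (returning an action distribution and a value estimate) and the rollout dictionary layout are assumptions, not a prescribed API.

```python
import torch

def ppo_optimize(actor_critic, optimizer, rollout,
                 num_epochs=4, minibatch_size=64):
    """One optimization phase: K epochs of mini-batch gradient steps on a rollout.

    rollout is assumed to hold equal-length tensors collected under pi_theta_old:
    'observations', 'actions', 'old_log_probs', 'advantages', 'value_targets'.
    """
    n = rollout["observations"].shape[0]
    for _ in range(num_epochs):                 # K passes over the same data
        perm = torch.randperm(n)                # reshuffle timesteps each epoch
        for start in range(0, n, minibatch_size):
            idx = perm[start:start + minibatch_size]
            dist, values = actor_critic(rollout["observations"][idx])
            loss = ppo_loss(dist.log_prob(rollout["actions"][idx]),
                            rollout["old_log_probs"][idx],
                            rollout["advantages"][idx],
                            values.squeeze(-1),
                            rollout["value_targets"][idx],
                            dist.entropy())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```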
PPO Advantages and Considerations
Advantages:
- Simplicity: Easier to implement than TRPO, relying on standard first-order optimizers.
- Performance: Achieves state-of-the-art or near state-of-the-art results on many continuous and discrete control benchmarks.
- Stability: Generally more stable and less sensitive to hyperparameters than simpler policy gradient methods.
- Sample Efficiency: Reusing data for multiple gradient updates per batch improves sample efficiency.
Considerations:
- Hyperparameter Sensitivity: While often considered robust, PPO still has important hyperparameters to tune, including the clipping factor ϵ, learning rate, GAE parameters (λ,γ), coefficients c1,c2, mini-batch size, and number of optimization epochs K.
- Lack of Strict Guarantees: The clipping mechanism is a heuristic; unlike TRPO, it doesn't offer formal guarantees of monotonic improvement, though it works very well in practice.
- Implementation Details Matter: Small implementation choices (e.g., value function clipping, reward scaling, observation normalization, network initialization) can significantly impact performance, as highlighted in many practical PPO implementations and libraries.
In summary, PPO represents a significant practical advance in policy gradient methods. By introducing mechanisms like the clipped surrogate objective, it provides a way to stabilize policy updates effectively without the implementation overhead of methods like TRPO, making it a widely adopted and powerful tool for tackling complex reinforcement learning problems.