While basic policy gradient methods update the policy by taking steps in the direction of the estimated gradient, choosing the right step size (learning rate) is challenging. Too small a step leads to slow progress, while too large a step can drastically change the policy, potentially leading to a catastrophic drop in performance from which the agent might never recover. We saw how Actor-Critic methods use a critic to get better gradient estimates, but the step size issue remains.
Trust Region Policy Optimization (TRPO) addresses this stability problem directly. Instead of just following the gradient, TRPO aims to maximize the policy's expected performance improvement in each step, but critically, it adds a constraint: the new policy must not diverge too far from the old policy. This constrained region around the current policy is the "trust region," implying we only trust our performance estimate within this local area.
Imagine you are adjusting the parameters $\theta$ of your policy $\pi_\theta$. Standard policy gradient methods tell you which direction $g$ (the gradient) increases performance locally. They then update $\theta_{k+1} = \theta_k + \alpha g$, where $\alpha$ is the learning rate. TRPO takes a different approach. It seeks the parameters $\theta$ that maximize an objective function representing performance improvement, while ensuring that the "distance" between the new policy $\pi_\theta$ and the old policy $\pi_{\theta_k}$ remains small.
This "distance" is typically measured using the Kullback-Leibler (KL) divergence, DKL(πθk∣∣πθ), which quantifies how much the probability distribution defined by the new policy differs from the old one.
TRPO aims to maximize a surrogate objective function, which approximates the expected advantage of the new policy $\pi_\theta$ relative to the old policy $\pi_{\theta_k}$. This objective uses importance sampling to correct for the fact that we are using data collected with $\pi_{\theta_k}$ to evaluate $\pi_\theta$:
$$L_{\theta_k}(\theta) = \mathbb{E}_{s \sim \rho_{\theta_k},\, a \sim \pi_{\theta_k}}\!\left[\frac{\pi_\theta(a \mid s)}{\pi_{\theta_k}(a \mid s)}\, A^{\pi_{\theta_k}}(s, a)\right]$$

Here, $\rho_{\theta_k}$ is the distribution of states encountered when running policy $\pi_{\theta_k}$, and $A^{\pi_{\theta_k}}(s,a)$ is the advantage estimate (e.g., from GAE) under the old policy.
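The following minimal sketch estimates this surrogate from a batch of samples; the log-probabilities and advantage values below are hypothetical placeholders, not outputs of any real policy:

```python
import numpy as np

# Sample estimate of L_{theta_k}(theta) from data collected with the old policy.
logp_old = np.array([-1.2, -0.7, -2.1, -0.9])  # log pi_{theta_k}(a|s), stored at collection time
logp_new = np.array([-1.0, -0.8, -1.9, -0.8])  # log pi_theta(a|s) under the candidate parameters
advantages = np.array([0.5, -0.2, 1.3, 0.1])   # A^{pi_{theta_k}}(s, a), e.g. from GAE

ratios = np.exp(logp_new - logp_old)           # importance weights pi_theta / pi_{theta_k}
surrogate = np.mean(ratios * advantages)       # Monte Carlo estimate of the surrogate objective
print(f"Surrogate objective estimate: {surrogate:.4f}")
```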
The key innovation in TRPO is constraining the update step. We want the average KL divergence between the old and new policies, over the states visited by the old policy, to be less than some small constant δ:
$$\mathbb{E}_{s \sim \rho_{\theta_k}}\!\left[ D_{KL}\big(\pi_{\theta_k}(\cdot \mid s)\,\|\,\pi_\theta(\cdot \mid s)\big) \right] \le \delta$$

So, the full TRPO optimization problem at each iteration $k$ is:
$$\theta_{k+1} = \arg\max_{\theta}\, L_{\theta_k}(\theta) \quad \text{subject to} \quad \bar{D}_{KL}(\pi_{\theta_k}\,\|\,\pi_\theta) \le \delta$$

where $\bar{D}_{KL}$ denotes the average KL divergence mentioned above. The hyperparameter $\delta$ defines the size of the trust region.
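As a small numerical illustration of the constraint, here is a minimal sketch with hypothetical action-probability tables (rows index states visited by the old policy, columns index actions); it checks the state-averaged KL divergence against $\delta$:

```python
import numpy as np

# Hypothetical action probabilities for two states under the old and candidate policies.
pi_old = np.array([[0.5, 0.3, 0.2],
                   [0.1, 0.6, 0.3]])
pi_new = np.array([[0.4, 0.4, 0.2],
                   [0.2, 0.5, 0.3]])

per_state_kl = np.sum(pi_old * (np.log(pi_old) - np.log(pi_new)), axis=1)
mean_kl = per_state_kl.mean()       # average KL over states visited by pi_{theta_k}
delta = 0.01                        # trust region size
print(mean_kl, mean_kl <= delta)    # is the candidate policy inside the trust region?
```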
Solving this constrained optimization problem exactly is difficult. TRPO makes it tractable using approximations:
Objective Approximation: The objective function $L_{\theta_k}(\theta)$ is approximated by its first-order Taylor expansion around $\theta_k$: $L_{\theta_k}(\theta) \approx L_{\theta_k}(\theta_k) + g^T(\theta - \theta_k)$, where $g = \nabla_\theta L_{\theta_k}(\theta)\big|_{\theta = \theta_k}$ is the policy gradient evaluated at the current parameters $\theta_k$. Since $L_{\theta_k}(\theta_k)$ is constant, maximizing this approximation is equivalent to maximizing $g^T(\theta - \theta_k)$.
Constraint Approximation: The KL divergence constraint is approximated by its second-order Taylor expansion around $\theta_k$: $\bar{D}_{KL}(\pi_{\theta_k}\,\|\,\pi_\theta) \approx \frac{1}{2}(\theta - \theta_k)^T F (\theta - \theta_k)$, where $F$ is the Fisher Information Matrix (FIM). The FIM acts like a metric on the policy parameter space, measuring how much the output distribution of the policy changes for small changes in parameters. It is defined as $F = \mathbb{E}_{\pi_{\theta_k}}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)^T\right]\big|_{\theta = \theta_k}$.
Substituting these approximations, the problem becomes finding the update step $\Delta\theta = \theta - \theta_k$ that solves:
$$\max_{\Delta\theta}\; g^T \Delta\theta \quad \text{subject to} \quad \frac{1}{2}\Delta\theta^T F \Delta\theta \le \delta$$

This is a linear objective under a single quadratic constraint, a standard trust-region subproblem that can be solved in closed form.
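Applying a Lagrange multiplier to this subproblem gives the closed-form solution, which shows where the $F^{-1}g$ direction used below comes from:

$$\Delta\theta^{*} = \sqrt{\frac{2\delta}{g^{T} F^{-1} g}}\; F^{-1} g$$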
Calculating, storing, and inverting the FIM $F$ directly is computationally expensive, especially for large neural networks. TRPO cleverly avoids this by using the Conjugate Gradient (CG) algorithm. CG is an iterative method for solving linear systems of equations of the form $Fx = g$. Importantly, CG only requires the ability to compute matrix-vector products, specifically $Fv$ for arbitrary vectors $v$. This FIM-vector product can be computed efficiently without explicitly forming $F$, using techniques similar to calculating the gradient itself (often involving automatic differentiation).
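Because $F$ equals the Hessian of the average KL divergence at $\theta = \theta_k$, the product $Fv$ can be obtained with two automatic-differentiation passes instead of forming $F$. The minimal sketch below assumes PyTorch is available and uses a tiny linear policy over a random batch of states as a hypothetical stand-in for a real policy network:

```python
import torch

torch.manual_seed(0)
policy = torch.nn.Linear(4, 3)     # hypothetical policy: state features -> action logits
states = torch.randn(8, 4)         # hypothetical batch of states from the old policy

def average_kl():
    logits = policy(states)
    p_old = torch.softmax(logits, dim=-1).detach()   # old policy, held fixed
    logp_new = torch.log_softmax(logits, dim=-1)
    return (p_old * (p_old.log() - logp_new)).sum(dim=-1).mean()

def fisher_vector_product(v, damping=1e-2):
    params = list(policy.parameters())
    grads = torch.autograd.grad(average_kl(), params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    grad_dot_v = (flat_grad * v).sum()               # scalar (grad of KL) . v
    hv = torch.autograd.grad(grad_dot_v, params)     # second pass: Hessian-vector product = F v
    flat_hv = torch.cat([h.reshape(-1) for h in hv])
    return flat_hv + damping * v                     # small damping for numerical stability

n_params = sum(p.numel() for p in policy.parameters())
v = torch.randn(n_params)
print(fisher_vector_product(v).shape)                # same size as the flattened parameters
```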
CG approximately finds the optimal step direction $\Delta\theta \propto F^{-1} g$. The magnitude of the step is then scaled so that the approximated KL constraint holds with equality, $\frac{1}{2}\Delta\theta^T F \Delta\theta = \delta$.
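Here is a minimal sketch of the conjugate gradient routine and this rescaling. It only needs a matrix-vector product callable (such as the Fisher-vector product sketched above); the small explicit matrix at the end is a toy stand-in for the real FIM:

```python
import numpy as np

def conjugate_gradient(matvec, g, iters=10, tol=1e-10):
    """Approximately solve F x = g using only matrix-vector products F v."""
    x = np.zeros_like(g)
    r = g.copy()                 # residual g - F x (x starts at zero)
    p = r.copy()                 # current search direction
    rdotr = r @ r
    for _ in range(iters):
        Fp = matvec(p)
        alpha = rdotr / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        new_rdotr = r @ r
        if new_rdotr < tol:
            break
        p = r + (new_rdotr / rdotr) * p
        rdotr = new_rdotr
    return x

def trpo_direction(matvec, g, delta=0.01):
    x = conjugate_gradient(matvec, g)                  # x ~= F^{-1} g
    scale = np.sqrt(2 * delta / (x @ matvec(x)))       # enforce 0.5 * dtheta^T F dtheta = delta
    return scale * x                                   # full (pre-line-search) step

# Toy positive-definite matrix standing in for the Fisher matrix:
F = np.array([[2.0, 0.3], [0.3, 1.0]])
g = np.array([1.0, -0.5])
print(trpo_direction(lambda v: F @ v, g))
```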
Because the Taylor approximations are only accurate locally, the step calculated via CG might not actually satisfy the KL constraint or improve the true (unapproximated) objective function. Therefore, TRPO performs a line search: it tries exponentially shrinking step sizes along the direction found by CG ($\alpha^j \Delta\theta$ for $j = 0, 1, 2, \ldots$, where $\alpha \in (0, 1)$ is a backtracking coefficient) until it finds a step that both satisfies the average KL constraint and yields a positive value for the surrogate objective $L_{\theta_k}$.
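A minimal sketch of this backtracking line search follows, assuming hypothetical helpers surrogate_at(theta) and average_kl_at(theta) that evaluate the unapproximated surrogate objective and the average KL divergence for candidate parameters:

```python
import numpy as np

def backtracking_line_search(theta_old, full_step, surrogate_at, average_kl_at,
                             delta=0.01, coef=0.8, max_backtracks=10):
    """Shrink the step until it satisfies the KL constraint and improves the surrogate."""
    for j in range(max_backtracks):
        theta_new = theta_old + (coef ** j) * full_step   # exponentially shrinking step
        improved = surrogate_at(theta_new) > 0.0          # positive surrogate value
        in_region = average_kl_at(theta_new) <= delta     # average KL constraint satisfied
        if improved and in_region:
            return theta_new                              # accept this shrunken step
    return theta_old                                      # no acceptable step found; keep old policy
```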
[Diagram: the sequence of operations within a single TRPO policy update.]

One update step in TRPO therefore involves data collection, advantage estimation, gradient calculation, the conjugate gradient method with Fisher-vector products to find a search direction, and finally a line search to ensure the constraints are met before updating the policy.
Advantages: TRPO provides notably stable training, since the KL constraint prevents destructively large policy updates; it removes the need to hand-tune a learning rate, because the trust region size δ controls the step; and it performs reliably on challenging continuous control tasks.
Disadvantages: the update is relatively complicated to implement; each step requires second-order information via Fisher-vector products, conjugate gradient iterations, and a line search, adding computational overhead; and the approach is awkward to combine with architectures that share parameters between the policy and value function or that include noise such as dropout.
TRPO represented a significant step forward in developing reliable policy optimization algorithms. However, its complexity motivated the search for simpler methods that could achieve similar stability guarantees. This led directly to the development of Proximal Policy Optimization (PPO), which we will discuss next.