While basic policy gradient methods update the policy by taking steps in the direction of the estimated gradient, choosing the right step size (learning rate) is challenging. Too small a step leads to slow progress, while too large a step can drastically change the policy, potentially leading to a catastrophic drop in performance from which the agent might never recover. Actor-Critic methods use a critic to obtain better gradient estimates, but the step size problem remains.
Trust Region Policy Optimization (TRPO) addresses this stability problem directly. Instead of just following the gradient, TRPO aims to maximize the policy's expected performance improvement in each step, but critically, it adds a constraint: the new policy must not diverge too far from the old policy. This constrained region around the current policy is the "trust region," implying we only trust our performance estimate within this local area.
Imagine you are adjusting the parameters $\theta$ of your policy $\pi_\theta$. Standard policy gradient methods tell you which direction (the gradient $\nabla_\theta J(\theta)$) increases performance locally. They then update $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$, where $\alpha$ is the learning rate. TRPO takes a different approach. It seeks to find the best possible new parameters $\theta$ that maximize an objective function representing performance improvement, while ensuring that the "distance" between the new policy $\pi_\theta$ and the old policy $\pi_{\theta_{\text{old}}}$ remains small.
This "distance" is typically measured using the Kullback-Leibler (KL) divergence, , which quantifies how much the probability distribution defined by the new policy differs from the old one.
TRPO aims to maximize a surrogate objective function $L_{\theta_{\text{old}}}(\theta)$, which approximates the expected advantage of the new policy $\pi_\theta$ relative to the old policy $\pi_{\theta_{\text{old}}}$. This objective uses importance sampling to correct for the fact that we are using data collected with $\pi_{\theta_{\text{old}}}$ to evaluate $\pi_\theta$:

$$L_{\theta_{\text{old}}}(\theta) = \mathbb{E}_{s \sim \rho_{\theta_{\text{old}}},\, a \sim \pi_{\theta_{\text{old}}}}\left[ \frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}\, A^{\pi_{\theta_{\text{old}}}}(s, a) \right]$$
Here, $\rho_{\theta_{\text{old}}}$ is the distribution of states encountered when running policy $\pi_{\theta_{\text{old}}}$, and $A^{\pi_{\theta_{\text{old}}}}(s, a)$ is the advantage estimate (e.g., from GAE) under the old policy.
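The snippet below is a minimal PyTorch-style sketch of this surrogate objective, assuming `log_probs_new`, `log_probs_old`, and `advantages` are tensors computed over a batch of state-action pairs collected with the old policy (all names are illustrative):

```python
import torch

def surrogate_loss(log_probs_new, log_probs_old, advantages):
    """Importance-sampled surrogate objective L_{theta_old}(theta).

    log_probs_new: log pi_theta(a|s) for the batch (requires grad)
    log_probs_old: log pi_theta_old(a|s) for the same batch (treated as constant)
    advantages:    advantage estimates A(s, a) under the old policy
    """
    # Importance ratio pi_theta(a|s) / pi_theta_old(a|s)
    ratio = torch.exp(log_probs_new - log_probs_old.detach())
    # Average the ratio-weighted advantages over the batch
    return (ratio * advantages).mean()
```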
The innovation in TRPO is constraining the update step. We want the average KL divergence between the old and new policies, over the states visited by the old policy, to be less than some small constant $\delta$:

$$\bar{D}_{\text{KL}}(\theta_{\text{old}}, \theta) = \mathbb{E}_{s \sim \rho_{\theta_{\text{old}}}}\left[ D_{\text{KL}}\big(\pi_{\theta_{\text{old}}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s)\big) \right] \le \delta$$
So, the full TRPO optimization problem at each iteration is:

$$\max_\theta \; L_{\theta_{\text{old}}}(\theta) \quad \text{subject to} \quad \bar{D}_{\text{KL}}(\theta_{\text{old}}, \theta) \le \delta$$
where $\bar{D}_{\text{KL}}(\theta_{\text{old}}, \theta)$ denotes the average KL divergence defined above. The hyperparameter $\delta$ defines the size of the trust region.
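A sketch of the constraint term for a discrete action space might look as follows, assuming `probs_old` and `probs_new` hold the full action distributions of the old and new policies at the sampled states (names are illustrative):

```python
import torch

def mean_kl_categorical(probs_old, probs_new, eps=1e-8):
    """Average KL(pi_old || pi_new) over a batch of sampled states.

    probs_old, probs_new: tensors of shape (batch_size, num_actions)
    holding the full action distributions at the visited states.
    """
    kl_per_state = (probs_old * (torch.log(probs_old + eps)
                                 - torch.log(probs_new + eps))).sum(dim=-1)
    return kl_per_state.mean()

# The constraint then reads: mean_kl_categorical(probs_old, probs_new) <= delta
```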
Solving this constrained optimization problem exactly is difficult. TRPO makes it tractable using approximations:
Objective Approximation: The objective function is approximated by its first-order Taylor expansion around $\theta_{\text{old}}$: $L_{\theta_{\text{old}}}(\theta) \approx L_{\theta_{\text{old}}}(\theta_{\text{old}}) + g^\top (\theta - \theta_{\text{old}})$, where $g = \nabla_\theta L_{\theta_{\text{old}}}(\theta)\big|_{\theta = \theta_{\text{old}}}$ is the policy gradient evaluated at the current parameters $\theta_{\text{old}}$. Since $L_{\theta_{\text{old}}}(\theta_{\text{old}})$ is constant, maximizing this is equivalent to maximizing $g^\top (\theta - \theta_{\text{old}})$.
Constraint Approximation: The KL divergence constraint is approximated by its second-order Taylor expansion around $\theta_{\text{old}}$: $\bar{D}_{\text{KL}}(\theta_{\text{old}}, \theta) \approx \frac{1}{2}(\theta - \theta_{\text{old}})^\top F (\theta - \theta_{\text{old}})$, where $F$ is the Fisher Information Matrix (FIM). The FIM acts like a metric on the policy parameter space, measuring how much the output distribution of the policy changes for small changes in parameters. It is defined as $F = \mathbb{E}_{s \sim \rho_{\theta_{\text{old}}},\, a \sim \pi_{\theta_{\text{old}}}}\left[ \nabla_\theta \log \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)^\top \right]$ (a small explicit construction is sketched below).
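To make the FIM definition concrete, the sketch below forms $F$ explicitly by averaging outer products of score vectors for a tiny policy. This is only feasible for very small parameter counts and assumes a hypothetical `policy.log_prob(state, action)` helper returning a scalar log-probability; TRPO itself never builds the full matrix:

```python
import torch

def explicit_fim(policy, states, actions):
    """Form F = E[ grad log pi * (grad log pi)^T ] explicitly.

    Only feasible for very small parameter counts; real TRPO implementations
    avoid this matrix entirely and rely on Fisher-vector products instead.
    """
    params = [p for p in policy.parameters() if p.requires_grad]
    num_params = sum(p.numel() for p in params)
    fim = torch.zeros(num_params, num_params)
    for state, action in zip(states, actions):
        log_prob = policy.log_prob(state, action)      # assumed helper on the policy
        grads = torch.autograd.grad(log_prob, params)
        score = torch.cat([g.reshape(-1) for g in grads])
        fim += torch.outer(score, score)
    return fim / len(states)
```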
Substituting these approximations and writing the update step as $\Delta\theta = \theta - \theta_{\text{old}}$, the problem becomes finding the $\Delta\theta$ that solves:

$$\max_{\Delta\theta} \; g^\top \Delta\theta \quad \text{subject to} \quad \frac{1}{2}\, \Delta\theta^\top F \, \Delta\theta \le \delta$$
This is a linear objective subject to a quadratic constraint, which has a closed-form solution pointing along the natural gradient direction $F^{-1} g$.
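For completeness, the closed-form solution follows from a standard Lagrange-multiplier argument (stationarity gives $g = \lambda F \Delta\theta$, and scaling makes the constraint tight), and it is exactly what the conjugate gradient procedure described next approximates:

$$\Delta\theta^\star = \sqrt{\frac{2\delta}{g^\top F^{-1} g}}\; F^{-1} g$$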
Calculating, storing, and inverting the FIM directly is computationally expensive, especially for large neural networks. TRPO cleverly avoids this by using the Conjugate Gradient (CG) algorithm. CG is an iterative method for solving linear systems of equations of the form $Fx = g$. Importantly, CG only requires the ability to compute matrix-vector products, specifically $Fv$ for arbitrary vectors $v$. This FIM-vector product can be computed efficiently without explicitly forming $F$, using techniques similar to calculating the gradient itself (often involving automatic differentiation).
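The following sketch shows one common way to implement these pieces in PyTorch: a Fisher-vector product computed by differentiating the mean KL twice (a Hessian-vector product), and a basic conjugate gradient loop that uses it. The function names and the damping coefficient are illustrative choices, not a fixed API:

```python
import torch

def fisher_vector_product(mean_kl, params, v, damping=0.1):
    """Compute F v as a Hessian-vector product of the mean KL.

    mean_kl: scalar tensor, average KL(pi_old || pi_theta) evaluated at theta_old
    params:  list of policy parameters
    v:       flat vector with the same total number of elements as params
    """
    grads = torch.autograd.grad(mean_kl, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    grad_v = (flat_grad * v).sum()
    hv = torch.autograd.grad(grad_v, params, retain_graph=True)
    flat_hv = torch.cat([h.reshape(-1) for h in hv])
    return flat_hv + damping * v  # damping stabilizes the CG iterations

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g using only Fisher-vector products."""
    x = torch.zeros_like(g)
    r = g.clone()          # residual
    p = g.clone()          # search direction
    rs_old = r.dot(r)
    for _ in range(iters):
        Ap = fvp(p)
        alpha = rs_old / p.dot(Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r.dot(r)
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x
```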
CG approximately finds the optimal step direction $s \approx F^{-1} g$. The magnitude of the step is then scaled by $\beta = \sqrt{2\delta / (s^\top F s)}$ so that the full step $\beta s$ lies on the boundary of the KL constraint.
Because the Taylor approximations are only accurate locally, the step calculated via CG might not actually satisfy the KL constraint or improve the true (unapproximated) objective function. Therefore, TRPO performs a backtracking line search: it tries exponentially shrinking step sizes $\alpha^j \beta s$ (for $j = 0, 1, 2, \dots$, where $\alpha \in (0, 1)$ is a backtracking coefficient) until it finds a step that both satisfies the average KL constraint and yields a positive value for the surrogate objective $L_{\theta_{\text{old}}}$.
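A compact sketch of this final stage is shown below. It assumes hypothetical callables `surrogate_at` and `mean_kl_at` that evaluate the true surrogate objective and average KL at a candidate flat parameter vector, plus the `fvp` Fisher-vector product from the previous sketch:

```python
import torch

def trpo_line_search(theta_old_flat, step_dir, fvp, delta,
                     surrogate_at, mean_kl_at, alpha=0.5, max_backtracks=10):
    """Scale the CG direction to the trust-region boundary, then backtrack.

    step_dir:     approximate F^{-1} g returned by conjugate gradient
    fvp:          callable computing Fisher-vector products F v
    surrogate_at: callable mapping flat parameters -> surrogate objective value
                  (0 for the old parameters, so "positive" means improvement)
    mean_kl_at:   callable mapping flat parameters -> average KL vs. the old policy
    """
    # Scale so that (1/2) * (beta * s)^T F (beta * s) = delta at the boundary
    sFs = step_dir.dot(fvp(step_dir))
    beta = torch.sqrt(2 * delta / sFs)
    full_step = beta * step_dir

    for j in range(max_backtracks):
        candidate = theta_old_flat + (alpha ** j) * full_step
        if surrogate_at(candidate) > 0 and mean_kl_at(candidate) <= delta:
            return candidate          # first step that passes both checks
    return theta_old_flat             # no acceptable step found: keep the old policy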
The following diagram illustrates the sequence of operations within a single TRPO policy update:

[Diagram: data collection → advantage estimation → policy gradient → conjugate gradient with Fisher-vector products → line search → policy update]
This diagram shows the typical workflow for one update step in TRPO. It involves data collection, advantage estimation, gradient calculation, using the conjugate gradient method with Fisher-vector products to find a search direction, and finally a line search to ensure constraints are met before updating the policy.
Advantages: The explicit KL constraint prevents destructive updates, directly addressing the step-size problem described at the start of this section; the trust-region size $\delta$ largely replaces a hand-tuned learning rate; and in practice TRPO tends to deliver stable, reliable improvement from one update to the next.
Disadvantages: It is considerably more complex to implement than first-order policy gradient methods, and each update requires conjugate gradient iterations, Fisher-vector products, and a backtracking line search, which adds noticeable computational overhead per step, especially for large networks.
TRPO represented a significant step forward in developing reliable policy optimization algorithms. However, its complexity motivated the search for simpler methods that could achieve similar stability guarantees. This led directly to the development of Proximal Policy Optimization (PPO), which we will discuss next.