Basic policy gradients, such as REINFORCE, suffer from high variance. The Advantage Actor-Critic (A2C) framework addresses this by using a state-value function as a baseline. In A2C, the policy gradient update depends on the advantage function, $A(s_t, a_t) = Q(s_t, a_t) - V(s_t)$. The central question then becomes: how do we best estimate this advantage function using the data collected by the agent?
Different estimators for $A(s_t, a_t)$ exist, each with its own trade-offs between bias and variance. Remember, our goal is to get an estimate that leads to stable and efficient policy updates.
Consider the TD error at time $t$:

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

This is an estimate of the advantage $A(s_t, a_t)$. Why? Because $r_t + \gamma V(s_{t+1})$ is an estimate of $Q(s_t, a_t)$. Using $\delta_t$ directly as the advantage estimate offers low variance because it depends only on the next reward and the value function estimate at the next state. However, it can be biased, especially if the value function estimate is inaccurate. This is essentially the advantage estimate used in basic one-step Actor-Critic.
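As a concrete illustration, here is a minimal NumPy sketch of this one-step TD advantage. The array names (`rewards`, `values`, `next_values`, `dones`) are illustrative conventions, not from a specific library:

```python
import numpy as np

def td_advantage(rewards, values, next_values, dones, gamma=0.99):
    """One-step TD-error advantage: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).

    All inputs are 1-D arrays of equal length; dones[t] = 1 marks a terminal
    transition, where no bootstrapping through V(s_{t+1}) should occur.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    next_values = np.asarray(next_values, dtype=np.float64)
    not_done = 1.0 - np.asarray(dones, dtype=np.float64)
    return rewards + gamma * next_values * not_done - values
```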
Alternatively, we could use the Monte Carlo estimate. We estimate the advantage as the total discounted return $G_t$ minus the baseline $V(s_t)$:

$$\hat{A}_t = G_t - V(s_t), \quad \text{where } G_t = \sum_{l=0}^{\infty} \gamma^l r_{t+l}$$

In practice, we use the return accumulated until the end of an episode (or a fixed horizon). This estimate is unbiased if $V$ is the true value function, but it suffers from high variance because it sums potentially many stochastic rewards over the trajectory.
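A comparable sketch of the Monte Carlo version, assuming a single complete episode of rewards and per-state value estimates (again, the names are illustrative):

```python
import numpy as np

def monte_carlo_advantage(rewards, values, gamma=0.99):
    """Monte Carlo advantage: A_t = G_t - V(s_t), where G_t is the
    discounted return from step t to the end of the episode."""
    returns = np.zeros(len(rewards), dtype=np.float64)
    running = 0.0
    # Accumulate discounted returns backward: G_t = r_t + gamma * G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns - np.asarray(values, dtype=np.float64)
```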
We can also define N-step advantage estimates that blend these two extremes:

$$\hat{A}_t^{(n)} = r_t + \gamma r_{t+1} + \dots + \gamma^{n-1} r_{t+n-1} + \gamma^n V(s_{t+n}) - V(s_t)$$

This uses $n$ steps of real rewards and then bootstraps using the value function estimate $V(s_{t+n})$. Increasing $n$ generally reduces bias but increases variance. This suggests a spectrum of possible estimators.
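The N-step variant can be sketched the same way. Here `values` is assumed to have one extra entry, $V(s_T)$, so the bootstrap term is available at the end of the batch (it should be 0 if $s_T$ is terminal):

```python
import numpy as np

def n_step_advantage(rewards, values, gamma=0.99, n=5):
    """N-step advantage: n discounted rewards, then bootstrap with V(s_{t+n}).

    rewards has length T; values has length T + 1 so V(s_{t+n}) exists even
    when t + n == T. Near the end of the batch fewer than n real rewards are used.
    """
    T = len(rewards)
    advantages = np.zeros(T, dtype=np.float64)
    for t in range(T):
        horizon = min(n, T - t)
        g = sum(gamma**l * rewards[t + l] for l in range(horizon))
        g += gamma**horizon * values[t + horizon]  # bootstrap with the critic's estimate
        advantages[t] = g - values[t]
    return advantages
```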
Generalized Advantage Estimation, or GAE, provides a more sophisticated way to navigate this bias-variance trade-off. It introduces a parameter, $\lambda$ (where $0 \le \lambda \le 1$), to explicitly control the weighting between bias and variance. The GAE formula computes the advantage estimate as an exponentially weighted average of TD errors over multiple time steps:

$$\hat{A}_t^{GAE(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l}$$

Here, $\gamma$ is the standard discount factor, and $\lambda$ is the GAE parameter that tunes the trade-off. Let's examine the boundary cases for $\lambda$:
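To make the definition concrete, the sum can be evaluated directly, truncated at the end of the collected trajectory. This brute-force form is only for illustration; the efficient recursive computation used in practice is described later in this section:

```python
import numpy as np

def gae_direct(td_errors, gamma=0.99, lam=0.95):
    """GAE as an explicit weighted sum of TD errors:
    A_t = sum_{l >= 0} (gamma * lam)^l * delta_{t+l}, truncated at the batch end."""
    T = len(td_errors)
    advantages = np.zeros(T, dtype=np.float64)
    for t in range(T):
        weight = 1.0
        for l in range(T - t):
            advantages[t] += weight * td_errors[t + l]
            weight *= gamma * lam
    return advantages
```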
Case $\lambda = 0$: If we set $\lambda = 0$, the sum collapses to only the first term ($l = 0$):

$$\hat{A}_t^{GAE(\gamma, 0)} = \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$
This is exactly the one-step TD error estimate. It typically has low variance but can be biased if the value function is inaccurate.
Case $\lambda = 1$: If we set $\lambda = 1$, the formula becomes:

$$\hat{A}_t^{GAE(\gamma, 1)} = \sum_{l=0}^{\infty} \gamma^l \delta_{t+l}$$

This sum telescopes. If we ignore function approximation errors and assume $V$ is the true value function, this sum converges to:

$$\hat{A}_t^{GAE(\gamma, 1)} = \sum_{l=0}^{\infty} \gamma^l r_{t+l} - V(s_t) = G_t - V(s_t)$$
This is essentially the unbiased Monte Carlo advantage estimate, which tends to have high variance.
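Writing out the first few terms makes the cancellation explicit:

$$\sum_{l=0}^{\infty} \gamma^l \delta_{t+l} = \big(r_t + \gamma V(s_{t+1}) - V(s_t)\big) + \gamma \big(r_{t+1} + \gamma V(s_{t+2}) - V(s_{t+1})\big) + \dots = \sum_{l=0}^{\infty} \gamma^l r_{t+l} - V(s_t)$$

Each intermediate value term $\gamma^{l+1} V(s_{t+l+1})$ cancels against the $-\gamma^{l+1} V(s_{t+l+1})$ contributed by the next TD error, leaving only the discounted rewards and the initial baseline $-V(s_t)$.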
Intermediate $\lambda$ ($0 < \lambda < 1$): Values of $\lambda$ between 0 and 1 create estimates that interpolate between the TD(0) advantage and the Monte Carlo advantage. A higher $\lambda$ (closer to 1) gives more weight to longer-term reward information, reducing bias at the cost of increased variance. A lower $\lambda$ (closer to 0) relies more heavily on the current value function estimate, reducing variance but potentially increasing bias. This allows practitioners to fine-tune the estimator based on the specific problem and the quality of the value function estimate.
Illustration of how bias decreases and variance increases as the GAE parameter $\lambda$ goes from 0 (TD advantage) to 1 (Monte Carlo advantage).
The infinite sum in the GAE definition isn't practical for computation. Fortunately, we can compute it efficiently using a recursive formula, typically working backward from the end of a trajectory or a collected batch of experience of length $T$.
Let $s_T$ be the last state in the sequence.
Calculate the TD errors for all steps $t = 0, 1, \dots, T-1$:

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

If $s_{t+1}$ is a terminal state, $V(s_{t+1})$ is defined as 0.
Compute the GAE estimates backward, from $t = T-1$ down to $t = 0$. If $s_T$ is terminal, start the recursion with $\hat{A}_T = 0$. If the batch was truncated after $T$ steps (i.e., the episode didn't terminate), the common approach is still to set $\hat{A}_T = 0$ and let the final TD error, $\delta_{T-1} = r_{T-1} + \gamma V(s_T) - V(s_{T-1})$, handle the bootstrapping through the critic's estimate $V(s_T)$.
Then, iterate backward:

$$\hat{A}_t = \delta_t + \gamma \lambda \hat{A}_{t+1}$$
Note that $V(s_t)$ and $V(s_{t+1})$ are provided by the current estimate from the critic network.
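Putting the full procedure together, here is a minimal sketch of the backward pass, assuming NumPy arrays where `values` holds the critic's estimates $V(s_0), \dots, V(s_T)$ and `dones[t]` flags whether $s_{t+1}$ is terminal (the array names and return targets are illustrative conventions, not requirements):

```python
import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Backward-recursive GAE: A_t = delta_t + gamma * lam * A_{t+1}.

    rewards and dones have length T; values has length T + 1, where the extra
    entry V(s_T) bootstraps the last step if the batch was truncated mid-episode.
    """
    T = len(rewards)
    advantages = np.zeros(T, dtype=np.float64)
    gae = 0.0  # corresponds to starting the recursion with A_T = 0
    for t in reversed(range(T)):
        not_done = 1.0 - float(dones[t])  # disable bootstrapping at terminal steps
        delta = rewards[t] + gamma * values[t + 1] * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
    # Advantage + value gives a return target commonly used to train the critic.
    returns = advantages + np.asarray(values[:T], dtype=np.float64)
    return advantages, returns
```

In many implementations the resulting advantages are then normalized (zero mean, unit standard deviation) within each batch before the policy update, which often improves stability.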
GAE has become a standard component in many modern actor-critic algorithms, particularly TRPO and PPO. Its main advantage is the significant reduction in variance compared to Monte Carlo estimates, often leading to much more stable and faster learning, while controlling the bias introduced compared to simple TD error estimates.
The parameter $\lambda$ acts as a control knob. While $\lambda = 1$ corresponds to the unbiased Monte Carlo estimate and $\lambda = 0$ to the potentially biased TD(0) estimate, values in between (e.g., $\lambda = 0.95$ or $\lambda = 0.97$) often yield the best empirical performance by finding a good balance. Selecting the optimal $\lambda$ is problem-dependent and becomes another hyperparameter to tune during agent training.
In summary, Generalized Advantage Estimation provides a principled and effective method for estimating the advantage function in actor-critic algorithms. By introducing the $\lambda$ parameter, GAE offers a flexible mechanism to manage the important bias-variance trade-off, contributing significantly to the stability and performance of advanced policy gradient methods.