As we discussed earlier in this chapter, basic policy gradients like REINFORCE suffer from high variance. Using a state-value function $V(s)$ as a baseline helps, leading us to the Advantage Actor-Critic (A2C) framework, where the policy gradient update depends on the advantage function, $A(s_t, a_t) = Q(s_t, a_t) - V(s_t)$. The central question then becomes: how do we best estimate this advantage function using the data collected by the agent?
Different estimators for $A(s_t, a_t)$ exist, each with its own trade-offs between bias and variance. Remember, our goal is to get an estimate that leads to stable and efficient policy updates.
Consider the TD error at time $t$:

$$\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$$

This $\delta_t$ is an estimate of the advantage $A(s_t, a_t)$. Why? Because $r_{t+1} + \gamma V(s_{t+1})$ is an estimate of $Q(s_t, a_t)$. Using $\delta_t$ directly as the advantage estimate offers low variance because it depends only on the next reward and the value function estimate at the next state. However, it can be biased, especially if the value function estimate $V(s)$ is inaccurate. This is essentially the advantage estimate used in basic one-step Actor-Critic.
Alternatively, we could use the Monte Carlo estimate. We estimate the advantage as the total discounted return $G_t$ minus the baseline $V(s_t)$:

$$\hat{A}_t^{MC} = G_t - V(s_t) = \left( \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \right) - V(s_t)$$

In practice, we use the return accumulated until the end of an episode (or a fixed horizon). This estimate is unbiased if $V(s_t)$ is the true value function, but it suffers from high variance because it sums potentially many stochastic rewards $r_{t+k+1}$ over the trajectory.
We can also define $N$-step advantage estimates that blend these two extremes:

$$\hat{A}_t^{(N)} = \left( \sum_{k=0}^{N-1} \gamma^k r_{t+k+1} + \gamma^N V(s_{t+N}) \right) - V(s_t)$$

This uses $N$ steps of real rewards and then bootstraps using the value function estimate $V(s_{t+N})$. Increasing $N$ generally reduces bias but increases variance. This suggests a spectrum of possible estimators.
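To make this spectrum concrete, here is a minimal Python sketch of the $N$-step estimator above. The function name, argument layout, and the assumption that the trajectory is long enough to look $N$ steps ahead are illustrative choices, not part of any standard library. With $N = 1$ it reproduces the one-step TD error $\delta_t$, and with $N$ spanning the rest of an episode (where the terminal value is 0) it reproduces the Monte Carlo estimate.

```python
def n_step_advantage(rewards, values, t, n, gamma=0.99):
    """N-step advantage estimate for one time step of a recorded trajectory.

    rewards[t + k] corresponds to r_{t+k+1} in the text, values[k] is the
    critic's estimate V(s_k), and values[t + n] is assumed to exist
    (use 0.0 there if s_{t+n} is terminal).
    """
    # Sum of the first N discounted real rewards
    n_step_return = sum(gamma**k * rewards[t + k] for k in range(n))
    # Bootstrap the remainder of the return with the critic's V(s_{t+N})
    n_step_return += gamma**n * values[t + n]
    # Subtract the baseline V(s_t) to obtain the advantage estimate
    return n_step_return - values[t]


# Example: a 4-step episode; values[4] = 0.0 marks the terminal state.
rewards = [1.0, 0.0, 2.0, 1.0]
values = [0.5, 0.8, 1.1, 0.3, 0.0]
print(n_step_advantage(rewards, values, t=0, n=1))  # one-step TD error estimate
print(n_step_advantage(rewards, values, t=0, n=4))  # full-episode (Monte Carlo) estimate
```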
Generalized Advantage Estimation, or GAE, provides a more sophisticated way to navigate this bias-variance trade-off. It introduces a parameter, $\lambda$ (where $0 \le \lambda \le 1$), to explicitly control the weighting between bias and variance. The GAE formula computes the advantage estimate $\hat{A}_t$ as an exponentially weighted sum of TD errors $\delta_{t+l}$ over multiple time steps:
$$\hat{A}_t^{GAE(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l}$$

Here, $\gamma$ is the standard discount factor, and $\lambda$ is the GAE parameter that tunes the trade-off. Let's examine the boundary cases for $\lambda$:
Case $\lambda = 0$: If we set $\lambda = 0$, the sum collapses to only the first term ($l = 0$):

$$\hat{A}_t^{GAE(\gamma, 0)} = (\gamma \cdot 0)^0 \delta_t = \delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$$

This is exactly the one-step TD error estimate. It typically has low variance but can be biased if the value function $V$ is inaccurate.
Case $\lambda = 1$: If we set $\lambda = 1$, the formula becomes:
$$\hat{A}_t^{GAE(\gamma, 1)} = \sum_{l=0}^{\infty} \gamma^l \delta_{t+l} = \sum_{l=0}^{\infty} \gamma^l \left( r_{t+l+1} + \gamma V(s_{t+l+1}) - V(s_{t+l}) \right)$$

This sum telescopes: the $+\gamma^{l+1} V(s_{t+l+1})$ contribution of the TD error at offset $l$ cancels the $-\gamma^{l+1} V(s_{t+l+1})$ contribution of the TD error at offset $l+1$, leaving only the discounted rewards and the initial $-V(s_t)$. If we ignore function approximation errors and assume $V$ is the true value function, this sum converges to:
$$\hat{A}_t^{GAE(\gamma, 1)} \approx \left( \sum_{l=0}^{\infty} \gamma^l r_{t+l+1} \right) - V(s_t) = G_t - V(s_t)$$

This is essentially the unbiased Monte Carlo advantage estimate, which tends to have high variance.
Intermediate $\lambda$ ($0 < \lambda < 1$): Values of $\lambda$ between 0 and 1 create estimates that interpolate between the TD(0) advantage and the Monte Carlo advantage. A higher $\lambda$ (closer to 1) gives more weight to longer-term reward information, reducing bias at the cost of increased variance. A lower $\lambda$ (closer to 0) relies more heavily on the current value function estimate, reducing variance but potentially increasing bias. This allows practitioners to fine-tune the estimator based on the specific problem and the quality of the value function estimate.
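As a quick illustration of this interpolation, the weight $(\gamma \lambda)^l$ placed on each TD error $\delta_{t+l}$ decays geometrically, and $\lambda$ controls how fast. The snippet below is a small illustrative sketch, not tied to any particular library:

```python
gamma = 0.99

# Weight (gamma * lambda)^l assigned to the TD error delta_{t+l}, for l = 0..4
for lam in (0.0, 0.5, 0.95, 1.0):
    weights = [(gamma * lam) ** l for l in range(5)]
    print(f"lambda = {lam:4.2f}:", [round(w, 3) for w in weights])
```

With $\lambda = 0$ only $\delta_t$ contributes (the TD(0) case), while $\lambda$ close to 1 keeps substantial weight on distant TD errors, approaching the Monte Carlo behaviour.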
Figure: Conceptual illustration of how bias decreases and variance increases as the GAE parameter $\lambda$ goes from 0 (TD advantage) to 1 (Monte Carlo advantage).
The infinite sum in the GAE definition isn't practical for computation. Fortunately, we can compute it efficiently using a recursive formula, typically working backward from the end of a trajectory or a collected batch of experience of length T.
Let $s_T$ be the last state in the sequence.
Calculate the TD errors for all steps $t = 0, 1, \dots, T-1$:

$$\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$$

If $s_{t+1}$ is a terminal state, $V(s_{t+1})$ is defined as 0.
Compute the GAE estimates backward from $t = T-1$ down to $t = 0$, starting the recursion with $\hat{A}_T^{GAE} = 0$ (which is equivalent to simply taking $\hat{A}_{T-1}^{GAE} = \delta_{T-1}$ at the first backward step). If $s_T$ is terminal, $V(s_T)$ is 0 inside $\delta_{T-1} = r_T + \gamma V(s_T) - V(s_{T-1})$; if the batch ends before the episode does, the critic's bootstrap estimate $V(s_T)$ stands in for the rewards beyond the batch, so no TD errors past $T-1$ are needed.
Then, iterate backward:
$$\hat{A}_t^{GAE} = \delta_t + \gamma \lambda \hat{A}_{t+1}^{GAE}$$

Note that $V(s_t)$ and $V(s_{t+1})$ are provided by the current estimate from the critic network.
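This recursion maps directly onto a short backward loop. The following is a minimal sketch; the array names, the `dones` flags, and the convention that `values` carries one extra bootstrap entry $V(s_T)$ are implementation assumptions, not prescribed by the text above.

```python
import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Backward GAE recursion over a batch of T transitions.

    rewards[t] is r_{t+1}, values[t] is the critic's V(s_t), and values[T]
    (one extra entry) is V(s_T), used to bootstrap when the batch ends
    mid-episode. dones[t] is True if s_{t+1} is terminal.
    """
    T = len(rewards)
    advantages = np.zeros(T, dtype=np.float64)
    gae = 0.0
    for t in reversed(range(T)):
        # V(s_{t+1}) is defined as 0 when s_{t+1} is terminal
        next_value = 0.0 if dones[t] else values[t + 1]
        delta = rewards[t] + gamma * next_value - values[t]
        # A_hat_t = delta_t + gamma * lambda * A_hat_{t+1};
        # the running sum is reset at episode boundaries
        gae = delta + gamma * lam * (0.0 if dones[t] else gae)
        advantages[t] = gae
    return advantages
```

In practice, the resulting advantages are often added back to `values[:T]` to form return targets for training the critic, and normalized before the policy update.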
GAE has become a standard component of many modern actor-critic algorithms, particularly TRPO and PPO. Its main benefit is a significant reduction in variance compared to Monte Carlo estimates, often leading to more stable and faster learning, while introducing less bias than the simple one-step TD error estimate.
The parameter $\lambda$ acts as a control knob. While $\lambda = 1$ corresponds to the unbiased Monte Carlo estimate and $\lambda = 0$ to the potentially biased TD(0) estimate, values in between (e.g., $\lambda = 0.95$ or $\lambda = 0.97$) often yield the best empirical performance by finding a good balance. Selecting the optimal $\lambda$ is problem-dependent and becomes another hyperparameter to tune during agent training.
In summary, Generalized Advantage Estimation provides a principled and effective method for estimating the advantage function in actor-critic algorithms. By introducing the $\lambda$ parameter, GAE offers a flexible mechanism to manage the crucial bias-variance trade-off, contributing significantly to the stability and performance of advanced policy gradient methods.