Optimizing the policy with the PPO objective requires a reliable estimate of how much better a chosen action (generating a specific token) performs compared to the average action the policy would take in that state (the current generated sequence). The advantage function, $A(s_t, a_t)$, provides this estimate. Relying solely on the immediate reward $r_t$, which typically combines a reward model signal and a KL penalty, is insufficient for this optimization, because immediate rewards neglect the long-term consequences of an action. Instead, the total accumulated reward, or return, must be examined and compared against a baseline. This baseline is established by a value function $V(s_t)$, which estimates the expected return from state $s_t$.
The return $G_t$ is the total discounted reward received from time step $t$ until the end of the episode (the generated sequence). For a sequence of length $T$, it is defined as:

$$G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$$
Here, $r_k$ is the reward received after taking action $a_k$ in state $s_k$. In the RLHF context for LLMs, the reward at each step typically consists of two components: a penalty based on the KL divergence between the current policy $\pi_\theta$ and the reference (SFT) policy $\pi_{\text{ref}}$, and potentially a contribution from the final reward model score $R(x, y)$ assigned to the complete sequence $y$. A common practice is to apply the KL penalty at each token generation step $t$ and add the final reward model score only at the last step $T$. So, $r_t = -\beta\,\mathrm{KL}_t$ for $t < T$, and $r_T = R(x, y) - \beta\,\mathrm{KL}_T$, where $\mathrm{KL}_t = \log \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\text{ref}}(a_t \mid s_t)}$ and $\beta$ is the penalty coefficient.
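This reward scheme can be sketched in a few lines of plain Python. This is an illustrative helper, not the TRL API; the names `per_token_rewards`, `kl_per_token`, and `reward_model_score` are assumptions for the example, and `beta` is an arbitrary coefficient.

```python
def per_token_rewards(kl_per_token, reward_model_score, beta=0.1):
    """KL penalty at every step; reward model score added only at the final step."""
    rewards = [-beta * kl for kl in kl_per_token]
    rewards[-1] += reward_model_score  # r_T gets the sequence-level score
    return rewards

# Three generated tokens with per-token KL values, final score 1.0:
rewards = per_token_rewards([0.2, 0.5, 0.1], reward_model_score=1.0, beta=0.1)
# rewards is approximately [-0.02, -0.05, 0.99]
```

Only the last token "sees" the reward model directly; GAE is what propagates that terminal signal back to earlier tokens.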
The discount factor $\gamma \in [0, 1]$ determines the present value of future rewards. A $\gamma$ closer to 1 gives more weight to future rewards, while a $\gamma$ closer to 0 prioritizes immediate rewards. For text generation tasks, $\gamma$ is often set close to 1 (e.g., 0.99 or 1.0) because the quality assessment (via the reward model) depends on the complete sequence.
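The return definition above can be computed efficiently by accumulating rewards backward from the end of the sequence. A minimal sketch, with an illustrative function name and a small $\gamma$ chosen only to make the discounting visible:

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute G_t for every t by accumulating from the end of the episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running  # G_t = r_t + gamma * G_{t+1}
        returns[t] = running
    return returns

# A single terminal reward of 1.0, gamma = 0.5:
discounted_returns([0.0, 0.0, 1.0], gamma=0.5)  # -> [0.25, 0.5, 1.0]
```

With $\gamma = 1.0$, every position would receive the full terminal reward, which is why values near 1 suit sequence-level feedback.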
The advantage function $A(s_t, a_t)$ measures the relative value of taking action $a_t$ in state $s_t$ compared to the expected value of the state under the current policy $\pi$. It is formally defined as:

$$A(s_t, a_t) = Q(s_t, a_t) - V(s_t)$$
where $Q(s_t, a_t)$ is the action-value function, representing the expected return after taking action $a_t$ in state $s_t$ and following the policy thereafter. Since we typically learn $V(s_t)$ directly using a value network (the critic), we can estimate $Q(s_t, a_t)$ using the immediate reward $r_t$ and the value of the next state $s_{t+1}$:

$$Q(s_t, a_t) \approx r_t + \gamma V(s_{t+1})$$
Substituting this into the advantage definition gives the one-step Temporal Difference (TD) error, $\delta_t$, often used as a basic advantage estimator:

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$
This estimate tells us whether the observed outcome ($r_t + \gamma V(s_{t+1})$) was better or worse than what was expected ($V(s_t)$).
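As a concrete check of the sign convention, the TD error is a one-line computation (the function name `td_error` is illustrative):

```python
def td_error(r_t, v_t, v_next, gamma=0.99):
    """One-step TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)."""
    return r_t + gamma * v_next - v_t

# The critic expected 0.5 but a terminal reward of 1.0 arrived:
td_error(r_t=1.0, v_t=0.5, v_next=0.0, gamma=1.0)  # -> 0.5 (better than expected)
```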
The one-step TD error has low variance, but it is a biased estimate of the advantage whenever the value function $V$ is inaccurate, which is the usual case during training. At the other extreme, estimating the advantage from the full Monte Carlo return is unbiased but has high variance, since it accumulates the randomness of every subsequent step. Either extreme can make PPO updates unstable, especially in complex tasks like language generation.
Generalized Advantage Estimation (GAE) is a technique designed to balance this bias-variance trade-off by incorporating information from multiple time steps, effectively blending the single-step TD error with longer-term Monte Carlo returns. GAE introduces a parameter $\lambda \in [0, 1]$ (often called the GAE lambda) to control this trade-off: $\lambda = 0$ recovers the one-step TD error, while $\lambda = 1$ recovers the Monte Carlo estimate.
The GAE advantage estimator is calculated as an exponentially weighted sum of TD errors:

$$\hat{A}_t^{\text{GAE}(\gamma, \lambda)} = \sum_{l=0}^{T-t} (\gamma \lambda)^l \, \delta_{t+l}$$
where $\delta_{t+l} = r_{t+l} + \gamma V(s_{t+l+1}) - V(s_{t+l})$ is the TD error at time step $t+l$. (Note: $V(s_{T+1})$ is typically defined as 0 if $s_{T+1}$ is a terminal state.)
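This weighted sum is usually computed with a single backward pass, using the recursion $\hat{A}_t = \delta_t + \gamma \lambda \hat{A}_{t+1}$. A self-contained sketch (illustrative names, not the TRL implementation):

```python
def compute_gae(rewards, values, gamma=1.0, lam=0.95):
    """GAE via backward recursion: A_t = delta_t + gamma * lam * A_{t+1}.
    `values` has len(rewards) + 1 entries; the last one is V of the state
    after the final token (0.0 when the sequence terminates there)."""
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

# Two tokens, terminal reward 1.0, critic predicts 0.5 everywhere:
compute_gae([0.0, 1.0], values=[0.5, 0.5, 0.0], gamma=1.0, lam=1.0)
# -> [0.5, 0.5]: with lam = 1 this equals the Monte Carlo return minus V(s_t)
```

Setting `lam=0.0` instead would return the raw one-step TD errors, illustrating the two ends of the trade-off.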
The diagram below illustrates how TD errors over multiple steps contribute to the GAE calculation for $\hat{A}_t$.
Calculation flow for Generalized Advantage Estimation (GAE). Rewards ($r_t$) and value function estimates ($V(s_{t+1})$) for subsequent states ($s_{t+1}$) are used to compute Temporal Difference (TD) errors ($\delta_t$). These TD errors are then combined using weights based on $\gamma$ and $\lambda$ to form the final GAE advantage estimate $\hat{A}_t$.
In a typical RLHF implementation using libraries like TRL (Transformer Reinforcement Learning), GAE is computed efficiently. During the PPO rollout phase, the policy generates sequences token by token. For each token generated, up to the end of the sequence (or a maximum length), the following are stored: the log-probability of the token under the current policy (and under the reference policy, for the KL penalty), and the critic's value estimate $V(s_t)$ for the state at that step.
Once a batch of complete sequences is generated, the reward model scores each complete sequence, the per-token rewards are assembled (the KL penalty at every step, plus the reward model score at the final token), and the advantages are computed with GAE by iterating backward over each sequence. The returns used as targets for the value function are then recovered as $G_t = \hat{A}_t + V(s_t)$.
It is standard practice to normalize the advantages across a batch before using them in the PPO loss calculation. This involves subtracting the mean and dividing by the standard deviation of the advantages within the batch, which helps stabilize training by preventing excessively large policy updates.
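The normalization step is straightforward; a minimal sketch in plain Python (the function name and `eps` constant are illustrative, and `eps` guards against division by zero for near-constant batches):

```python
def normalize_advantages(advantages, eps=1e-8):
    """Whiten advantages across the batch: zero mean, unit standard deviation."""
    n = len(advantages)
    mean = sum(advantages) / n
    var = sum((a - mean) ** 2 for a in advantages) / n
    std = var ** 0.5
    return [(a - mean) / (std + eps) for a in advantages]

normalize_advantages([1.0, 3.0])  # approximately [-1.0, 1.0]
```

In practice this runs on batched tensors rather than Python lists, but the operation is the same.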
By carefully calculating returns and using GAE to estimate advantages, we provide the PPO algorithm with a stable and informative signal to guide the LLM policy towards generating responses that align better with the preferences captured by the reward model, while mitigating the instability associated with high-variance gradient estimates.