The REINFORCE algorithm is a fundamental Monte Carlo policy gradient method. Understanding its update rule for the policy parameters $\theta$ is essential for effectively implementing baselines for variance reduction. The update applied at each time step $t$ of a sampled episode is:

$$\theta \leftarrow \theta + \alpha \, G_t \, \nabla_\theta \ln \pi(A_t \mid S_t, \theta)$$

Here, $G_t$ is the total discounted return starting from time step $t$. While this update rule correctly pushes the parameters in the direction of increased expected return, it often suffers from a significant drawback: high variance.
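As a concrete reference point, here is a minimal NumPy sketch of this update for a linear softmax policy over discrete actions. The helper names (`discounted_returns`, `grad_log_softmax`) and the `(state_features, action, reward)` episode format are illustrative choices for this example, not a fixed API.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Compute G_t for every time step of one completed episode."""
    G = 0.0
    returns = np.zeros(len(rewards))
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G
        returns[t] = G
    return returns

def grad_log_softmax(theta, x, a):
    """Gradient of ln pi(a | s, theta) for a linear softmax policy.

    theta has shape (n_actions, n_features); x = x(s) is the feature vector.
    """
    logits = theta @ x
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    grad = -np.outer(probs, x)   # -pi(a' | s) * x(s) for every action a'
    grad[a] += x                 # +x(s) for the action actually taken
    return grad

def reinforce_update(theta, episode, alpha=0.01, gamma=0.99):
    """One REINFORCE update from a single episode of (x(s), a, r) tuples."""
    returns = discounted_returns([r for (_, _, r) in episode], gamma)
    for t, (x, a, _) in enumerate(episode):
        theta = theta + alpha * returns[t] * grad_log_softmax(theta, x, a)
    return theta
```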
The return $G_t$ is calculated from a single, complete trajectory sampled according to the current policy $\pi(a \mid s, \theta)$. The rewards received along this trajectory can vary substantially due to the stochastic nature of the environment and the policy itself. A slight change in an early action might lead to a completely different sequence of states and rewards, resulting in a very different $G_t$.
This high variance in $G_t$ translates directly into high variance in the gradient estimates $G_t \nabla_\theta \ln \pi(A_t \mid S_t, \theta)$. Imagine two different episodes starting from the same state $S_t$. In one episode, the agent gets lucky and achieves a high return $G_t$. In another, perhaps due to an unlucky transition, the return is much lower. The resulting gradient updates for the policy parameters will point in very different directions, even if the action taken was the same.
This noisy gradient signal makes learning inefficient and unstable. The parameter updates can oscillate wildly, requiring many samples (episodes) to average out the noise and make consistent progress. Consequently, the learning process can be extremely slow.
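To make the scale of this noise concrete, the following toy simulation (a made-up two-branch environment, not from the text) samples returns from episodes in which a single early chance event decides between a rewarding and an unrewarding corridor:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_return(gamma=0.99, horizon=50):
    """Toy episode: an early stochastic branch sends the agent down a
    rewarding or an unrewarding corridor, so single-episode returns spread widely."""
    lucky = rng.random() < 0.5
    rewards = rng.normal(loc=2.0 if lucky else -1.0, scale=1.0, size=horizon)
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

returns = np.array([sample_return() for _ in range(5_000)])
print(f"mean G_0 = {returns.mean():.1f}, std G_0 = {returns.std():.1f}")
# The spread is comparable to (here, larger than) the mean itself, so any
# single-episode gradient estimate scaled by G_0 is correspondingly noisy.
```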
To address this high variance, we can modify the policy gradient update by subtracting a baseline $b(S_t)$ from the return $G_t$. The important requirement for the baseline is that it must not depend on the action $A_t$; it may only depend on the state $S_t$.
The modified update rule becomes:

$$\theta \leftarrow \theta + \alpha \left( G_t - b(S_t) \right) \nabla_\theta \ln \pi(A_t \mid S_t, \theta)$$
Why does this help? Intuitively, we are no longer just increasing the probability of an action based on the raw return $G_t$. Instead, we are comparing the obtained return $G_t$ to an expected or average return $b(S_t)$ for being in state $S_t$.
This comparison focuses the update on the relative quality of the action $A_t$ given the state $S_t$, rather than the absolute return $G_t$, which can fluctuate greatly. By centering the returns around a baseline, we reduce the magnitude and variability of the update term $G_t - b(S_t)$, leading to lower variance in the gradient estimates and more stable learning.
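The effect can be checked numerically. The sketch below (a toy single-state, two-action example with invented reward statistics) compares the spread of the sampled gradient term with and without a baseline close to the state's expected return; the mean stays essentially the same while the variance drops sharply:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: one state with feature x, a 2-action linear softmax policy.
theta = np.array([[0.2], [-0.1]])
x = np.array([1.0])
logits = theta @ x
probs = np.exp(logits - logits.max())
probs /= probs.sum()

def grad_log_pi(a):
    g = -np.outer(probs, x)
    g[a] += x
    return g

mean_return = {0: 55.0, 1: 45.0}                 # invented action-dependent returns
baseline = probs[0] * mean_return[0] + probs[1] * mean_return[1]   # ~ v_pi(s)

raw, centered = [], []
for _ in range(100_000):
    a = rng.choice(2, p=probs)
    G = mean_return[a] + rng.normal(scale=10.0)  # noisy sampled return
    g = grad_log_pi(a)
    raw.append(G * g)
    centered.append((G - baseline) * g)

raw, centered = np.array(raw), np.array(centered)
print("mean of raw estimate:     ", raw.mean(axis=0).ravel())
print("mean of centered estimate:", centered.mean(axis=0).ravel())  # nearly identical
print("variance, raw:     ", raw.var(axis=0).sum())
print("variance, centered:", centered.var(axis=0).sum())            # much smaller
```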
A critical property of the baseline is that subtracting it does not change the expected value of the gradient update. In other words, it reduces variance without introducing bias. We can show this mathematically. The expected value of the term we subtracted, scaled by the gradient of the log-probability, is zero:

$$\mathbb{E}_{A_t \sim \pi} \left[ b(S_t) \nabla_\theta \ln \pi(A_t \mid S_t, \theta) \right] = \sum_{a} \pi(a \mid S_t, \theta) \, b(S_t) \, \nabla_\theta \ln \pi(a \mid S_t, \theta)$$
Using the identity $\nabla_\theta \ln \pi(a \mid s, \theta) = \frac{\nabla_\theta \pi(a \mid s, \theta)}{\pi(a \mid s, \theta)}$, we get:

$$\sum_{a} \pi(a \mid S_t, \theta) \, b(S_t) \, \frac{\nabla_\theta \pi(a \mid S_t, \theta)}{\pi(a \mid S_t, \theta)} = b(S_t) \sum_{a} \nabla_\theta \pi(a \mid S_t, \theta) = b(S_t) \, \nabla_\theta \sum_{a} \pi(a \mid S_t, \theta)$$
Since the sum of probabilities must equal 1 for any state $S_t$, its gradient with respect to $\theta$ is zero:

$$b(S_t) \, \nabla_\theta \sum_{a} \pi(a \mid S_t, \theta) = b(S_t) \, \nabla_\theta 1 = 0$$
Because the expected value of the subtracted term is zero, the expected value of the overall gradient remains unchanged. The baseline successfully reduces variance without altering the direction the policy parameters should move in, on average.
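For a concrete check of this argument, the short script below (an illustrative linear softmax policy with arbitrary parameters) verifies numerically that $\sum_a \pi(a \mid s, \theta) \nabla_\theta \ln \pi(a \mid s, \theta) = 0$, which is exactly why any state-dependent baseline contributes zero in expectation:

```python
import numpy as np

rng = np.random.default_rng(1)
n_actions, n_features = 4, 3
theta = rng.normal(size=(n_actions, n_features))   # arbitrary policy parameters
x = rng.normal(size=n_features)                    # arbitrary state features

logits = theta @ x
probs = np.exp(logits - logits.max())
probs /= probs.sum()

expectation = np.zeros_like(theta)
for a in range(n_actions):
    grad_log_pi = -np.outer(probs, x)              # gradient of ln pi(a | s, theta)
    grad_log_pi[a] += x
    expectation += probs[a] * grad_log_pi          # sum_a pi(a|s) * grad ln pi(a|s)

print(np.allclose(expectation, 0.0))               # True: zero in expectation
```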
What makes a good baseline $b(S_t)$? An effective and widely used choice is an estimate of the state-value function, $v_\pi(S_t)$. The state-value function $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$ represents the expected return starting from state $s$ and following the current policy $\pi$. This is precisely the kind of "expected outcome" we want to compare against.
So, we use an approximation $\hat{v}(S_t, w)$, parameterized by weights $w$, as our baseline:

$$b(S_t) = \hat{v}(S_t, w)$$
The policy update then becomes:

$$\theta \leftarrow \theta + \alpha \left( G_t - \hat{v}(S_t, w) \right) \nabla_\theta \ln \pi(A_t \mid S_t, \theta)$$
The term $G_t - \hat{v}(S_t, w)$ is now an estimate of the advantage of taking action $A_t$ in state $S_t$. Recall that the advantage function is defined as $A_\pi(s, a) = q_\pi(s, a) - v_\pi(s)$. Since $G_t$ is a Monte Carlo sample (an estimate) of $q_\pi(S_t, A_t)$ (under policy $\pi$ for the specific action taken) and $\hat{v}(S_t, w)$ estimates $v_\pi(S_t)$, their difference approximates the advantage $A_\pi(S_t, A_t)$. Using the advantage estimate generally leads to lower variance than using the raw return $G_t$.
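Continuing the earlier sketch (reusing the hypothetical `discounted_returns` and `grad_log_softmax` helpers and assuming a linear value estimate $\hat{v}(s, w) = w^\top x(s)$), the policy update with the advantage estimate might look like this:

```python
def reinforce_with_baseline_update(theta, w, episode, alpha=0.01, gamma=0.99):
    """Policy update scaled by the advantage estimate G_t - v_hat(S_t, w)."""
    returns = discounted_returns([r for (_, _, r) in episode], gamma)
    for t, (x, a, _) in enumerate(episode):
        advantage = returns[t] - w @ x   # G_t - v_hat(S_t, w)
        theta = theta + alpha * advantage * grad_log_softmax(theta, x, a)
    return theta
```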
How do we get the value function estimate $\hat{v}(S_t, w)$? We need to learn it! This value function can be learned concurrently with the policy using methods we've already encountered, such as Monte Carlo prediction or Temporal-Difference learning (like TD(0)), often combined with function approximation (linear models or neural networks) if the state space is large. This involves updating the value function parameters $w$ based on the observed returns or TD errors.
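As one possible sketch, the same episode can be used to fit the linear value weights by Monte Carlo prediction, nudging $\hat{v}(S_t, w)$ toward each observed return (a TD(0) target would work analogously); the step size `beta` and the reuse of `discounted_returns` are assumptions carried over from the earlier examples.

```python
def value_update_mc(w, episode, beta=0.05, gamma=0.99):
    """Monte Carlo prediction step for a linear baseline v_hat(s, w) = w @ x(s)."""
    returns = discounted_returns([r for (_, _, r) in episode], gamma)
    for t, (x, _, _) in enumerate(episode):
        w = w + beta * (returns[t] - w @ x) * x   # move v_hat(S_t, w) toward G_t
    return w
```

In practice, both updates are applied after each sampled episode, with the policy update using the current (still imperfect) value estimate as its baseline.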
This approach, where we learn both a policy (the "actor") and a value function (the "critic" which provides the baseline), forms the foundation for Actor-Critic methods, which we will introduce in the next section. By having the critic evaluate the actor's actions relative to the expected value of the state, we can significantly stabilize and accelerate policy gradient learning.