In the previous section, we introduced the REINFORCE algorithm, a fundamental Monte Carlo policy gradient method. Recall the update rule for the policy parameters θ:
$$
\theta \leftarrow \theta + \alpha\, G_t\, \nabla_\theta \ln \pi_\theta(a_t \mid s_t)
$$

Here, $G_t$ is the total discounted return starting from time step $t$. While this update rule correctly pushes the parameters in the direction of increased expected return, it often suffers from a significant drawback: high variance.
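As a concrete reference point, here is a minimal sketch (plain NumPy, with an illustrative reward sequence) of how the returns $G_t$ that scale each REINFORCE update can be computed from a single sampled episode:

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Compute G_t for every time step of one episode."""
    returns = np.zeros(len(rewards))
    running = 0.0
    # Work backwards: G_t = r_{t+1} + gamma * G_{t+1}
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Example: rewards collected along one sampled trajectory
rewards = [0.0, 0.0, 1.0, 0.0, 5.0]
print(discounted_returns(rewards, gamma=0.9))
```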
The return $G_t$ is calculated from a single, complete trajectory sampled according to the current policy $\pi_\theta$. The rewards received along this trajectory can vary substantially due to the stochastic nature of the environment and the policy itself. A slight change in an early action might lead to a completely different sequence of states and rewards, resulting in a vastly different $G_t$.
This high variance in $G_t$ translates directly into high variance in the gradient estimates $G_t\, \nabla_\theta \ln \pi_\theta(a_t \mid s_t)$. Imagine two different episodes passing through the same state $s_t$. In one episode, the agent gets lucky and achieves a high return, $G_t = 10$. In another, perhaps due to an unlucky transition, the return is much lower, $G_t = -5$. The resulting gradient updates for the policy parameters $\theta$ will differ in magnitude and even point in opposite directions, even though the action $a_t$ taken was the same.
This noisy gradient signal makes learning inefficient and unstable. The parameter updates can oscillate wildly, requiring many samples (episodes) to average out the noise and make consistent progress. Consequently, the learning process can be extremely slow.
To address this high variance, we can modify the policy gradient update by subtracting a baseline $b(s_t)$ from the return $G_t$. The crucial requirement for the baseline is that it must not depend on the action $a_t$; it should only depend on the state $s_t$.
The modified update rule becomes:
$$
\theta \leftarrow \theta + \alpha \left( G_t - b(s_t) \right) \nabla_\theta \ln \pi_\theta(a_t \mid s_t)
$$

Why does this help? Intuitively, we are no longer just increasing the probability of an action based on the raw return $G_t$. Instead, we compare the obtained return $G_t$ to an expected or average return $b(s_t)$ for being in state $s_t$.
This comparison focuses the update on the relative quality of the action $a_t$ given the state $s_t$, rather than on the absolute return $G_t$, which can fluctuate greatly. By centering the returns around a baseline, we reduce the magnitude and variability of the update term, leading to lower variance in the gradient estimates and more stable learning.
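The effect is easy to see numerically. The sketch below (synthetic returns, NumPy only) compares the average squared magnitude of the scaling factor with and without a state-dependent baseline; since the gradient term is multiplied by this factor, a smaller spread translates into lower-variance updates:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative returns observed from the same state across many episodes:
# the expected return is around 8, but individual episodes are noisy.
returns = 8.0 + rng.normal(scale=5.0, size=1000)

baseline = returns.mean()          # stand-in for b(s_t), e.g. V(s_t)
centered = returns - baseline

# The update term is scaled by G_t (raw) vs. G_t - b(s_t) (centered).
print("mean squared scale, raw:     ", np.mean(returns**2))
print("mean squared scale, centered:", np.mean(centered**2))
```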
A critical property of the baseline $b(s_t)$ is that subtracting it does not change the expected value of the gradient update. In other words, it reduces variance without introducing bias. We can show this mathematically: the expected value of the subtracted term, the baseline multiplied by the gradient of the log-probability, is zero:
$$
\mathbb{E}_{\pi_\theta}\!\left[ b(s_t)\, \nabla_\theta \ln \pi_\theta(a_t \mid s_t) \right]
= \sum_s d(s) \sum_a \pi_\theta(a \mid s)\, b(s)\, \nabla_\theta \ln \pi_\theta(a \mid s)
$$

Using the identity $\nabla_\theta \ln \pi_\theta(a \mid s) = \dfrac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_\theta(a \mid s)}$, we get:

$$
= \sum_s d(s)\, b(s) \sum_a \pi_\theta(a \mid s)\, \frac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_\theta(a \mid s)}
= \sum_s d(s)\, b(s) \sum_a \nabla_\theta \pi_\theta(a \mid s)
$$

Since the sum of probabilities $\sum_a \pi_\theta(a \mid s)$ must equal 1 for any state $s$, its gradient with respect to $\theta$ is zero:

$$
= \sum_s d(s)\, b(s)\, \nabla_\theta \sum_a \pi_\theta(a \mid s)
= \sum_s d(s)\, b(s)\, \nabla_\theta (1) = 0
$$

Because the expected value of the subtracted term is zero, the expected value of the overall gradient remains unchanged. The baseline successfully reduces variance without altering the direction the policy parameters should move in, on average.
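Because the argument above is purely about expectations, it can also be checked empirically. The sketch below (an assumed toy softmax policy over three actions in a single state, NumPy only) estimates $\mathbb{E}_{a \sim \pi_\theta}\!\left[ b(s)\, \nabla_\theta \ln \pi_\theta(a \mid s) \right]$ by sampling actions; the estimate is close to the zero vector regardless of the constant chosen as the baseline:

```python
import numpy as np

rng = np.random.default_rng(1)

# A tiny softmax policy over 3 actions for one state,
# parameterized directly by one logit per action (illustrative only).
theta = np.array([0.5, -0.2, 1.0])

def policy_probs(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def grad_log_pi(theta, a):
    # For a softmax over logits: d/d_theta log pi(a) = one_hot(a) - pi
    one_hot = np.zeros_like(theta)
    one_hot[a] = 1.0
    return one_hot - policy_probs(theta)

baseline = 3.7                      # any action-independent constant
probs = policy_probs(theta)

# Monte Carlo estimate of E_{a ~ pi} [ baseline * grad log pi(a|s) ]
samples = [baseline * grad_log_pi(theta, rng.choice(3, p=probs))
           for _ in range(100_000)]
print(np.mean(samples, axis=0))     # close to [0, 0, 0]
```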
What makes a good baseline $b(s_t)$? An effective and widely used choice is an estimate of the state-value function, $V(s_t)$. The state-value function $V(s_t)$ represents the expected return starting from state $s_t$ and following the current policy $\pi_\theta$. This is precisely the kind of "expected outcome" we want to compare $G_t$ against.
So, we use an approximation $\hat{v}(s_t, \mathbf{w})$, parameterized by weights $\mathbf{w}$, as our baseline:
$$
b(s_t) = \hat{v}(s_t, \mathbf{w})
$$

The policy update then becomes:
$$
\theta \leftarrow \theta + \alpha \left( G_t - \hat{v}(s_t, \mathbf{w}) \right) \nabla_\theta \ln \pi_\theta(a_t \mid s_t)
$$

The term $G_t - \hat{v}(s_t, \mathbf{w})$ is now an estimate of the advantage of taking action $a_t$ in state $s_t$. Recall that the advantage function is defined as $A(s, a) = Q(s, a) - V(s)$. Since $G_t$ is a Monte Carlo sample (an estimate) of $Q^{\pi_\theta}(s_t, a_t)$, the expected return under $\pi_\theta$ for the specific action $a_t$ taken, and $\hat{v}(s_t, \mathbf{w})$ estimates $V^{\pi_\theta}(s_t)$, their difference approximates the advantage $A^{\pi_\theta}(s_t, a_t)$. Using this advantage estimate generally leads to lower variance than using the raw return $G_t$.
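In code, the quantity that scales each policy gradient step is just the difference between the sampled return and the baseline's prediction for the same state. A minimal sketch with illustrative numbers (the helper name `advantage_estimates` is ours, not a library function):

```python
import numpy as np

def advantage_estimates(returns, values):
    """Per-step advantage estimates A_t ~= G_t - v_hat(s_t, w).

    `returns` holds the Monte Carlo returns G_t for one episode and
    `values` the baseline predictions v_hat(s_t, w) for the same states.
    """
    return np.asarray(returns) - np.asarray(values)

# Illustrative numbers for a 4-step episode
G = [9.5, 8.0, 6.5, 5.0]            # sampled returns
v = [8.8, 8.2, 6.0, 4.5]            # current value estimates
print(advantage_estimates(G, v))    # [ 0.7 -0.2  0.5  0.5]
```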
How do we get the value function estimate $\hat{v}(s_t, \mathbf{w})$? We need to learn it! This value function can be learned concurrently with the policy using methods we have already encountered, such as Monte Carlo prediction or Temporal-Difference learning (like TD(0)), typically combined with function approximation (linear models or neural networks) when the state space is large. This involves updating the value function parameters $\mathbf{w}$ based on the observed returns or TD errors.
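As one concrete option, here is a sketch of a Monte Carlo update for a linear baseline $\hat{v}(s, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s)$, using assumed feature vectors and an illustrative step size `beta`; a TD(0) target ($r_{t+1} + \gamma\, \hat{v}(s_{t+1}, \mathbf{w})$) could be substituted for $G_t$ in the same loop:

```python
import numpy as np

def mc_value_update(w, features, returns, beta=0.01):
    """One Monte Carlo update pass over an episode for a linear baseline.

    v_hat(s, w) = w . x(s); after observing G_t we move the prediction
    toward the return: w <- w + beta * (G_t - v_hat(s_t, w)) * x(s_t).
    """
    for x, G in zip(features, returns):
        v_hat = w @ x
        w = w + beta * (G - v_hat) * x
    return w

# Illustrative 3-dimensional state features for a 4-step episode
features = [np.array([1.0, 0.2, 0.0]),
            np.array([1.0, 0.1, 0.5]),
            np.array([1.0, 0.0, 0.9]),
            np.array([1.0, 0.0, 1.0])]
returns = [9.5, 8.0, 6.5, 5.0]

w = np.zeros(3)
w = mc_value_update(w, features, returns)
print(w)
```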
This approach, where we learn both a policy (the "actor") and a value function (the "critic" which provides the baseline), forms the foundation for Actor-Critic methods, which we will introduce in the next section. By having the critic evaluate the actor's actions relative to the expected value of the state, we can significantly stabilize and accelerate policy gradient learning.
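To preview how the pieces fit together, here is a sketch of one full REINFORCE-with-baseline episode update, assuming a linear softmax policy over discrete actions and the linear baseline from above; the function name, data layout, and hyperparameters are all illustrative choices, not a fixed API:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_with_baseline_update(theta, w, episode,
                                   alpha=0.01, beta=0.01, gamma=0.99):
    """One episode of REINFORCE with a learned linear baseline.

    theta: (n_actions, n_features) matrix, logits(s) = theta @ x(s).
    w:     (n_features,) vector, v_hat(s, w) = w . x(s).
    episode: list of (x_s, action, reward) tuples from one trajectory.
    """
    # Monte Carlo returns, computed backwards
    G, returns = 0.0, []
    for _, _, r in reversed(episode):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()

    for (x, a, _), G_t in zip(episode, returns):
        delta = G_t - w @ x                   # advantage estimate G_t - v_hat
        w = w + beta * delta * x              # baseline ("critic") update
        probs = softmax(theta @ x)
        grad_log_pi = -np.outer(probs, x)     # d log pi(a|s) / d theta
        grad_log_pi[a] += x
        theta = theta + alpha * delta * grad_log_pi   # policy ("actor") update
    return theta, w

# Illustrative call: 3 actions, 2 state features, a 2-step fake episode
theta = np.zeros((3, 2))
w = np.zeros(2)
episode = [(np.array([1.0, 0.3]), 0, 0.0),
           (np.array([1.0, 0.7]), 2, 1.0)]
theta, w = reinforce_with_baseline_update(theta, w, episode)
```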