As we noted earlier, the REINFORCE algorithm, while foundational for policy gradients, often struggles in practice due to the high variance inherent in its gradient estimates. The policy update relies on the total return $G_t$ observed after taking action $a_t$ in state $s_t$. Because $G_t$ can vary significantly depending on the subsequent stochastic transitions and actions, the resulting gradient estimates $\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t$ can be noisy. This noise slows down learning and can prevent the policy from converging reliably to a good solution. Imagine trying to adjust a knob based on wildly fluctuating feedback; it's hard to know which direction is truly better.
Actor-Critic methods offer a compelling solution by introducing a baseline into the policy gradient calculation. The core idea is surprisingly simple: subtract from the return $G_t$ a quantity that depends only on the state $s_t$; call it $b(s_t)$. The modified policy gradient update term becomes:
$$\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\bigl(G_t - b(s_t)\bigr)$$

Why does this work? Crucially, subtracting a state-dependent baseline does not change the expected value of the policy gradient, so it introduces no bias into the update direction. We can show this mathematically. The expected value of the subtracted term is:

$$\mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b(s_t)\right] = \sum_s d^{\pi_\theta}(s) \sum_a \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)\, b(s)$$

Using the identity $\nabla_\theta \log \pi_\theta(a \mid s) = \frac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_\theta(a \mid s)}$, this becomes:

$$\sum_s d^{\pi_\theta}(s) \sum_a \nabla_\theta \pi_\theta(a \mid s)\, b(s) = \sum_s d^{\pi_\theta}(s)\, b(s)\, \nabla_\theta \sum_a \pi_\theta(a \mid s)$$

Since $\sum_a \pi_\theta(a \mid s) = 1$ for any state $s$, its gradient with respect to $\theta$ is zero: $\nabla_\theta 1 = 0$. Therefore, the expected value of the subtracted term is zero:

$$\mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b(s_t)\right] = 0$$

This confirms that subtracting a state-dependent baseline $b(s_t)$ preserves the unbiasedness of the policy gradient estimate.
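To make this zero-expectation result concrete, the short NumPy sketch below estimates $\mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, b(s)\right]$ by sampling actions from a fixed softmax policy in a single state. The logits, the baseline value, and the sample count are arbitrary illustrative choices.

```python
import numpy as np

# Hypothetical setup: a softmax policy over 3 actions in a single state,
# parameterized directly by per-action logits, plus an arbitrary baseline b.
rng = np.random.default_rng(0)
logits = np.array([0.5, -0.2, 1.0])
probs = np.exp(logits) / np.exp(logits).sum()
b = 7.3  # arbitrary state-dependent baseline value

def grad_log_pi(action):
    # For a softmax policy, the gradient of log pi(a|s) with respect to the
    # logits is one_hot(a) - probs.
    one_hot = np.zeros_like(probs)
    one_hot[action] = 1.0
    return one_hot - probs

# Monte Carlo estimate of E[ grad log pi(a|s) * b(s) ] under the policy.
actions = rng.choice(len(probs), size=100_000, p=probs)
estimate = np.mean([grad_log_pi(a) * b for a in actions], axis=0)
print(estimate)  # each component is close to 0: the baseline adds no bias
```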
While the expected gradient remains the same, the variance of the gradient estimate can be significantly reduced. Consider the term $(G_t - b(s_t))$. If we choose $b(s_t)$ to be a good estimate of the average return from state $s_t$, then this term represents how much better or worse the actually observed return $G_t$ was compared to the average expectation from that state.
Centering the returns around a state-specific average scales down the magnitude of the updates. Instead of large positive updates driven by high absolute returns (which might simply reflect a generally rewarding state), the updates focus on the relative quality of the action taken in that specific context. This leads to a more stable and often faster learning process.
Sample returns ($G_t$) from a trajectory. Subtracting a baseline (here, the overall average return) centers the values used for policy updates around zero. Positive values correspond to returns better than the baseline, negative values to returns worse than the baseline. This centering helps reduce the update variance.
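A small numerical illustration of this effect, using an assumed two-armed bandit in which both actions yield large, similar returns: the per-sample gradient terms have approximately the same mean with and without the baseline, but subtracting the baseline shrinks their variance considerably. All numbers here are hypothetical.

```python
import numpy as np

# Hypothetical two-armed bandit with a softmax policy; both arms pay roughly
# 10-11, so raw returns are large even when the action choice barely matters.
rng = np.random.default_rng(1)
logits = np.array([0.2, -0.1])
probs = np.exp(logits) / np.exp(logits).sum()
mean_reward = np.array([10.0, 11.0])

def grad_log_pi(a):
    g = -probs.copy()
    g[a] += 1.0
    return g  # gradient of log pi(a) with respect to the logits

baseline = probs @ mean_reward  # expected return under the current policy
with_b, without_b = [], []
for _ in range(20_000):
    a = rng.choice(2, p=probs)
    G = rng.normal(mean_reward[a], 1.0)             # noisy sampled return G_t
    without_b.append(grad_log_pi(a) * G)            # plain REINFORCE term
    with_b.append(grad_log_pi(a) * (G - baseline))  # baseline-corrected term

print("mean without baseline:", np.mean(without_b, axis=0))
print("mean with baseline:   ", np.mean(with_b, axis=0))   # about the same
print("var  without baseline:", np.var(without_b, axis=0))
print("var  with baseline:   ", np.var(with_b, axis=0))    # much smaller
```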
What is the best choice for $b(s_t)$? While a simple constant baseline (like the average return over an episode) can help somewhat, a much more effective baseline is the state-value function, $V^{\pi_\theta}(s_t)$. This function, by definition, represents the expected return starting from state $s_t$ and following the current policy $\pi_\theta$.
Using $V(s_t)$ as the baseline, the update term becomes:
$$\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\bigl(G_t - V(s_t)\bigr)$$

The term $(G_t - V(s_t))$ is an estimate of the Advantage Function, $A^{\pi_\theta}(s_t, a_t)$. The advantage function measures how much better taking action $a_t$ in state $s_t$ is compared to the average action chosen by the policy $\pi_\theta$ from state $s_t$. Formally:
$$A^{\pi_\theta}(s_t, a_t) = Q^{\pi_\theta}(s_t, a_t) - V^{\pi_\theta}(s_t)$$

where $Q^{\pi_\theta}(s_t, a_t)$ is the action-value function. Since $G_t$ is a Monte Carlo sample estimate of $Q^{\pi_\theta}(s_t, a_t)$, the term $(G_t - V(s_t))$ serves as a sample estimate of the advantage $A^{\pi_\theta}(s_t, a_t)$.
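As a concrete sketch of forming this estimate, the snippet below computes Monte Carlo returns $G_t$ backwards through one short episode and subtracts assumed critic values $V(s_t)$; the rewards, discount factor, and value estimates are made-up numbers for illustration.

```python
import numpy as np

# Hypothetical one-episode data: per-step rewards and the critic's V(s_t).
gamma = 0.99
rewards = np.array([1.0, 0.0, 0.5, 1.0])
values = np.array([1.8, 0.9, 1.2, 0.8])

# Discounted Monte Carlo returns G_t, computed backwards from the final step.
returns = np.zeros_like(rewards)
G = 0.0
for t in reversed(range(len(rewards))):
    G = rewards[t] + gamma * G
    returns[t] = G

advantages = returns - values  # sample estimate of A(s_t, a_t) = G_t - V(s_t)
print(returns)      # G_t at each step
print(advantages)   # > 0: better than the critic expected; < 0: worse
```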
Using the state-value function $V^{\pi_\theta}(s_t)$ as the baseline can be shown to be very close to the optimal choice (in the sense of minimizing the variance of the gradient estimate) among all functions that depend only on the state $s_t$, and it is the standard choice in practice.
This naturally leads us to the Actor-Critic architecture. We need a way to estimate $V(s_t)$ to use it as a baseline. This is precisely the role of the critic.
Conceptual diagram of an Actor-Critic architecture using $V(s)$ as a baseline. The Actor selects actions based on the policy. The Critic evaluates states by learning $V(s)$. The Critic's value estimate $V(s)$ is used as a baseline to compute an advantage estimate, which in turn provides a lower-variance signal for updating the Actor's policy. Both components learn from interactions with the environment.
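The PyTorch sketch below shows one way these pieces might fit together for a discrete-action task. The network sizes, learning rate, and loss weighting are illustrative assumptions rather than a prescribed recipe: the actor's loss weights log-probabilities by the advantage estimate, while the critic regresses $V(s_t)$ toward the sampled returns.

```python
import torch
import torch.nn as nn

# Illustrative dimensions and hyperparameters (not tuned for any specific task).
obs_dim, n_actions = 4, 2
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(
    list(actor.parameters()) + list(critic.parameters()), lr=3e-4
)

def update(states, actions, returns):
    """One update from a batch of visited states, chosen actions, and returns G_t."""
    logits = actor(states)                          # actor: action preferences
    values = critic(states).squeeze(-1)             # critic: baseline V(s_t)
    log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)

    advantages = returns - values.detach()          # A_t estimate: G_t - V(s_t)
    actor_loss = -(log_probs * advantages).mean()   # lower-variance policy gradient
    critic_loss = (returns - values).pow(2).mean()  # fit V(s_t) to the returns

    optimizer.zero_grad()
    (actor_loss + 0.5 * critic_loss).backward()
    optimizer.step()
```

Detaching the critic's values when forming the advantage keeps the policy gradient from flowing into the critic through the baseline; the critic is trained only by its own regression loss.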
In summary, introducing a state-dependent baseline, particularly the state-value function $V(s_t)$, is a powerful technique for reducing the variance of policy gradient estimates without introducing bias. This naturally motivates the Actor-Critic framework, where the critic learns the value function to provide this baseline, and the actor updates the policy using the resulting advantage signal. The following sections will explore specific algorithms like A2C/A3C that implement this idea effectively and introduce further refinements like Generalized Advantage Estimation (GAE).