As we saw in the previous section, the REINFORCE algorithm uses the full Monte Carlo return $G_t$ to scale the gradient $\nabla_\theta \log \pi(a_t \mid s_t; \theta)$. While this provides an unbiased estimate of the policy gradient, the high variance of $G_t$ often leads to noisy updates and slow convergence. Imagine an episode where all rewards are large and positive: $G_t$ will be large at every timestep, pushing up the probabilities of all actions taken, even if some actions were much better than others within that successful episode. Conversely, a poor episode might unduly penalize all actions taken.
To mitigate this, we can introduce a baseline function $b(s)$ that depends only on the state $s_t$. We subtract this baseline from the return $G_t$ in the policy gradient update:
$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T_i - 1} \nabla_\theta \log \pi(a_{i,t} \mid s_{i,t}; \theta) \left( G_{i,t} - b(s_{i,t}) \right)$$

Why is this valid? Crucially, subtracting a state-dependent baseline does not introduce bias into the gradient estimate. The expected value of the subtracted term is zero:
$$
\begin{aligned}
\mathbb{E}_{\pi_\theta}\!\left[ \nabla_\theta \log \pi(a_t \mid s_t; \theta) \, b(s_t) \right]
&= \sum_s d^{\pi_\theta}(s) \sum_a \pi(a \mid s; \theta) \left[ \nabla_\theta \log \pi(a \mid s; \theta) \, b(s) \right] \\
&= \sum_s d^{\pi_\theta}(s) \sum_a \nabla_\theta \pi(a \mid s; \theta) \, b(s) \\
&= \sum_s d^{\pi_\theta}(s) \, b(s) \, \nabla_\theta \sum_a \pi(a \mid s; \theta) \\
&= \sum_s d^{\pi_\theta}(s) \, b(s) \, \nabla_\theta 1 = 0
\end{aligned}
$$

Here, $d^{\pi_\theta}(s)$ is the state distribution under policy $\pi_\theta$. Since the expectation of the term we subtract is zero, the overall expectation of the gradient remains unchanged.
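As a quick sanity check of this derivation, here is a minimal NumPy sketch for a single state with a softmax policy parameterized directly by its logits, where the score has the closed form $\nabla_\theta \log \pi(a \mid s; \theta) = \mathbf{1}_a - \pi$. The number of actions `K`, the logits `theta`, and the baseline value `b` are illustrative choices, not taken from the text; the expectation comes out as the zero vector for any baseline.

```python
import numpy as np

# Zero-mean check: sum_a pi(a) * grad_theta log pi(a) * b = b * (pi - pi) = 0
rng = np.random.default_rng(0)
K = 4                                           # illustrative number of actions
theta = rng.normal(size=K)                      # logits for a single state (illustrative)
pi = np.exp(theta) / np.exp(theta).sum()        # softmax action probabilities

b = 10.0                                        # arbitrary baseline value for this state
expectation = np.zeros(K)
for a in range(K):
    grad_log_pi = np.eye(K)[a] - pi             # analytic gradient of log pi(a | s; theta)
    expectation += pi[a] * grad_log_pi * b      # weight by pi(a | s; theta)

print(expectation)   # ~[0. 0. 0. 0.] up to floating-point error
```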
While it doesn't change the expectation, a well-chosen baseline can significantly reduce the variance of the gradient estimate. The intuition is to center the returns around a reference point. Instead of simply scaling the gradient by the total return $G_t$, we scale it by how much better or worse that return was compared to what we typically expect from state $s_t$. If $G_t > b(s_t)$, the action $a_t$ led to a better-than-expected outcome, and its probability should be increased. If $G_t < b(s_t)$, the action led to a worse-than-expected outcome, and its probability should be decreased (or increased less strongly).
What makes a good baseline? Ideally, $b(s_t)$ should be an estimate of the expected return from state $s_t$, that is, the state-value function $V(s_t) = \mathbb{E}_{\pi_\theta}[G_t \mid S_t = s_t]$.
Using $V(s_t)$ as the baseline gives rise to the term $G_t - V(s_t)$. Recall that the action-value function is $Q(s_t, a_t) = \mathbb{E}_{\pi_\theta}[G_t \mid S_t = s_t, A_t = a_t]$, and the advantage function is defined as $A(s_t, a_t) = Q(s_t, a_t) - V(s_t)$. Since $G_t$ is a Monte Carlo sample of the return starting from $(s_t, a_t)$, the term $G_t - V(s_t)$ is an estimate of the advantage function $A(s_t, a_t)$.
Using the advantage estimate centers the scaling factor: actions leading to returns better than the average for that state receive positive reinforcement, while actions leading to returns worse than average receive negative reinforcement. This often results in significantly lower variance compared to using the raw return $G_t$.
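To make the advantage estimate concrete, here is a minimal sketch of computing $G_t$ and $G_t - V(s_t)$ for one collected episode. The function name `compute_returns`, the discount factor, and the example rewards and value predictions are illustrative choices, not taken from the text.

```python
import numpy as np

def compute_returns(rewards, gamma=0.99):
    """Discounted Monte Carlo returns G_t for one episode, computed backwards."""
    returns = np.zeros(len(rewards))
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

# Advantage estimates: G_t minus the baseline prediction for each visited state.
rewards = [1.0, 0.0, 2.0, 1.0]              # rewards from one collected episode (illustrative)
values = np.array([2.5, 1.8, 2.2, 0.9])     # V(s_t; w) predictions for the same states (illustrative)
returns = compute_returns(rewards)
advantages = returns - values                # estimates of A(s_t, a_t) = G_t - V(s_t)
print(returns, advantages)
```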
Figure: Example showing how subtracting a baseline ($V(s_t) \approx 10$ in this simplified case) centers the scaling factor around zero and reduces its variance compared to using raw returns $G_t$.
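The effect in the figure can be reproduced with a small simulation. The sketch below assumes a toy one-state, two-action problem whose returns cluster around 10, so a baseline of 10 is close to the expected return; all numbers are illustrative. Subtracting the baseline leaves the expected gradient unchanged but greatly shrinks its per-sample variance.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.2, -0.1])                 # logits of a 2-action softmax policy (illustrative)
pi = np.exp(theta) / np.exp(theta).sum()

def grad_log_pi(a):
    return np.eye(2)[a] - pi                  # analytic gradient of log pi(a | s; theta)

baseline = 10.0                               # roughly E[G_t] in this toy problem
samples_raw, samples_base = [], []
for _ in range(5000):
    a = rng.choice(2, p=pi)
    G = rng.normal(11.0, 1.0) if a == 0 else rng.normal(9.0, 1.0)   # returns centered near 10
    samples_raw.append(grad_log_pi(a) * G)                # gradient sample without a baseline
    samples_base.append(grad_log_pi(a) * (G - baseline))  # gradient sample with the baseline

print(np.var(samples_raw, axis=0))    # high variance: every sample is scaled by roughly 10
print(np.var(samples_base, axis=0))   # much lower variance, same expected gradient
```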
Of course, we typically don't know the true $V(s_t)$ under the current policy. Therefore, we need to learn it concurrently with the policy. This is usually done using function approximation, often with another neural network. We introduce a value function approximator $V(s; w)$ with parameters $w$.
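As a concrete example, $V(s; w)$ could be a small multilayer perceptron. The sketch below uses PyTorch; the class name, hidden size, and activation are arbitrary choices, not prescribed by the text.

```python
import torch
import torch.nn as nn

class ValueNetwork(nn.Module):
    """Small MLP approximating V(s; w): maps a state vector to a scalar value estimate."""
    def __init__(self, state_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, state):
        return self.net(state).squeeze(-1)    # shape: (batch,) of value estimates
```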
This value network $V(s; w)$ is trained to predict the expected return from a given state. Since REINFORCE uses Monte Carlo returns $G_t$, a common approach is to train the value network by minimizing the Mean Squared Error (MSE) between its predictions $V(s_t; w)$ and the observed returns $G_t$ from the collected trajectories:
$$L(w) = \mathbb{E}_{\pi_\theta}\!\left[ \left( G_t - V(s_t; w) \right)^2 \right]$$

So, during training, we perform two main updates in each iteration (after collecting trajectories):
Update Policy Parameters $\theta$: Use the policy gradient with the learned baseline:
$$\Delta\theta = \alpha_\theta \, \nabla_\theta \log \pi(a_t \mid s_t; \theta) \left( G_t - V(s_t; w) \right)$$

Note that we treat $V(s_t; w)$ as a fixed baseline value when calculating the policy gradient; we do not backpropagate the policy loss through the value network.
Update Value Function Parameters $w$: Minimize the MSE loss between predicted values and actual returns:
$$\Delta w = -\alpha_w \, \nabla_w \left( G_t - V(s_t; w) \right)^2 = \alpha_w \cdot 2 \left( G_t - V(s_t; w) \right) \nabla_w V(s_t; w)$$

Here, $\alpha_\theta$ and $\alpha_w$ are the learning rates for the policy and value function, respectively. A sketch of both updates in code follows below.
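The following sketch shows one way both updates might be implemented for a batch of collected transitions using PyTorch: instead of applying the hand-written gradient expressions, each update is realized by defining the corresponding loss and letting an optimizer apply the parameter change. The function and argument names are illustrative, not part of any fixed API. Note how the baseline is evaluated under `torch.no_grad()`, so the policy loss does not backpropagate through the value network.

```python
import torch
import torch.nn.functional as F

def reinforce_with_baseline_update(policy_net, value_net,
                                   policy_optimizer, value_optimizer,
                                   states, actions, returns):
    """One update of theta and w from a batch of (s_t, a_t, G_t) collected under the current policy.

    states:  float tensor, shape (T, state_dim)
    actions: long tensor,  shape (T,)
    returns: float tensor, shape (T,) of Monte Carlo returns G_t
    """
    # --- Policy update: scale grad log pi(a_t | s_t; theta) by (G_t - V(s_t; w)) ---
    with torch.no_grad():
        baseline = value_net(states)                # V(s_t; w), treated as a fixed baseline
    logits = policy_net(states)
    log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)
    advantages = returns - baseline
    policy_loss = -(log_probs * advantages).mean()  # minimizing this is gradient ascent on J(theta)

    policy_optimizer.zero_grad()
    policy_loss.backward()
    policy_optimizer.step()

    # --- Value update: minimize MSE between V(s_t; w) and the observed returns G_t ---
    value_loss = F.mse_loss(value_net(states), returns)
    value_optimizer.zero_grad()
    value_loss.backward()
    value_optimizer.step()

    return policy_loss.item(), value_loss.item()
```

In practice, the advantage estimates are often additionally standardized across the batch, which can further stabilize the policy updates.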
This approach of using a learned value function $V(s; w)$ as a baseline for a policy gradient method forms the basis of Actor-Critic algorithms, which we will discuss in detail in the next chapter. The policy network $\pi(a \mid s; \theta)$ is the "Actor," deciding which actions to take, and the value network $V(s; w)$ is the "Critic," evaluating the states visited by the Actor and providing the baseline to reduce variance in the Actor's learning signal.