You've now seen the REINFORCE algorithm, a method for directly optimizing the policy parameters θ based on the gradient of the expected return. The core idea is to increase the probability of actions that lead to higher returns and decrease the probability of actions leading to lower returns. The gradient estimate we use in the simplest Monte Carlo version of REINFORCE for a single trajectory is proportional to:
$$\sum_{t=0}^{T-1} G_t \, \nabla_\theta \log \pi(A_t \mid S_t; \theta)$$

Here, $G_t = \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k$ is the total discounted return experienced starting from state $S_t$ and taking action $A_t$ in that specific episode. While this estimate is unbiased (meaning its expected value is the true gradient $\nabla_\theta J(\theta)$), it often suffers from a significant practical problem: high variance.
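To make the estimate concrete, here is a minimal sketch of how it is often formed in practice, assuming a PyTorch policy whose action log-probabilities were recorded while acting and a `rewards` list where `rewards[t]` holds $R_{t+1}$ (the function name and argument layout are placeholders, not prescribed above):

```python
import torch

def reinforce_loss(log_probs, rewards, gamma=0.99):
    """Surrogate loss for one trajectory: its gradient matches the
    REINFORCE estimate sum_t G_t * grad log pi(A_t | S_t; theta).

    log_probs: list of 0-dim tensors log pi(A_t | S_t; theta), collected while acting
    rewards:   list of floats, where rewards[t] holds R_{t+1}
    """
    T = len(rewards)
    returns = torch.empty(T)
    g = 0.0
    # Backward scan gives G_t = R_{t+1} + gamma * G_{t+1}.
    for t in reversed(range(T)):
        g = rewards[t] + gamma * g
        returns[t] = g
    # Minimizing the negated sum performs gradient ascent on the objective.
    return -(torch.stack(log_probs) * returns).sum()
```

Because `returns` scales every log-probability term, any noise in the sampled returns translates directly into noise in the resulting gradient.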
Variance refers to how much a random variable deviates from its expected value. In the context of REINFORCE, the per-step gradient estimate $G_t \nabla_\theta \log \pi(A_t \mid S_t; \theta)$ can vary dramatically from one episode (or even one time step) to the next.
Why does this happen? It stems directly from the use of the Monte Carlo return $G_t$: the return sums rewards over the entire remainder of the episode, so it accumulates randomness from the policy's own action choices, the environment's dynamics, and the rewards at every subsequent step.
Consider an agent learning to play a game. An action taken early in the game might be strategically sound, but if the agent happens to get unlucky with random events much later, the resulting $G_t$ might be very low or even negative. The REINFORCE update would then penalize this initially good action based on outcomes that were largely unrelated to the action itself.
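A tiny simulation makes this tangible. The numbers below (reward distribution, horizon, discount) are invented purely for illustration; the point is how widely the sampled return for the very first decision spreads across episodes:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, T = 0.99, 50

def sampled_return():
    """G_0 for one episode of a toy process with noisy, mostly unrelated later rewards."""
    rewards = rng.normal(loc=0.1, scale=1.0, size=T)   # hypothetical reward noise
    return float(np.sum(gamma ** np.arange(T) * rewards))

g0 = np.array([sampled_return() for _ in range(10_000)])
print(f"mean G_0 = {g0.mean():.2f}, std = {g0.std():.2f}, "
      f"fraction negative = {(g0 < 0).mean():.2f}")
# The spread is larger than the mean, so a sizeable fraction of episodes
# would penalize the first action even though it is good on average.
```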
High variance in the gradient estimates has two main negative consequences for the learning process: individual updates become unreliable, and many episodes must be averaged before the true gradient direction emerges.
Imagine trying to find the bottom of a valley (representing the optimal policy parameters) by taking steps based on very noisy measurements of the slope. Each measurement might point you in a wildly different direction. Only by averaging many noisy measurements can you get a reliable sense of the actual downhill direction. REINFORCE faces a similar challenge.
High-variance estimates fluctuate significantly around the true gradient direction, making learning unstable. Low-variance estimates provide a more consistent signal, leading to smoother convergence.
The core issue is that the magnitude of the update step, determined by $G_t$, is noisy. We are multiplying the score function $\nabla_\theta \log \pi(A_t \mid S_t; \theta)$ (which tells us which direction to nudge the parameters to make action $A_t$ more or less likely) by a potentially unreliable estimate of how good that action actually was in the long run for that specific trajectory.
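To see this effect in isolation, the sketch below pairs an exact score function for a toy two-action softmax policy with a deliberately noisy return (all numbers invented for illustration). Both actions are equally good on average, so the true gradient is zero, yet the single-episode estimates swing widely and frequently flip sign:

```python
import numpy as np

rng = np.random.default_rng(1)
theta = np.zeros(2)                          # logits of a toy 2-action softmax policy
probs = np.exp(theta) / np.exp(theta).sum()

def score(action):
    """grad_theta log pi(action) for a softmax policy: one-hot(action) - probs."""
    grad = -probs.copy()
    grad[action] += 1.0
    return grad

estimates = []
for _ in range(10_000):
    a = rng.choice(2, p=probs)
    g = rng.normal(1.0, 5.0)                 # noisy return; both actions equally good
    estimates.append(g * score(a))           # single-episode gradient estimate

estimates = np.array(estimates)
print("mean:", estimates.mean(axis=0), "std:", estimates.std(axis=0))
# The mean is near zero (the true gradient), but each individual estimate
# has a magnitude driven almost entirely by the noise in the sampled return.
```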
This variance problem is a major challenge for basic policy gradient methods. Fortunately, there are ways to mitigate it. The next section introduces a common and effective technique: using baselines to reduce variance while keeping the gradient estimate unbiased.