Policy gradient methods, such as REINFORCE, provide a direct approach to optimizing parameterized policies. However, these methods frequently face significant practical difficulties, particularly in complex environments. Addressing these challenges drives the creation of more advanced Actor-Critic algorithms.
Recall the basic policy gradient update, derived from the Policy Gradient Theorem. For an objective function $J(\theta)$ representing the expected total return, the gradient is estimated as:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t \right]$$
Here, $\tau$ is a trajectory generated by following the policy $\pi_\theta$, and $G_t = \sum_{k=t}^{T-1} \gamma^{\,k-t} r_{k+1}$ is the discounted return starting from timestep $t$. In practice, this expectation is approximated using Monte Carlo sampling, averaging the gradient components over multiple trajectories collected using the current policy $\pi_\theta$.
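To make this estimate concrete, here is a minimal sketch of the corresponding surrogate loss in PyTorch. The names (`policy_net`, `trajectories`) and the discrete-action setup are illustrative assumptions, not part of the text above; minimizing this loss with a standard optimizer performs gradient ascent on the sampled policy gradient.

```python
import torch

def reinforce_loss(trajectories, policy_net, gamma=0.99):
    """Surrogate loss whose gradient matches the sampled policy gradient.

    `trajectories` is assumed to be a list of (states, actions, rewards)
    tuples collected with the current policy; `policy_net` maps a batch of
    states to action logits (discrete actions assumed).
    """
    losses = []
    for states, actions, rewards in trajectories:
        # Discounted return-to-go G_t for every timestep, computed backwards.
        returns, g = [], 0.0
        for r in reversed(rewards):
            g = r + gamma * g
            returns.append(g)
        returns = torch.tensor(list(reversed(returns)), dtype=torch.float32)

        # log pi_theta(a_t | s_t) for every step of the trajectory.
        logits = policy_net(torch.as_tensor(states, dtype=torch.float32))
        log_probs = torch.distributions.Categorical(logits=logits).log_prob(
            torch.as_tensor(actions))

        # Negative sign: minimizing this loss ascends the expected return.
        losses.append(-(log_probs * returns).sum())
    return torch.stack(losses).mean()
```

A typical update would then be `loss = reinforce_loss(batch, policy_net)`, followed by `optimizer.zero_grad()`, `loss.backward()`, and `optimizer.step()`.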
The most prominent issue with basic policy gradient methods like REINFORCE is the high variance of the gradient estimates. This variance stems directly from the use of the Monte Carlo return $G_t$ as the scaling factor for the policy gradient term $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$.
Consider why $G_t$ is noisy: it sums many individual rewards, each subject to randomness in the environment's transitions and reward function, and the remaining actions in the trajectory are themselves sampled from a stochastic policy. These sources of randomness compound over the horizon, so returns measured from the same state can differ substantially between rollouts.
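This compounding can be seen with a quick, purely illustrative simulation (the reward distribution below is made up): the standard deviation of the undiscounted return grows with the horizon even though the per-step noise is fixed.

```python
import numpy as np

rng = np.random.default_rng(0)
for horizon in (10, 100, 1000):
    # 1000 simulated trajectories with i.i.d. noisy rewards at each step.
    rewards = rng.normal(loc=1.0, scale=1.0, size=(1000, horizon))
    returns = rewards.sum(axis=1)  # undiscounted return G_0 per trajectory
    print(f"horizon={horizon:4d}  mean return={returns.mean():7.1f}  "
          f"std={returns.std():6.1f}")
```

Here the spread grows roughly with the square root of the horizon; in real environments, correlated dynamics and policy stochasticity can make the variance considerably worse.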
This high variance means that the gradient estimate obtained from a finite batch of trajectories can be very noisy. The estimated gradient direction might point far away from the true gradient direction, leading to several problems: updates that oscillate or even temporarily worsen the policy, sensitivity to the learning rate, and slow, unstable convergence.
Illustrative comparison of learning progress with high-variance gradient updates (typical of basic policy gradients) versus smoother, lower-variance updates. High variance can lead to erratic and slower overall improvement.
High variance directly contributes to sample inefficiency. Because each sample trajectory provides such a noisy estimate of the gradient, a large number of trajectories must be collected under the current policy to obtain a reasonably accurate update direction. This makes learning expensive in terms of interaction time and data requirements, especially compared to some value-based methods that can learn more effectively from individual transitions using bootstrapping (though bootstrapping introduces its own biases).
Furthermore, standard REINFORCE typically waits until the end of an episode to calculate the returns and perform updates. This means learning signals are delayed, and information from intermediate rewards is not used as promptly as in Temporal Difference (TD) methods.
Another related difficulty is the credit assignment problem. The basic REINFORCE algorithm updates the probability of all actions taken in a trajectory based on the total return $G_0$ (or often just $G_t$). If a trajectory yields a high total return, all actions within that trajectory are reinforced, even if some specific actions were actually detrimental but were counteracted by later lucky circumstances or good actions. Conversely, a single bad action leading to a poor overall return might unfairly penalize preceding good actions.
Using the return from the current time step onwards, $G_t$, instead of the total return $G_0$, helps alleviate this by only reinforcing actions based on subsequent rewards. However, $G_t$ still aggregates rewards over potentially many steps, making it difficult to isolate the immediate consequence of action $a_t$. The variance issue persists, as $G_t$ remains a noisy estimate of the action's true value.
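A tiny worked example (with made-up rewards) shows the difference in the weight each action's log-probability receives: with the total return, every action shares the same scaling factor, while the return-to-go credits each action only with the rewards that follow it.

```python
import numpy as np

gamma = 1.0  # undiscounted, to keep the arithmetic easy to follow
rewards = np.array([5.0, 1.0, -2.0, 0.0])  # hypothetical per-step rewards

# Total return G_0: every action in the trajectory gets the same weight.
g0 = rewards.sum()
weights_total = np.full_like(rewards, g0)

# Return-to-go G_t: each action is weighted only by rewards that follow it.
weights_togo = np.array([
    sum(gamma**k * r for k, r in enumerate(rewards[t:]))
    for t in range(len(rewards))
])

print("G_0 weighting :", weights_total)  # [4. 4. 4. 4.]
print("G_t weighting :", weights_togo)   # [ 4. -1. -2.  0.]
```

Under the total return, the action at $t=1$ would still be reinforced even though the rewards that follow it sum to $-1$; the return-to-go weighting captures that distinction, though $G_t$ remains a noisy, long-horizon estimate.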
These challenges (high variance, sample inefficiency, and difficult credit assignment) necessitate improvements over the basic policy gradient formulation. Actor-Critic methods, which we explore next, directly target the high variance issue by introducing a learned value function (the critic) to provide more stable and informative evaluations of the actor's actions, replacing or augmenting the noisy Monte Carlo returns $G_t$. This forms the basis for developing more stable and efficient policy optimization algorithms.
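As a preview of that idea, the sketch below (assuming PyTorch, a placeholder `value_net` critic, and a one-step TD target, which is only one of several possible critic designs) replaces $G_t$ with an advantage estimate computed from a learned state-value function.

```python
import torch

def actor_critic_losses(states, actions, rewards, next_states, dones,
                        policy_net, value_net, gamma=0.99):
    """One-step Actor-Critic losses; all network names are placeholders."""
    states = torch.as_tensor(states, dtype=torch.float32)
    next_states = torch.as_tensor(next_states, dtype=torch.float32)
    rewards = torch.as_tensor(rewards, dtype=torch.float32)
    dones = torch.as_tensor(dones, dtype=torch.float32)
    actions = torch.as_tensor(actions)

    values = value_net(states).squeeze(-1)        # V(s_t), learned critic
    with torch.no_grad():
        next_values = value_net(next_states).squeeze(-1)
    # One-step TD target: bootstrapping instead of a full Monte Carlo return.
    td_target = rewards + gamma * (1.0 - dones) * next_values
    advantage = td_target - values                # critic's evaluation of the action

    log_probs = torch.distributions.Categorical(
        logits=policy_net(states)).log_prob(actions)

    # The detached advantage replaces the noisy G_t in the policy gradient term.
    actor_loss = -(log_probs * advantage.detach()).mean()
    critic_loss = advantage.pow(2).mean()         # regression toward the TD target
    return actor_loss, critic_loss
```

The lower-variance bootstrapped target comes at the cost of bias from the learned critic, a trade-off examined in the sections that follow.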