While the foundational policy gradient methods, such as REINFORCE, offer a direct way to optimize parameterized policies, they often encounter significant practical difficulties, particularly in complex environments. Understanding these challenges motivates the development of the more advanced Actor-Critic algorithms discussed later in this chapter.
Recall the basic policy gradient update, derived from the Policy Gradient Theorem. For an objective function J(θ) representing the expected total return, the gradient is estimated as:
$$
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right]
$$

Here, $\tau$ is a trajectory $(s_0, a_0, r_1, s_1, a_1, \dots, s_{T-1}, a_{T-1}, r_T, s_T)$ generated by following the policy $\pi_\theta$, and $G_t = \sum_{k=t}^{T-1} \gamma^{k-t} R_{k+1}$ is the discounted return starting from timestep $t$. In practice, this expectation is approximated using Monte Carlo sampling, averaging the gradient components over multiple trajectories collected using the current policy $\pi_\theta$.
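To make this estimate concrete, here is a minimal sketch of a single REINFORCE update in PyTorch. The policy network, the 4-dimensional state, the 2 discrete actions, and the random stand-in rewards are all assumptions made purely for illustration; they do not correspond to any particular environment.

```python
import torch
from torch.distributions import Categorical

# Toy softmax policy standing in for pi_theta (assumed 4-dim states, 2 actions).
torch.manual_seed(0)
policy = torch.nn.Sequential(torch.nn.Linear(4, 32), torch.nn.Tanh(),
                             torch.nn.Linear(32, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
gamma = 0.99

# Simulate one episode with placeholder dynamics: random states and rewards.
log_probs, rewards = [], []
for t in range(50):
    state = torch.randn(4)                    # stand-in for an observation s_t
    dist = Categorical(logits=policy(state))  # pi_theta(. | s_t)
    action = dist.sample()
    log_probs.append(dist.log_prob(action))
    rewards.append(float(torch.randn(())))    # stand-in for r_{t+1}

# Compute the discounted return-to-go G_t for every timestep.
returns = []
G = 0.0
for r in reversed(rewards):
    G = r + gamma * G
    returns.insert(0, G)
returns = torch.tensor(returns)

# Monte Carlo policy gradient: maximize sum_t log pi(a_t|s_t) * G_t,
# i.e. minimize its negative.
loss = -(torch.stack(log_probs) * returns).sum()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```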
The most prominent issue with basic policy gradient methods like REINFORCE is the high variance of the gradient estimates. This variance stems directly from the use of the Monte Carlo return $G_t$ as the scaling factor for the policy gradient term $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$.
Consider why $G_t$ is noisy: it sums rewards over potentially many timesteps, and each of those rewards depends on the stochastic action choices of the policy, the environment's (possibly stochastic) transitions, and any randomness in the rewards themselves. Small differences early in a trajectory compound over the remaining horizon, so two trajectories passing through the same state can produce very different values of $G_t$.
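A quick numerical toy makes the compounding effect visible. The snippet below is a standalone illustration, not tied to any environment: it isolates reward noise only (each reward is a noisy signal with mean 1) and shows how the spread of the Monte Carlo return $G_0$ grows with the horizon.

```python
import numpy as np

# Each reward is a noisy +1 signal; G_0 is its discounted sum over the horizon.
# Longer horizons accumulate more randomness, so the return spreads out.
rng = np.random.default_rng(0)
gamma = 0.99
num_trajectories = 10_000

for horizon in (10, 50, 200):
    # rewards[i, t] = reward at step t of trajectory i (mean 1, std 1)
    rewards = 1.0 + rng.standard_normal((num_trajectories, horizon))
    discounts = gamma ** np.arange(horizon)
    returns = rewards @ discounts              # G_0 for each trajectory
    print(f"horizon={horizon:4d}  mean G_0 = {returns.mean():6.2f}  "
          f"std G_0 = {returns.std():5.2f}")
```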
This high variance means that the gradient estimate obtained from a finite batch of trajectories can be very noisy. The estimated gradient direction may point far from the true gradient direction, so successive updates tend to be erratic, small learning rates are needed to avoid destabilizing the policy, and overall convergence is slow.
Illustrative comparison of learning progress with high-variance gradient updates (typical of basic policy gradients) versus smoother, lower-variance updates. High variance can lead to erratic and slower overall improvement.
High variance directly contributes to sample inefficiency. Because each sample trajectory provides such a noisy estimate of the gradient, a large number of trajectories must be collected under the current policy to obtain a reasonably accurate update direction. This makes learning expensive in terms of interaction time and data requirements, especially compared to some value-based methods that can learn more effectively from individual transitions using bootstrapping (though bootstrapping introduces its own biases).
Furthermore, standard REINFORCE typically waits until the end of an episode to calculate the returns $G_t$ and perform updates. This means learning signals are delayed, and information from intermediate rewards is not used as promptly as in Temporal Difference (TD) methods.
Another related difficulty is the credit assignment problem. The basic REINFORCE algorithm updates the probability of all actions taken in a trajectory based on the total return $G_t$ (or often just $G_0$). If a trajectory yields a high total return, all actions within that trajectory are reinforced, even if some specific actions were actually detrimental but were counteracted by later lucky circumstances or good actions. Conversely, a single bad action leading to a poor overall return might unfairly penalize preceding good actions.
Using the return from the current time step onwards, $G_t$, instead of the total return $G_0$, helps alleviate this by only reinforcing actions based on subsequent rewards. However, $G_t$ still aggregates rewards over potentially many steps, making it difficult to isolate the immediate consequence of action $a_t$. The variance issue persists, as $G_t$ remains a noisy estimate of the action's true value.
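The distinction between scaling by $G_0$ and by $G_t$ is easy to see in code. The helper below is an illustrative standalone function (not part of any library); the toy reward sequence is made up for the example. With $G_0$, every action in the episode is scaled by the same number; with $G_t$, each action is scaled only by rewards that follow it.

```python
import numpy as np

def returns_to_go(rewards, gamma):
    """Compute G_t = sum_{k=t} gamma^(k-t) * r_{k+1} for every timestep t."""
    G = 0.0
    out = np.zeros(len(rewards))
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G
        out[t] = G
    return out

rewards = [0.0, 0.0, 1.0, 0.0, -1.0, 2.0]   # toy reward sequence
gamma = 0.9

Gt = returns_to_go(rewards, gamma)
G0 = Gt[0]

# With G_0, the action at t=4 is still credited for the earlier reward at t=2;
# with G_t, its scaling factor only reflects rewards from t=4 onwards.
print("total return G_0: ", round(G0, 3))
print("rewards-to-go G_t:", np.round(Gt, 3))
```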
These challenges (high variance, sample inefficiency, and difficult credit assignment) necessitate improvements over the basic policy gradient formulation. Actor-Critic methods, which we explore next, directly target the high variance issue by introducing a learned value function (the critic) to provide more stable and informative evaluations of the actor's actions, replacing or augmenting the noisy Monte Carlo returns $G_t$. This forms the basis for developing more stable and efficient policy optimization algorithms.
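As a brief preview of that idea, the sketch below shows how a learned critic might stand in for the noisy return: the scalar that scales $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ becomes a one-step TD-style advantage instead of $G_t$. Everything here is a toy stand-in (assumed 4-dimensional states, 2 actions, a single placeholder transition), not the full algorithm developed in the next section.

```python
import torch
from torch.distributions import Categorical

# Toy actor and critic networks (assumed 4-dim states, 2 discrete actions).
actor = torch.nn.Sequential(torch.nn.Linear(4, 32), torch.nn.Tanh(),
                            torch.nn.Linear(32, 2))
critic = torch.nn.Sequential(torch.nn.Linear(4, 32), torch.nn.Tanh(),
                             torch.nn.Linear(32, 1))
gamma = 0.99

# Placeholder transition (s_t, a_t, r_{t+1}, s_{t+1}) for illustration only.
state, next_state = torch.randn(4), torch.randn(4)
reward, done = 1.0, False

dist = Categorical(logits=actor(state))
action = dist.sample()

# One-step TD error used as a lower-variance stand-in for G_t - V(s_t).
with torch.no_grad():
    target = reward + gamma * critic(next_state) * (0.0 if done else 1.0)
advantage = target - critic(state)

# The actor is scaled by the advantage (detached), the critic is regressed
# toward the TD target; each loss would then be minimized by its own optimizer.
actor_loss = -dist.log_prob(action) * advantage.detach()
critic_loss = advantage.pow(2)
```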