Value-based methods like DQN have limitations, especially with continuous actions or when stochastic policies are needed. Policy gradient methods offer a different path: directly optimizing the parameters of the policy itself. But how do we actually do that? How do we know which way to adjust the policy parameters to improve performance?
Our goal is to find policy parameters $\theta$ that maximize the expected total return. We can define an objective function, often denoted as $J(\theta)$, which represents the expected return when following policy $\pi_\theta$. For instance, in an episodic task, this could be the expected cumulative discounted reward starting from an initial state distribution:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ G(\tau) \right] = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T} \gamma^t R_{t+1} \right]$$

Here, $\tau$ represents a full trajectory generated by following the policy $\pi_\theta$, $\gamma$ is the discount factor, $R_{t+1}$ is the reward received after taking action $a_t$ in state $s_t$, and $G(\tau)$ is the total discounted return for the entire episode.
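To make the quantity $G(\tau)$ concrete, here is a minimal sketch in plain Python (the reward list and discount factor are purely illustrative) of computing the total discounted return for one episode:

```python
def discounted_return(rewards, gamma=0.99):
    """Compute G(tau) = sum over t of gamma^t * R_{t+1} for one episode."""
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total

# Example: a short three-step episode with rewards 1.0, 0.0, 2.0 and gamma = 0.9
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.0 + 0.81 * 2.0 = 2.62
```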
To maximize $J(\theta)$, we can use gradient ascent. The update rule for the policy parameters would look like this:

$$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$$

where $\alpha$ is the learning rate and $\nabla_\theta J(\theta)$ is the gradient of the objective function with respect to the policy parameters. The core challenge lies in calculating this gradient, $\nabla_\theta J(\theta)$. How does changing the policy parameters affect the expected total return? This involves not only how the policy chooses actions but also how the resulting trajectories and reward distributions change, which seems complicated to compute directly.
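The update step itself is simple; here is a small sketch (assuming, for the moment, that some estimate of $\nabla_\theta J(\theta)$ is already available):

```python
import numpy as np

def gradient_ascent_step(theta, grad_J, alpha=0.01):
    """One gradient ascent update: theta <- theta + alpha * grad_J.
    grad_J is assumed to be an estimate of the gradient of J(theta)."""
    return theta + alpha * np.asarray(grad_J)

# Illustrative usage with made-up numbers
theta = np.zeros(3)
theta = gradient_ascent_step(theta, grad_J=[0.5, -0.2, 0.1], alpha=0.1)
```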
This is where the Policy Gradient Theorem comes in. It provides a way to compute this gradient without needing to know or differentiate the environment's dynamics (the transition probabilities $P(s' \mid s, a)$). This theorem is foundational for all policy gradient algorithms.
One common form of the Policy Gradient Theorem states that the gradient of the objective function is given by:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, G(\tau) \right]$$
Let's break down this important expression:

- The expectation $\mathbb{E}_{\tau \sim \pi_\theta}[\cdot]$ is taken over trajectories generated by running the current policy $\pi_\theta$ in the environment.
- $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ is the gradient of the log-probability of the action actually taken at time step $t$.
- $G(\tau)$ is the total discounted return of the trajectory, which weights each of these gradient terms.
You might wonder how we arrived at the $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ term. It comes from a useful identity sometimes called the "log-derivative trick" or "likelihood ratio trick". The gradient of the policy can be rewritten as:

$$\nabla_\theta \pi_\theta(a \mid s) = \pi_\theta(a \mid s) \, \nabla_\theta \log \pi_\theta(a \mid s)$$

This reformulation is significant because it allows us to express the gradient as an expectation involving $\pi_\theta$ itself. This means we can estimate the gradient using samples (actions $a_t$) drawn from the current policy $\pi_\theta$, without needing to explicitly calculate how changes in $\theta$ affect the state visitation distribution.
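Both sides of this identity can be evaluated with automatic differentiation, so a quick numerical check is easy. The sketch below assumes a small softmax policy over three actions in a single hypothetical state; PyTorch is used here purely for its autograd:

```python
import torch

# Numerical check of the log-derivative trick for a small softmax policy.
# theta holds action preferences for one hypothetical state (values are arbitrary).
theta = torch.tensor([0.2, -0.5, 1.0], requires_grad=True)
a = 1  # an arbitrary action index

# Left-hand side: gradient of pi_theta(a|s) taken directly
prob = torch.softmax(theta, dim=0)[a]
grad_prob, = torch.autograd.grad(prob, theta)

# Right-hand side: pi_theta(a|s) * gradient of log pi_theta(a|s)
log_prob = torch.log(torch.softmax(theta, dim=0)[a])
grad_log_prob, = torch.autograd.grad(log_prob, theta)

print(torch.allclose(grad_prob, prob.detach() * grad_log_prob))  # True
```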
The Policy Gradient Theorem provides an intuitive update mechanism. The $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ term indicates how to change $\theta$ to increase (or decrease) the probability of action $a_t$ in state $s_t$. The $G(\tau)$ term acts as a weighting factor for this update direction.
Essentially, the policy gradient method reinforces actions proportionally to the total return they ultimately lead to.
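The sketch below shows how a single sampled trajectory yields a gradient estimate of this form. It assumes a hypothetical linear softmax policy over two actions with four-dimensional states; all names, sizes, and the dummy data are illustrative only:

```python
import torch

# A minimal sketch: a small linear softmax policy over 2 actions, 4-dim states.
policy = torch.nn.Linear(4, 2)

def surrogate_loss(states, actions, rewards, gamma=0.99):
    """Loss whose gradient is the negative of the sampled policy gradient
    estimate sum_t grad log pi_theta(a_t|s_t) * G(tau)."""
    logits = policy(states)                                         # shape (T, 2)
    log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)
    discounts = gamma ** torch.arange(len(rewards), dtype=torch.float32)
    G = (discounts * rewards).sum()                                  # G(tau)
    # Minimizing the negative surrogate performs gradient ascent on J(theta)
    return -(log_probs.sum() * G)

# Usage with dummy data for a three-step trajectory
states = torch.randn(3, 4)
actions = torch.tensor([0, 1, 1])
rewards = torch.tensor([1.0, 0.0, 2.0])
loss = surrogate_loss(states, actions, rewards)
loss.backward()  # populates policy.weight.grad and policy.bias.grad with the loss gradient
```

A standard optimizer step on this loss then implements the gradient ascent update described earlier.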
The Policy Gradient Theorem provides the theoretical justification for algorithms like REINFORCE, which we will explore next. REINFORCE directly uses Monte Carlo sampling (running full episodes) to estimate the expectation in the theorem and update the policy parameters. As we'll see, while straightforward, this approach can suffer from high variance in the gradient estimates, motivating further refinements like using baselines.