The core idea of value-based methods like Q-Learning is to learn the value of state-action pairs, $Q(s, a)$, and then derive a policy (often implicitly, as with epsilon-greedy) based on these values. Policy Gradient methods offer a fundamentally different approach: they directly parameterize the policy and optimize its parameters to maximize expected return.
Let's denote the policy as $\pi_\theta(a \mid s)$, representing the probability of taking action $a$ in state $s$ under policy parameters $\theta$. The goal is to find the parameters $\theta$ that maximize a performance objective function, typically the expected total discounted reward starting from an initial state distribution. We can define this objective as:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T} \gamma^t R_{t+1} \right]$$

Here, $\tau$ represents a trajectory generated by following policy $\pi_\theta$, $\gamma \in [0, 1]$ is the discount factor, and $R_{t+1}$ is the reward received after taking action $a_t$ in state $s_t$.
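To make the objective concrete, here is a minimal sketch in Python with NumPy (the section itself shows no code, so the function names and sample values are illustrative assumptions) that estimates $J(\theta)$ by averaging the discounted return over a batch of reward sequences sampled under the current policy:

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Total discounted reward of one trajectory: sum_t gamma^t * R_{t+1}."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

def estimate_objective(sampled_reward_sequences, gamma=0.99):
    """Monte Carlo estimate of J(theta): the average discounted return
    over trajectories sampled by following the current policy."""
    returns = [discounted_return(rs, gamma) for rs in sampled_reward_sequences]
    return np.mean(returns)

# Example: reward sequences from three short trajectories (illustrative values only)
trajectories = [[0.0, 0.0, 1.0], [0.0, 1.0], [0.0, 0.0, 0.0, 1.0]]
print(estimate_objective(trajectories))
```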
The core challenge is computing the gradient of this objective function with respect to the policy parameters, $\nabla_\theta J(\theta)$. Directly differentiating the objective seems difficult because the expectation depends on the distribution of trajectories induced by $\pi_\theta$, which itself depends on $\theta$. The Policy Gradient Theorem provides a convenient way around this. It states that the gradient of the objective function can be expressed as:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, Q^{\pi_\theta}(s_t, a_t) \right]$$
Alternatively, using expectation notation over the states and actions visited under $\pi_\theta$:

$$\nabla_\theta J(\theta) \propto \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta}\left[ \nabla_\theta \log \pi_\theta(a \mid s) \, Q^{\pi_\theta}(s, a) \right]$$

where $d^{\pi_\theta}(s)$ is the stationary state distribution under policy $\pi_\theta$. The term $\nabla_\theta \log \pi_\theta(a \mid s)$ is sometimes called the "score function". It tells us how to adjust the parameters $\theta$ to increase the log-probability of taking action $a$ in state $s$. The theorem tells us to weight this direction by the action-value function $Q^{\pi_\theta}(s, a)$. Intuitively, this makes sense: we want to increase the probability of actions that lead to higher-than-average returns and decrease the probability of actions that lead to lower-than-average returns.
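As a concrete illustration, for a discrete-action softmax policy with linear action preferences $h(s, a, \theta) = \theta^\top \phi(s, a)$, the score function has the closed form $\nabla_\theta \log \pi_\theta(a \mid s) = \phi(s, a) - \sum_b \pi_\theta(b \mid s)\, \phi(s, b)$. The NumPy sketch below computes it directly; the feature matrix and its shapes are illustrative assumptions, not anything defined in this section:

```python
import numpy as np

def softmax(x):
    z = x - x.max()            # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def score_function(theta, features):
    """Score function of a linear-softmax policy at one state.

    features: array of shape (num_actions, num_params); row a is phi(s, a).
    Returns an array of the same shape whose row a equals
    grad_theta log pi_theta(a | s) = phi(s, a) - E_pi[phi(s, .)].
    """
    prefs = features @ theta           # action preferences h(s, a, theta)
    probs = softmax(prefs)             # pi_theta(. | s)
    expected_feat = probs @ features   # E_pi[phi(s, .)]
    return features - expected_feat    # one score vector per action

# Tiny example with 3 actions and 4 parameters (arbitrary values)
rng = np.random.default_rng(0)
theta = rng.normal(size=4)
phi = rng.normal(size=(3, 4))
print(score_function(theta, phi))
```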
The Policy Gradient Theorem provides the form of the gradient, but we still need $Q^{\pi_\theta}(s, a)$, which is generally unknown. The REINFORCE algorithm (also known as Monte Carlo Policy Gradient) offers a simple way to estimate this gradient using sampled trajectories.
REINFORCE uses the complete return from time step $t$ onwards, $G_t = \sum_{k=t}^{T} \gamma^{k-t} R_{k+1}$, as an unbiased sample estimate of $Q^{\pi_\theta}(s_t, a_t)$. Since we are sampling trajectories, we can approximate the expectation with an average over sampled trajectories. For a single trajectory, the update direction for each time step $t$ is $\nabla_\theta \log \pi_\theta(a_t \mid s_t) \, G_t$.
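The returns $G_t$ for every step of a trajectory can be computed in a single backward pass over the rewards. A minimal NumPy sketch (variable names are illustrative):

```python
import numpy as np

def returns_to_go(rewards, gamma=0.99):
    """Compute G_t = R_{t+1} + gamma * G_{t+1} for every step, working backwards."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

print(returns_to_go([0.0, 0.0, 1.0], gamma=0.9))  # [0.81, 0.9, 1.0]
```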
The basic REINFORCE algorithm proceeds as follows (a code sketch of the full loop appears after this list):

1. Initialize the policy parameters $\theta$ (e.g., randomly).
2. Generate a complete episode $s_0, a_0, R_1, s_1, a_1, R_2, \dots, s_{T-1}, a_{T-1}, R_T$ by following $\pi_\theta$.
3. For each time step $t$ in the episode, compute the return $G_t$.
4. Update the parameters: $\theta \leftarrow \theta + \alpha \, G_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$, accumulated over the episode.
5. Repeat from step 2 until the policy converges.
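Below is a compact sketch of this loop in PyTorch, assuming a Gymnasium-style environment with a discrete action space. The environment name, network size, and hyperparameters are illustrative assumptions rather than choices made in this section:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical
import gymnasium as gym

env = gym.make("CartPole-v1")                     # illustrative environment choice
obs_dim = env.observation_space.shape[0]
n_actions = env.action_space.n

# Policy network: state -> action probabilities (its weights are the parameters theta)
policy = nn.Sequential(
    nn.Linear(obs_dim, 64), nn.Tanh(),
    nn.Linear(64, n_actions), nn.Softmax(dim=-1),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
gamma = 0.99

for episode in range(500):
    # 1. Generate an episode with the current policy, storing log-probs and rewards
    obs, _ = env.reset()
    log_probs, rewards = [], []
    done = False
    while not done:
        probs = policy(torch.as_tensor(obs, dtype=torch.float32))
        dist = Categorical(probs)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        obs, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(float(reward))
        done = terminated or truncated

    # 2. Compute the return G_t for every time step (backward pass over rewards)
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.insert(0, running)
    returns = torch.tensor(returns)

    # 3. Gradient ascent on J(theta): minimize -sum_t log pi(a_t | s_t) * G_t
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```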
Here's a visual representation of the REINFORCE update logic:
A schematic overview of the REINFORCE algorithm: generating a trajectory using the current policy $\pi_\theta$, calculating returns $G_t$ and score functions $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ for each step, and then updating the policy parameters $\theta$.
Policy gradient methods like REINFORCE have several advantages:

- They can learn genuinely stochastic policies, which is useful when optimal behavior requires randomness or when the environment is partially observable.
- They extend naturally to continuous and high-dimensional action spaces, where the maximization over actions required by Q-Learning becomes impractical.
- Small changes in the parameters produce small changes in the policy, often giving smoother convergence than value-based methods, where a small change in estimated values can abruptly change the greedy action.
However, REINFORCE also suffers from a significant disadvantage: the Monte Carlo return $G_t$ is a noisy estimate of the action value, so the gradient estimates have high variance, which makes learning slow and often unstable.
This high variance is a primary motivation for developing more advanced policy gradient methods, such as Actor-Critic algorithms (covered in Chapter 3), which use a learned value function (the critic) to reduce the variance of the policy gradient estimate (from the actor).
Just as value functions can be approximated using linear functions or neural networks for large state spaces, the policy can also be represented using function approximators. In deep reinforcement learning, $\pi_\theta(a \mid s)$ is typically represented by a neural network that takes the state as input and outputs the parameters of the action distribution (e.g., probabilities for discrete actions, or mean and standard deviation for continuous actions). The parameters $\theta$ then correspond to the weights and biases of this neural network. This integration with function approximation allows policy gradient methods to scale to complex problems with high-dimensional state and action spaces, forming the foundation of many advanced techniques explored later in this course.
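For example, a continuous-action policy network can output a mean and standard deviation for each action dimension. The PyTorch sketch below shows one common parameterization; the layer sizes and the choice of a diagonal Gaussian are illustrative assumptions:

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class GaussianPolicy(nn.Module):
    """Maps a state to a diagonal Gaussian distribution over continuous actions."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.mean_head = nn.Linear(hidden, act_dim)          # outputs the action mean
        self.log_std = nn.Parameter(torch.zeros(act_dim))    # learned log standard deviation

    def forward(self, obs: torch.Tensor) -> Normal:
        h = self.body(obs)
        return Normal(self.mean_head(h), self.log_std.exp())

# Usage: sample an action and get its log-probability for the policy gradient
policy = GaussianPolicy(obs_dim=3, act_dim=1)
dist = policy(torch.randn(3))
action = dist.sample()
log_prob = dist.log_prob(action).sum()   # sum over action dimensions
```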