Previous sections reviewed value-based methods like Q-Learning, where the core idea is to learn the value of state-action pairs, $Q(s, a)$, and then derive a policy (often implicitly, like epsilon-greedy) based on these values. Policy Gradient methods offer a fundamentally different approach: they directly parameterize the policy and optimize its parameters to maximize expected return.
Let's denote the policy as $\pi_\theta(a \mid s)$, representing the probability of taking action $a$ in state $s$ under policy parameters $\theta$. The goal is to find the parameters $\theta$ that maximize a performance objective function, typically the expected total discounted reward starting from an initial state distribution. We can define this objective as:
$$
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \gamma^t \, r(s_t, a_t)\right]
$$

Here, $\tau = (s_0, a_0, r_1, s_1, a_1, r_2, \ldots)$ represents a trajectory generated by following policy $\pi_\theta$, $\gamma$ is the discount factor, and $r(s_t, a_t)$ is the reward received after taking action $a_t$ in state $s_t$.
The core challenge is computing the gradient of this objective function with respect to the policy parameters, $\nabla_\theta J(\theta)$. Directly differentiating the objective seems difficult because the expectation depends on the distribution of trajectories induced by $\pi_\theta$, which itself depends on $\theta$. The Policy Gradient Theorem provides a convenient way around this. It states that the gradient of the objective function can be expressed as:
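To sketch why this is possible, one can use the log-derivative identity $\nabla_\theta p_\theta(\tau) = p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)$. (Here $p_\theta(\tau)$ denotes the probability of trajectory $\tau$ under $\pi_\theta$ and $R(\tau)$ its total discounted reward; this notation is introduced only for this brief derivation.)

$$
\nabla_\theta J(\theta) = \nabla_\theta \int p_\theta(\tau)\, R(\tau)\, d\tau
= \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, R(\tau)\, d\tau
= \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ R(\tau)\, \nabla_\theta \log p_\theta(\tau) \right]
$$

Because $\log p_\theta(\tau) = \log \rho(s_0) + \sum_t \log \pi_\theta(a_t \mid s_t) + \sum_t \log P(s_{t+1} \mid s_t, a_t)$, only the policy terms depend on $\theta$, so $\nabla_\theta \log p_\theta(\tau) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$; no knowledge of the transition dynamics $P$ or initial-state distribution $\rho$ is required. Grouping the reward terms by time step (rewards received before time $t$ do not depend on $a_t$) then yields the $Q^{\pi_\theta}(s_t, a_t)$ weighting stated below.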
$$
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q^{\pi_\theta}(s_t, a_t)\right]
$$

Alternatively, using expectation notation over the states and actions visited under $\pi_\theta$:
$$
\nabla_\theta J(\theta) = \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta(a \mid s)}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)\right]
$$

where $d^{\pi_\theta}(s)$ is the stationary state distribution under policy $\pi_\theta$. The term $\nabla_\theta \log \pi_\theta(a \mid s)$ is sometimes called the "score function". It tells us how to adjust the parameters $\theta$ to increase the log-probability of taking action $a$ in state $s$. The theorem tells us to weight this direction by the action-value function $Q^{\pi_\theta}(s, a)$. Intuitively, this makes sense: we want to increase the probability of actions that lead to higher-than-average returns and decrease the probability of actions that lead to lower-than-average returns.
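To make the score function concrete, here is a small illustrative sketch assuming a linear-softmax policy over discrete actions; the function names, feature sizes, and the action-value estimate are made up for this example and are not from the text.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def score_function(theta, phi, a):
    """Gradient of log pi_theta(a|s) for a linear-softmax policy.

    theta: (num_actions, num_features) parameter matrix
    phi:   (num_features,) feature vector for state s
    a:     index of the action that was taken
    """
    probs = softmax(theta @ phi)                   # pi_theta(.|s)
    indicator = np.zeros_like(probs)
    indicator[a] = 1.0
    # d/d theta_k log pi(a|s) = (1[k == a] - pi(k|s)) * phi(s)
    return np.outer(indicator - probs, phi)

# Policy-gradient contribution for one (s, a) pair, weighted by a Q estimate
theta = np.zeros((3, 4))                 # 3 actions, 4 features (illustrative sizes)
phi_s = np.array([1.0, 0.5, -0.2, 0.3])  # phi(s) for some state s (made up)
q_estimate = 2.0                         # stand-in for Q^{pi_theta}(s, a)
grad_contribution = q_estimate * score_function(theta, phi_s, a=1)
```

The update direction for this state-action pair is the score function scaled by the action value, exactly the term inside the expectation above.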
The Policy Gradient Theorem provides the form of the gradient, but we still need $Q^{\pi_\theta}(s_t, a_t)$, which is generally unknown. The REINFORCE algorithm (also known as Monte Carlo Policy Gradient) offers a simple way to estimate this gradient using sampled trajectories.
REINFORCE uses the complete return from time step $t$ onwards, $G_t = \sum_{k=t}^{T} \gamma^{k-t} r_{k+1}$, as an unbiased sample estimate of $Q^{\pi_\theta}(s_t, a_t)$. Since we are sampling trajectories, we can approximate the expectation with an average over sampled trajectories. For a single trajectory, the update direction at each time step $t$ is $G_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)$.
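In practice, all of the returns $G_t$ for one trajectory can be computed in a single backward pass using the recursion $G_t = r_{t+1} + \gamma G_{t+1}$. A small sketch (the helper name and sample rewards are illustrative, not from the text):

```python
def returns_to_go(rewards, gamma=0.99):
    """Compute G_t = sum_{k=t}^{T} gamma^(k-t) * r_{k+1} for every step t."""
    G = 0.0
    returns = []
    for r in reversed(rewards):          # backward pass: G_t = r_{t+1} + gamma * G_{t+1}
        G = r + gamma * G
        returns.append(G)
    return list(reversed(returns))

# rewards[t] holds r_{t+1}, the reward received after taking action a_t
rewards = [0.0, 0.0, 1.0]
print(returns_to_go(rewards, gamma=0.9))   # [0.81, 0.9, 1.0]
```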
The basic REINFORCE algorithm proceeds as follows:

1. Initialize the policy parameters $\theta$ (for example, randomly).
2. Generate a complete trajectory $\tau = (s_0, a_0, r_1, \ldots, s_T)$ by following the current policy $\pi_\theta$.
3. For each time step $t$ in the trajectory, compute the return $G_t$.
4. Update the parameters: $\theta \leftarrow \theta + \alpha \, G_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)$, where $\alpha$ is the learning rate.
5. Repeat from step 2 until the policy converges.
Schematically, the REINFORCE update loop consists of generating a trajectory using the current policy $\pi_\theta$, calculating the returns $G_t$ and score functions $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ for each step, and then updating the policy parameters $\theta$; the sketch below puts these steps into code.
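The following is a minimal, self-contained sketch of REINFORCE with a linear-softmax policy. Everything in it (the `ToyEnv` class, hyperparameters, and feature sizes) is invented for illustration and is not taken from the text; in practice $\pi_\theta$ would interact with a real environment and would usually be a neural network, as discussed below.

```python
import numpy as np

class ToyEnv:
    """Tiny stand-in MDP so the sketch runs without external dependencies:
    4-dimensional random state, 2 actions, action 1 pays more, 10 steps per episode."""
    def reset(self):
        self.t = 0
        return np.random.randn(4)

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 1 else 0.2
        done = self.t >= 10
        return np.random.randn(4), reward, done

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce(env, num_episodes=500, alpha=0.01, gamma=0.99,
              num_actions=2, num_features=4):
    theta = np.zeros((num_actions, num_features))    # linear-softmax policy parameters
    for _ in range(num_episodes):
        # 1. Generate one trajectory with the current policy pi_theta
        states, actions, rewards = [], [], []
        s, done = env.reset(), False
        while not done:
            probs = softmax(theta @ s)
            a = np.random.choice(num_actions, p=probs)
            s_next, r, done = env.step(a)
            states.append(s); actions.append(a); rewards.append(r)
            s = s_next
        # 2. Compute the returns-to-go G_t with a backward pass
        G, returns = 0.0, []
        for r in reversed(rewards):
            G = r + gamma * G
            returns.append(G)
        returns.reverse()
        # 3. Update theta along G_t * grad log pi_theta(a_t | s_t) for each step
        for s_t, a_t, G_t in zip(states, actions, returns):
            probs = softmax(theta @ s_t)
            indicator = np.zeros(num_actions); indicator[a_t] = 1.0
            grad_log_pi = np.outer(indicator - probs, s_t)
            theta += alpha * G_t * grad_log_pi
    return theta

theta = reinforce(ToyEnv())
```

Note that the raw return $G_t$ is used directly as the weight on each score function; this is what makes the estimate Monte Carlo, and also what makes its variance high.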
Policy gradient methods like REINFORCE have several advantages:

- They handle continuous and high-dimensional action spaces naturally, where maximizing over Q-values is impractical.
- They can learn stochastic policies, which can be valuable when some randomness in action selection is itself desirable.
- They optimize the policy directly, so small parameter changes lead to small policy changes, which can give smoother learning than value-based methods.
However, REINFORCE also suffers from a significant disadvantage: because each $G_t$ comes from a single sampled trajectory, its gradient estimates have high variance, which can make learning slow and unstable.
This high variance is a primary motivation for developing more advanced policy gradient methods, such as Actor-Critic algorithms (covered in Chapter 3), which use a learned value function (the critic) to reduce the variance of the gradient estimate used to update the policy (the actor).
Just as value functions can be approximated using linear functions or neural networks for large state spaces, the policy $\pi_\theta(a \mid s)$ can also be represented using function approximators. In deep reinforcement learning, $\pi_\theta$ is typically represented by a neural network that takes the state $s$ as input and outputs the parameters of the action distribution (e.g., probabilities for discrete actions, or mean and standard deviation for continuous actions). The parameters $\theta$ then correspond to the weights and biases of this neural network. This integration with function approximation allows policy gradient methods to scale to complex problems with high-dimensional state and action spaces, forming the bedrock of many advanced techniques explored later in this course.
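As a final illustration, here is a brief sketch of such a neural-network policy for discrete actions, assuming PyTorch; the class name, layer sizes, and placeholder state are made up for this example. Autograd supplies $\nabla_\theta \log \pi_\theta(a \mid s)$, so the REINFORCE update reduces to backpropagating $-G_t \log \pi_\theta(a_t \mid s_t)$.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class PolicyNetwork(nn.Module):
    """Neural-network policy: state in, action-distribution parameters out.
    (Architecture and sizes are illustrative, not prescribed by the text.)"""
    def __init__(self, state_dim, num_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_actions),   # logits of a categorical distribution
        )

    def forward(self, state):
        return Categorical(logits=self.net(state))

policy = PolicyNetwork(state_dim=4, num_actions=2)
state = torch.randn(4)                 # placeholder state
dist = policy(state)                   # pi_theta(. | s)
action = dist.sample()                 # sample a ~ pi_theta(. | s)
log_prob = dist.log_prob(action)       # log pi_theta(a | s), differentiable w.r.t. theta
# (-log_prob * G_t).backward() would accumulate the REINFORCE gradient for this step.
```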