Okay, we've established that value-based methods like DQN have limitations, especially with continuous actions or when stochastic policies are needed. Policy gradient methods offer a different path: directly optimizing the parameters of the policy itself. But how do we actually do that? How do we know which way to adjust the policy parameters θ to improve performance?
Our goal is to find policy parameters $\theta$ that maximize the expected total return. We can define an objective function, often denoted as $J(\theta)$, which represents the expected return when following policy $\pi_\theta$. For instance, in an episodic task, this could be the expected cumulative discounted reward starting from an initial state distribution:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T-1} \gamma^t r_{t+1}\right] = \mathbb{E}_{\tau \sim \pi_\theta}\left[G_0\right]$$

Here, $\tau$ represents a full trajectory $(s_0, a_0, r_1, s_1, \dots)$ generated by following the policy $\pi_\theta(a \mid s)$, $\gamma$ is the discount factor, $r_{t+1}$ is the reward received after taking action $a_t$ in state $s_t$, and $G_0$ is the total discounted return for the entire episode.
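To make this concrete, here is a minimal Python sketch (the function name and the example rewards are purely illustrative) that computes $G_0$ for a single sampled episode; $J(\theta)$ would then be approximated by averaging this quantity over many episodes generated with $\pi_\theta$.

```python
def discounted_return(rewards, gamma=0.99):
    """G_0 = sum_t gamma^t * r_{t+1}, computed from one episode's reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Example: rewards r_1, r_2, r_3 from a hypothetical 3-step episode.
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
```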
To maximize $J(\theta)$, we can use gradient ascent. The update rule for the policy parameters would look like this:

$$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$$

where $\alpha$ is the learning rate and $\nabla_\theta J(\theta)$ is the gradient of the objective function with respect to the policy parameters. The core challenge lies in calculating this gradient, $\nabla_\theta J(\theta)$. How does changing the policy parameters $\theta$ affect the expected total return? This involves not only how the policy chooses actions but also how the resulting trajectories and reward distributions change, which seems complicated to compute directly.
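As a sketch of what this optimization loop looks like in code, assuming some routine `estimate_policy_gradient` (a hypothetical placeholder) that returns an estimate of $\nabla_\theta J(\theta)$ from trajectories sampled with the current policy; one concrete way to form such an estimate for a simple softmax policy is sketched toward the end of this section.

```python
def train(theta, estimate_policy_gradient, alpha=0.01, n_iterations=1000):
    """Generic gradient ascent on the policy parameters theta (e.g., a NumPy array)."""
    for _ in range(n_iterations):
        grad = estimate_policy_gradient(theta)  # estimate of grad_theta J(theta)
        theta = theta + alpha * grad            # ascent: step *up* the gradient
    return theta
```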
This is where the Policy Gradient Theorem comes in. It provides a way to compute this gradient without needing to know or differentiate the environment's dynamics (the transition probabilities $p(s' \mid s, a)$). This theorem is foundational for all policy gradient algorithms.
One common form of the Policy Gradient Theorem states that the gradient of the objective function is given by:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right]$$

Let's break down this important expression:
You might wonder how we arrived at the $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ term. It comes from a useful identity sometimes called the "log-derivative trick" or "likelihood ratio trick". The gradient of the policy, $\nabla_\theta \pi_\theta(a \mid s)$, can be rewritten as:
$$\nabla_\theta \pi_\theta(a \mid s) = \pi_\theta(a \mid s) \frac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_\theta(a \mid s)} = \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)$$

This reformulation is significant because it allows us to express the gradient $\nabla_\theta J(\theta)$ as an expectation involving $\pi_\theta$ itself. This means we can estimate the gradient using samples (actions $a_t$) drawn from the current policy $\pi_\theta$, without needing to explicitly calculate how changes in $\theta$ affect the state visitation distribution.
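The identity is easy to verify numerically. The sketch below (the softmax parameterization and the specific logits are illustrative choices, not anything fixed by the theorem) compares $\pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)$, computed analytically for a softmax policy, against a finite-difference estimate of $\nabla_\theta \pi_\theta(a \mid s)$:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

# A tiny softmax policy over 3 actions in a single state,
# parameterized directly by its logits theta (illustrative values).
theta = np.array([0.2, -0.5, 1.0])
a = 2                                  # action whose probability we differentiate

pi = softmax(theta)

# Analytic score function of a softmax policy:
#   grad_theta log pi(a) = onehot(a) - pi
grad_log_pi = np.eye(3)[a] - pi

# Finite-difference estimate of grad_theta pi(a).
eps = 1e-6
grad_pi_fd = np.array([
    (softmax(theta + eps * np.eye(3)[i])[a]
     - softmax(theta - eps * np.eye(3)[i])[a]) / (2 * eps)
    for i in range(3)
])

# Log-derivative trick: grad pi(a|s) = pi(a|s) * grad log pi(a|s).
print(np.allclose(grad_pi_fd, pi[a] * grad_log_pi, atol=1e-6))  # True
```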
The Policy Gradient Theorem provides an intuitive update mechanism. The term $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ indicates how to change $\theta$ to increase (or decrease) the probability of taking action $a_t$ in state $s_t$. The term $G_t = \sum_{k=t}^{T-1} \gamma^{k-t} r_{k+1}$, the discounted return collected from time step $t$ onward, acts as a weighting factor for this update direction.
Essentially, the policy gradient method reinforces each action in proportion to the return that follows it: actions followed by high returns become more probable under the updated policy, while actions followed by low or negative returns become less probable.
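Putting these pieces together, here is a minimal sketch of how the expectation in the theorem could be estimated from a single sampled episode, using a simple tabular softmax policy (the state/action sizes, episode data, and function names are illustrative; a practical implementation would average the estimate over many episodes):

```python
import numpy as np

n_states, n_actions, gamma = 4, 2, 0.99
theta = np.zeros((n_states, n_actions))   # one row of action logits per state

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def returns_to_go(rewards):
    """G_t = r_{t+1} + gamma * G_{t+1}, computed backwards over one episode."""
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return out[::-1]

def policy_gradient_estimate(states, actions, rewards):
    """Single-episode estimate of grad_theta J: sum_t grad log pi(a_t|s_t) * G_t."""
    grad = np.zeros_like(theta)
    G = returns_to_go(rewards)
    for t, (s, a) in enumerate(zip(states, actions)):
        pi = softmax(theta[s])
        grad[s] += (np.eye(n_actions)[a] - pi) * G[t]   # softmax score, weighted by G_t
    return grad

# Hypothetical 3-step episode: visited states, chosen actions, rewards r_1..r_3.
grad = policy_gradient_estimate(states=[0, 1, 3], actions=[1, 0, 1], rewards=[0.0, 1.0, 5.0])
theta += 0.01 * grad   # one gradient ascent step on the policy parameters
```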
The Policy Gradient Theorem provides the theoretical justification for algorithms like REINFORCE, which we will explore next. REINFORCE directly uses Monte Carlo sampling (running full episodes) to estimate the expectation in the theorem and update the policy parameters. As we'll see, while conceptually straightforward, this approach can suffer from high variance in the gradient estimates, motivating further refinements like using baselines.