Okay, we know our goal is to adjust the parameters θ of our policy πθ(a∣s) to make the agent perform better, specifically to maximize the expected total return. Let's define an objective function J(θ) that represents this expected return, typically measured from some initial state distribution. Our aim is to find the parameters θ that maximize J(θ).
How do we achieve this maximization? If our objective function J(θ) were simple, we might solve for the maximum directly. But here, J(θ) depends in a complicated way on the interaction between the policy and the environment dynamics over potentially long sequences of states, actions, and rewards. A common approach for optimizing functions like this is gradient ascent. We want to update our parameters θ in the direction that increases J(θ) the most:
θnew ← θold + α ∇θJ(θ)

Here, α is a learning rate, and ∇θJ(θ) is the gradient of the objective function with respect to the policy parameters. The challenge lies in calculating this gradient, ∇θJ(θ). How does changing the policy parameters θ influence the total expected future reward, considering that this reward depends on the sequence of states visited and actions taken, which themselves depend on θ?
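To make the update rule concrete, here is a minimal sketch of a generic gradient-ascent loop. The function grad_J_estimate is a hypothetical placeholder for whatever gradient estimate we end up with; producing that estimate is exactly what the rest of this section is about.

```python
import numpy as np

def gradient_ascent(theta, grad_J_estimate, alpha=0.01, num_steps=1000):
    """Generic gradient ascent on J(theta).

    grad_J_estimate is a hypothetical placeholder: any function that returns
    an estimate of the gradient of J with respect to theta.
    """
    theta = np.asarray(theta, dtype=float)
    for _ in range(num_steps):
        theta = theta + alpha * grad_J_estimate(theta)  # step uphill on J
    return theta
```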
This is where the Policy Gradient Theorem comes into play. It provides a fundamental insight and a practical way to compute or estimate this gradient without needing to know the environment's dynamics (like the state transition probabilities). The theorem establishes an analytical expression for the gradient ∇θJ(θ) that relates it directly to the policy πθ(a∣s) and the value associated with taking actions.
While the full derivation involves some calculus, the core result (in one common form) tells us that the gradient is proportional to an expectation:
∇θJ(θ) ∝ Eπθ[∇θ log πθ(a∣s) ⋅ Gt]

Let's break down the components inside the expectation Eπθ[⋅], which is taken over trajectories generated by following the current policy πθ:
- ∇θ log πθ(a∣s): the gradient of the log-probability of taking action a in state s, often called the score. It points in the direction in parameter space that most increases the probability of selecting a in s.
- Gt: the return, the (possibly discounted) cumulative reward collected from time step t onward, which measures how good the outcome following that action turned out to be.

So, the policy gradient theorem connects the gradient of the overall performance J(θ) to an expectation involving two terms: a direction that makes the chosen action more likely, weighted by how good the results that followed it were. A short sketch of the score term for one concrete policy parameterization appears below.
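As an illustration of the score term, here is a minimal sketch for one common choice of policy, a linear-softmax policy over discrete actions. The parameter shape and the feature vector φ(s) are assumptions made for this example; the theorem itself does not depend on them.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()        # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def score(theta, phi_s, action):
    """grad_theta log pi_theta(a|s) for a linear-softmax policy (assumed form).

    theta:  array of shape (num_actions, num_features)
    phi_s:  feature vector phi(s) of shape (num_features,)
    action: index of the action taken in state s
    """
    probs = softmax(theta @ phi_s)           # pi_theta(.|s)
    grad = -np.outer(probs, phi_s)           # -pi_theta(b|s) * phi(s) for every action b
    grad[action] += phi_s                    # +phi(s) for the action actually taken
    return grad                              # same shape as theta
```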
The Intuition:
The theorem tells us that to increase the expected return, we should adjust the policy parameters θ based on the actions taken and the returns received. Specifically:
- If the return Gt that followed an action was high, the update moves θ along ∇θ log πθ(a∣s), making that action more probable in that state.
- If the return Gt was low or negative, the update moves θ only weakly in that direction, or against it, making that action comparatively less probable.
The expectation Eπθ[⋅] means we average this effect over all the state-action pairs encountered while following the policy πθ. On average, the policy parameters shift to increase the probability of actions leading to good outcomes and decrease the probability of actions leading to bad outcomes.
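As a tiny numeric check of this intuition, consider a softmax policy over two actions in a single state, parameterized directly by one logit per action (a simplification made only to keep the example small). One update scaled by a positive return makes the rewarded action more probable:

```python
import numpy as np

theta = np.array([0.0, 0.0])                 # logits for actions 0 and 1
alpha, G_t, action = 0.1, 5.0, 0             # action 0 was followed by a high return

def pi(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

# For this direct parameterization, grad_theta log pi(a|s) = one_hot(a) - pi(.|s)
grad_log_pi = np.eye(2)[action] - pi(theta)

print("pi(a|s) before:", pi(theta)[action])        # 0.5
theta = theta + alpha * G_t * grad_log_pi          # gradient step scaled by the return
print("pi(a|s) after: ", pi(theta)[action])        # ~0.62: the rewarded action became likelier
```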
Why is this important?
The Policy Gradient Theorem is significant because it reformulates the gradient in a way that we can estimate using samples collected from the agent interacting with the environment. We don't need a model of the environment (i.e., the transition probabilities p(s′,r∣s,a)). We just need to run the policy, collect trajectories (sequences of states, actions, rewards), calculate the returns Gt for each step, compute ∇θlogπθ(a∣s) (which we can do since we defined the policy πθ), and average the products.
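Putting those steps together, here is a minimal sketch of such a sample-based gradient estimate. The trajectory format, the discount factor, and the grad_log_pi helper (a function returning ∇θ log πθ(a∣s) for whatever policy parameterization we chose, like the score function sketched earlier) are assumptions made for illustration.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """G_t for every time step of one trajectory (computed backwards)."""
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return list(reversed(out))

def estimate_policy_gradient(trajectories, grad_log_pi, theta, gamma=0.99):
    """Monte Carlo estimate of grad_theta J(theta) from sampled experience.

    trajectories: list of trajectories, each a list of (state, action, reward)
                  tuples collected by running the current policy (assumed format).
    grad_log_pi:  hypothetical helper (theta, state, action) -> array with the
                  same shape as theta, the score grad_theta log pi_theta(a|s).
    """
    grad = np.zeros_like(theta, dtype=float)
    num_steps = 0
    for traj in trajectories:
        rewards = [r for (_, _, r) in traj]
        for (s, a, _), G_t in zip(traj, discounted_returns(rewards, gamma)):
            grad += grad_log_pi(theta, s, a) * G_t   # score weighted by the return
            num_steps += 1
    return grad / max(num_steps, 1)                  # average over sampled steps
```

The resulting array can then be plugged into the gradient-ascent update from earlier; averaging over more trajectories reduces the variance of the estimate.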
This theorem forms the theoretical bedrock for a family of RL algorithms, including the REINFORCE algorithm, which we will discuss next. These algorithms directly learn the policy parameters by sampling experiences and applying updates based on this gradient estimation principle.