Adjusting the parameters $\theta$ of a policy $\pi_\theta$ aims to make an agent perform better, specifically to maximize the expected total return. An objective function $J(\theta)$ represents this expected return, often defined as the expected return starting from some initial state distribution. The objective is to find the parameters $\theta$ that maximize $J(\theta)$.
How do we achieve this maximization? If our objective function had a simple closed form, we might solve for the maximum directly. But here, $J(\theta)$ depends in a complicated way on the interaction between the policy $\pi_\theta$ and the environment dynamics over potentially long sequences of states, actions, and rewards. A common approach for optimizing functions like this is gradient ascent. We want to update our parameters in the direction that increases $J(\theta)$ the most:

$$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$$

Here, $\alpha$ is a learning rate, and $\nabla_\theta J(\theta)$ is the gradient of the objective function with respect to the policy parameters. The challenge lies in calculating this gradient, $\nabla_\theta J(\theta)$. How does changing the policy parameters $\theta$ influence the total expected future reward, given that this reward depends on the sequence of states visited and actions taken, which themselves depend on $\theta$?
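To make the update rule concrete, here is a minimal sketch of a single gradient-ascent step. The function name and the placeholder gradient values are illustrative only; the rest of this section is about how such a gradient estimate can actually be obtained from experience.

```python
import numpy as np

def gradient_ascent_step(theta, grad_estimate, learning_rate=0.1):
    """One gradient ascent update: theta <- theta + alpha * grad_J(theta)."""
    return theta + learning_rate * grad_estimate

# Hypothetical usage: in practice grad_estimate would come from sampled experience.
theta = np.zeros(4)                                # policy parameters
grad_estimate = np.array([0.5, -0.2, 0.1, 0.0])    # placeholder estimate of grad_J(theta)
theta = gradient_ascent_step(theta, grad_estimate)
print(theta)                                       # parameters nudged toward higher expected return
```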
This is where the Policy Gradient Theorem comes into play. It provides a fundamental insight and a practical way to compute or estimate this gradient without needing to know the environment's dynamics (like the state transition probabilities). The theorem establishes an analytical expression for the gradient that relates it directly to the policy and the value associated with taking actions.
While the full derivation involves some calculus, the core result (in one common form) tells us that the gradient is proportional to an expectation:

$$\nabla_\theta J(\theta) \propto \mathbb{E}_{\pi_\theta}\left[ G_t \, \nabla_\theta \log \pi_\theta(A_t \mid S_t) \right]$$
Let's break down the components inside the expectation $\mathbb{E}_{\pi_\theta}[\cdot]$, which is taken over trajectories generated by following the current policy $\pi_\theta$:

- $\nabla_\theta \log \pi_\theta(A_t \mid S_t)$: the gradient of the log-probability of the action $A_t$ actually taken in state $S_t$. It points in the direction in parameter space that makes that action more likely, and it is computable because we defined the policy $\pi_\theta$ ourselves.
- $G_t$: the return, the cumulative reward collected from time step $t$ onward, which measures how good the outcome following that action turned out to be.

So, the policy gradient theorem connects the gradient of the overall performance $J(\theta)$ to an expectation involving two terms: how sensitive the chosen action's log-probability is to the parameters, and how good the return following that action was.
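As a concrete example of the first term, the sketch below computes the score function $\nabla_\theta \log \pi_\theta(a \mid s)$ for one possible policy parameterization, a linear-softmax policy over discrete actions. The parameterization is an assumption for illustration; any differentiable policy would do.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()            # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def log_policy_grad(theta, state_features, action):
    """Score function grad_theta log pi_theta(a | s) for an assumed linear-softmax policy.

    theta:          (num_actions, feature_dim) parameter matrix
    state_features: (feature_dim,) feature vector x(s)
    action:         index of the action taken
    """
    probs = softmax(theta @ state_features)     # pi_theta(. | s)
    grad = -np.outer(probs, state_features)     # -pi_theta(b | s) * x(s) for every action b
    grad[action] += state_features              # +x(s) for the action actually taken
    return grad                                 # same shape as theta

# Example: 3 actions, 4 state features
theta = np.random.randn(3, 4) * 0.1
x_s = np.array([1.0, 0.5, -0.2, 0.3])
score = log_policy_grad(theta, x_s, action=1)
```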
The Intuition:
The theorem tells us that to increase the expected return, we should adjust the policy parameters based on the actions taken and the returns received. Specifically:

- If the return $G_t$ following an action was high, the update moves $\theta$ in the direction of $\nabla_\theta \log \pi_\theta(A_t \mid S_t)$, making that action more probable in that state.
- If the return $G_t$ was low (or negative), the update moves $\theta$ against that direction, making the action less probable.

The expectation $\mathbb{E}_{\pi_\theta}[\cdot]$ means we average this effect over all the state-action pairs encountered while following the policy $\pi_\theta$. On average, the policy parameters shift to increase the probability of actions leading to good outcomes and decrease the probability of actions leading to bad outcomes.
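A tiny numerical check of this intuition, using an assumed tabular softmax policy in a single state: weighting the score by a positive return raises the probability of the sampled action, while a negative return lowers it.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()        # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Tabular softmax policy over 3 actions in a single state (purely illustrative).
theta = np.zeros(3)                  # action preferences; all actions start equally likely
action, alpha = 0, 0.5               # the action that was taken, and a step size

for G in (+1.0, -1.0):               # a good return, then a bad one
    score = -softmax(theta)          # grad_theta log pi(action) = one_hot(action) - pi(.)
    score[action] += 1.0
    updated = theta + alpha * G * score        # return-weighted score update
    print(G, softmax(updated)[action])         # probability of `action` rises for G > 0, falls for G < 0
```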
Why is this important?
The Policy Gradient Theorem is significant because it reformulates the gradient in a way that we can estimate using samples collected from the agent interacting with the environment. We don't need a model of the environment (i.e., the transition probabilities $p(s' \mid s, a)$). We just need to run the policy, collect trajectories (sequences of states, actions, and rewards), calculate the return $G_t$ for each step, compute $\nabla_\theta \log \pi_\theta(A_t \mid S_t)$ (which we can do since we defined the policy $\pi_\theta$), and average the products.
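Putting this into code, a sample-based estimate of the gradient needs only the collected trajectories and the score function of our own policy. The sketch below is a minimal version; the `log_policy_grad` callable, the `(state, action, reward)` tuple layout, and the discount factor `gamma` are assumptions for illustration rather than a fixed interface.

```python
import numpy as np

def compute_returns(rewards, gamma=0.99):
    """Compute G_t for every step of one episode (discounted sum of future rewards)."""
    G, returns = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    return list(reversed(returns))

def estimate_policy_gradient(episodes, log_policy_grad, gamma=0.99):
    """Sample-based estimate of grad_J(theta): average of G_t * grad_theta log pi_theta(A_t | S_t).

    episodes:        list of episodes, each a list of (state, action, reward) tuples
                     collected by running the current policy.
    log_policy_grad: callable (state, action) -> grad_theta log pi_theta(action | state),
                     assumed to be provided by the chosen policy parameterization.
    """
    samples = []
    for episode in episodes:
        rewards = [r for (_, _, r) in episode]
        returns = compute_returns(rewards, gamma)
        for (s, a, _), G_t in zip(episode, returns):
            samples.append(G_t * log_policy_grad(s, a))
    return np.mean(samples, axis=0)   # Monte Carlo estimate of the gradient
```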
This theorem forms the theoretical foundation for a family of RL algorithms, including the REINFORCE algorithm, which we will discuss next. These algorithms directly learn the policy parameters by sampling experiences and applying updates based on this gradient estimation principle.