As we discussed, value-based methods like DQN, while powerful, have limitations. They primarily focus on learning accurate action-value estimates, Q(s,a), and then derive a policy (often ϵ-greedy) from these values. This indirect approach can become challenging in environments with vast, continuous action spaces, where finding the maximum Q-value involves an optimization step at each decision point. Furthermore, sometimes the optimal policy itself is inherently stochastic, meaning the agent should choose actions randomly according to a specific probability distribution, which deterministic policies derived from Q-values struggle to represent naturally.
Policy gradient methods offer a different path. Instead of learning values first, we directly learn the policy itself. We introduce a function, the policy function (or simply, the policy), that maps states to actions (or probabilities over actions). This function has its own set of parameters, which we'll denote collectively as θ. Our goal is to adjust these parameters θ to produce the best possible behavior.
We represent this parameterized policy as π(a∣s;θ). This notation signifies the probability (or probability density for continuous actions) of selecting action a when in state s, given the policy parameters θ.
How do we actually represent this function π(a∣s;θ)? Just like we used neural networks to approximate Q-values in DQN, we commonly use neural networks to represent the policy.
The weights and biases of this neural network constitute the parameters θ. Learning the policy means finding the values of θ that lead to the agent achieving the highest possible cumulative reward.
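As a concrete illustration, here is a minimal sketch of a policy network for an environment with a discrete action space, written in PyTorch. The state dimension, number of actions, and layer sizes are illustrative assumptions rather than values from any particular environment.

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Maps a state vector to a probability distribution over discrete actions.
    The weights and biases of the layers below collectively form the parameters theta."""

    def __init__(self, state_dim: int, n_actions: int, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # Softmax turns raw scores (logits) into action probabilities pi(a|s; theta)
        logits = self.net(state)
        return torch.softmax(logits, dim=-1)

# Illustrative usage: a 4-dimensional state and 2 possible actions
policy = PolicyNetwork(state_dim=4, n_actions=2)
state = torch.randn(1, 4)                       # placeholder state
action_probs = policy(state)                    # pi(. | s; theta)
action = torch.distributions.Categorical(action_probs).sample()
```

Acting with this policy simply means sampling from the output distribution; updating θ changes the distribution the network produces.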
Diagram: a policy network takes a state as input and outputs the parameters that define the action selection strategy. The network's weights and biases are the parameters θ.
By parameterizing the policy directly, we shift the optimization problem. We are no longer trying to accurately estimate Q-values. Instead, we are searching for the policy parameters θ that maximize an objective function J(θ), which typically represents the expected total discounted return obtained by following policy π(a∣s;θ).
$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \gamma^t R(s_t, a_t)\right]$$

Here, τ represents a trajectory (a sequence of states and actions) generated by following the policy πθ, γ is the discount factor, and R(st, at) is the reward received at time step t. Our goal is to find the θ that makes J(θ) as large as possible.
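To make this objective concrete, a simple Monte Carlo estimate of J(θ) averages the discounted return over trajectories sampled by following the current policy. The sketch below assumes a hypothetical `sample_trajectory` function that runs one episode under π(a∣s;θ) and returns its reward sequence.

```python
def discounted_return(rewards, gamma=0.99):
    """Compute sum_t gamma^t * R(s_t, a_t) for a single trajectory."""
    g = 0.0
    for t, r in enumerate(rewards):
        g += (gamma ** t) * r
    return g

def estimate_objective(sample_trajectory, num_trajectories=100, gamma=0.99):
    """Monte Carlo estimate of J(theta): the average discounted return over
    trajectories generated by the current policy pi(a|s; theta).
    `sample_trajectory` is a hypothetical callable returning one episode's rewards."""
    returns = [discounted_return(sample_trajectory(), gamma)
               for _ in range(num_trajectories)]
    return sum(returns) / num_trajectories
```

This is only an evaluation of the objective; how to compute its gradient with respect to θ is the subject of the Policy Gradient Theorem discussed next.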
This direct parameterization offers several advantages:
- It handles large or continuous action spaces naturally, because selecting an action means sampling from π(a∣s;θ) rather than performing the explicit max over Q-values operation required in Q-learning (see the sketch after this list).
- It can represent stochastic policies directly, since the policy itself outputs a probability distribution over actions.
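For instance, in a continuous action space the network can output the mean and standard deviation of a Gaussian from which actions are sampled, so no maximization over actions is ever required. The sketch below assumes a 1-dimensional continuous action; the class and dimension choices are illustrative.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Outputs a Gaussian distribution over a continuous action.
    Sampling from it yields a stochastic policy with no max over actions."""

    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden_dim), nn.Tanh())
        self.mean_head = nn.Linear(hidden_dim, action_dim)
        # Learned, state-independent log standard deviation (a common simplification)
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state: torch.Tensor) -> torch.distributions.Normal:
        mean = self.mean_head(self.body(state))
        return torch.distributions.Normal(mean, self.log_std.exp())

# Sampling an action is a draw from the distribution, not an argmax over Q-values
policy = GaussianPolicy(state_dim=3, action_dim=1)
dist = policy(torch.randn(1, 3))
action = dist.sample()
```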
Now that we understand how to represent the policy using parameters θ, the next step is to determine how to adjust these parameters effectively to improve the agent's performance. This leads us to the core mathematical foundation of these methods: the Policy Gradient Theorem.