In the preceding chapters, we explored methods centered on learning the value of being in a particular state or taking a specific action in a state. Algorithms like Q-learning and SARSA first estimate value functions (like $Q(s, a)$) and then use these estimates to determine the best actions, often by selecting the action with the highest estimated value. This approach is powerful but faces challenges, especially when dealing with continuous action spaces or when the optimal behavior itself is inherently random (stochastic).
Policy gradient methods offer an alternative perspective. Instead of learning value functions, we directly learn the policy itself. We define the policy as a parameterized function, denoted $\pi_\theta(a \mid s)$, which outputs the probability of taking action $a$ in state $s$. The parameters of this function are represented by $\theta$. For example, $\theta$ could be the weights and biases of a neural network.
The core idea is to adjust the parameters $\theta$ directly to improve the quality of the policy. But what does "improve" mean? In reinforcement learning, our objective is typically to maximize the expected total discounted return. We can define a performance measure, often denoted $J(\theta)$, which represents this expected return achieved by following policy $\pi_\theta$. For episodic tasks, this is often the value of the starting state:
$$J(\theta) = V^{\pi_\theta}(s_0) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{T} \gamma^t R_{t+1} \,\middle|\, S_0 = s_0\right]$$

Our goal becomes finding the parameters $\theta$ that maximize this performance measure $J(\theta)$.
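To make this objective concrete, the following minimal Python sketch estimates $J(\theta)$ by averaging discounted returns over episodes generated with the current policy. It assumes a Gymnasium-style environment and a hypothetical `policy` callable that samples an action from $\pi_\theta(\cdot \mid s)$; neither is specified in the text.

```python
def estimate_performance(env, policy, num_episodes=100, gamma=0.99):
    """Monte Carlo estimate of J(theta): the average discounted return
    obtained by following the current policy from the start state."""
    returns = []
    for _ in range(num_episodes):
        state, _ = env.reset()
        episode_return, discount, done = 0.0, 1.0, False
        while not done:
            action = policy(state)  # sample a ~ pi_theta(. | state)
            state, reward, terminated, truncated, _ = env.step(action)
            episode_return += discount * reward  # accumulate gamma^t * R_{t+1}
            discount *= gamma
            done = terminated or truncated
        returns.append(episode_return)
    return sum(returns) / len(returns)  # approximates J(theta)
```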
Parameterizing the policy directly offers several advantages:

Stochastic policies: $\pi_\theta$ outputs action probabilities, so it can represent behavior that is optimally random rather than always committing to a single greedy action.

Continuous action spaces: we can sample actions from a parameterized distribution instead of searching for the action with the highest estimated value, a search that becomes expensive or impossible when the action space is continuous.

Smoother updates: small changes to $\theta$ produce small changes in the action probabilities, whereas in value-based methods a tiny change in an estimated value can abruptly flip the greedy action.
Since we have a parameterized function $\pi_\theta$ and an objective function $J(\theta)$ that we want to maximize, this is an optimization problem. We can use techniques similar to those used in supervised learning, but instead of gradient descent (to minimize a loss), we use gradient ascent (to maximize performance).
The basic update rule for the policy parameters $\theta$ is:
$$\theta_{k+1} = \theta_k + \alpha \nabla_\theta J(\theta_k)$$

Here:

$\theta_k$ are the policy parameters at iteration $k$.

$\alpha$ is the learning rate (step size), controlling how far we move along the gradient.

$\nabla_\theta J(\theta_k)$ is the gradient of the performance measure with respect to the parameters, evaluated at $\theta_k$.
By repeatedly calculating the gradient and taking steps in that direction, we iteratively improve the policy parameters $\theta$ to achieve higher expected returns.
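As a sketch of this outer loop, assuming a hypothetical `estimate_policy_gradient` function that returns an estimate of $\nabla_\theta J(\theta)$ (how to obtain such an estimate is exactly what the rest of this chapter develops), the update is plain gradient ascent on the parameter vector:

```python
import numpy as np

def gradient_ascent(theta, estimate_policy_gradient, alpha=0.01, num_iterations=1000):
    """Vanilla gradient ascent on J(theta): theta_{k+1} = theta_k + alpha * grad J(theta_k)."""
    for _ in range(num_iterations):
        grad_estimate = estimate_policy_gradient(theta)  # approximates grad_theta J(theta_k)
        theta = theta + alpha * grad_estimate            # step in the direction of increasing J
    return theta

# Toy check: maximize J(theta) = -(theta - 3)^2, whose exact gradient is -2 * (theta - 3).
theta = gradient_ascent(np.array([0.0]), lambda th: -2.0 * (th - 3.0))
print(theta)  # converges toward [3.0], the maximizer of J
```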
How do we actually represent $\pi_\theta(a \mid s)$? The choice depends on the nature of the action space.
Discrete Actions: A common approach is to use a function approximator (like a neural network) that takes the state $s$ as input and outputs a score or preference $h(s, a, \theta)$ for each possible action $a$. These preferences can then be converted into probabilities using a softmax function:
$$\pi_\theta(a \mid s) = \frac{e^{h(s, a, \theta)}}{\sum_{a'} e^{h(s, a', \theta)}}$$

The parameters $\theta$ are the parameters of the function approximator (e.g., the network weights).
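As an illustration, here is a minimal PyTorch sketch of this parameterization. The architecture, layer sizes, and names (`SoftmaxPolicy`, `hidden_dim`, and so on) are placeholder choices, not prescribed by the text; the point is that the network outputs one preference $h(s, a, \theta)$ per action and the softmax turns these into $\pi_\theta(a \mid s)$.

```python
import torch
import torch.nn as nn

class SoftmaxPolicy(nn.Module):
    """pi_theta(a|s) for a discrete action space: softmax over action preferences."""

    def __init__(self, state_dim, num_actions, hidden_dim=64):
        super().__init__()
        # theta = all the weights and biases of this network
        self.preferences = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, num_actions),  # h(s, a, theta) for each action a
        )

    def forward(self, state):
        h = self.preferences(state)                       # action preferences
        return torch.distributions.Categorical(logits=h)  # softmax is applied internally

# Usage: sample an action and keep its log-probability (needed later for gradient estimates)
policy = SoftmaxPolicy(state_dim=4, num_actions=2)
dist = policy(torch.randn(4))
action = dist.sample()
log_prob = dist.log_prob(action)  # log pi_theta(a|s)
```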
Continuous Actions: A popular method is to have the function approximator output the parameters of a probability distribution. For instance, for a single continuous action, the network might output the mean $\mu_\theta(s)$ and standard deviation $\sigma_\theta(s)$ of a Gaussian distribution. Actions are then sampled from this distribution:
$$a \sim \mathcal{N}\big(\mu_\theta(s), \sigma_\theta(s)^2\big)$$

The policy probability density is then given by the Gaussian probability density function (PDF). The parameters $\theta$ are again the parameters of the network determining $\mu$ and $\sigma$.
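A matching sketch for the continuous case (again with placeholder names and layer sizes) has the network output $\mu_\theta(s)$ and $\sigma_\theta(s)$ and builds a Normal distribution to sample actions from; the log-standard-deviation head is exponentiated simply to keep $\sigma_\theta(s)$ positive.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """pi_theta(a|s) for a single continuous action: a ~ N(mu_theta(s), sigma_theta(s)^2)."""

    def __init__(self, state_dim, hidden_dim=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden_dim), nn.Tanh())
        self.mean_head = nn.Linear(hidden_dim, 1)     # outputs mu_theta(s)
        self.log_std_head = nn.Linear(hidden_dim, 1)  # outputs log sigma_theta(s)

    def forward(self, state):
        features = self.body(state)
        mu = self.mean_head(features)
        sigma = self.log_std_head(features).exp()     # exp keeps the standard deviation positive
        return torch.distributions.Normal(mu, sigma)

# Usage: sample a continuous action and evaluate its log-density under the policy
policy = GaussianPolicy(state_dim=3)
dist = policy(torch.randn(3))
action = dist.sample()
log_prob = dist.log_prob(action)  # log of the Gaussian PDF evaluated at the sampled action
```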
The main challenge, and the focus of much of this chapter, lies in estimating the gradient $\nabla_\theta J(\theta)$. Unlike supervised learning, where we have explicit labels, in RL the "correctness" of an action is only revealed through the subsequent rewards received over potentially long trajectories. The Policy Gradient Theorem, which we introduce next, provides a theoretical foundation for calculating or estimating this crucial gradient.