In previous chapters, we focused on value-based reinforcement learning algorithms like Q-Learning and its deep learning extension, DQN. These methods operate by first learning an action-value function, Q(s,a), which estimates the expected return of taking action a in state s and following a particular policy thereafter. The policy itself is often derived implicitly from these Q-values, typically by selecting the action with the highest estimated value (a greedy policy) or using a variation like epsilon-greedy for exploration.
While powerful, especially in environments with discrete action spaces, this value-based approach encounters significant difficulties in certain scenarios. Understanding these limitations helps clarify why directly learning a policy, as we'll do with policy gradient methods, can be advantageous.
One major challenge arises when dealing with environments where actions are continuous. Consider controlling a robot arm where the action might be the precise torque to apply to a joint, or setting the throttle of a vehicle. Actions here are real numbers within a range, meaning there are infinitely many possible actions.
Value-based methods like DQN rely on finding the action that maximizes the Q-value for a given state:
a∗ = arg maxₐ Q(s, a)

In discrete action spaces, this maximization is straightforward: we simply compute the Q-value for each possible action (a finite set) and choose the action corresponding to the highest value. However, if the action space is continuous (e.g., a∈R), this maximization becomes a non-trivial optimization problem every single time the agent needs to select an action. Running an optimization routine inside the agent's decision loop can be computationally expensive and slow.
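To make the contrast concrete, here is a minimal sketch. The quadratic q_value function is just a stand-in for a learned Q-network, and the SciPy optimizer plays the role of the inner optimization a continuous argmax would require; none of these names come from a specific library's RL API.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Toy Q-function standing in for a learned network; for state s it peaks at a = 0.3 * s.
def q_value(state, action):
    return -(action - 0.3 * state) ** 2

state = 1.0

# Discrete action space: the argmax is a cheap lookup over a finite set.
discrete_actions = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
q_vals = [q_value(state, a) for a in discrete_actions]
best_discrete = discrete_actions[int(np.argmax(q_vals))]

# Continuous action space: the same argmax becomes a numerical optimization
# that has to run every time the agent selects an action.
result = minimize_scalar(lambda a: -q_value(state, a), bounds=(-1.0, 1.0), method="bounded")
best_continuous = result.x

print(best_discrete, best_continuous)  # 0.5 (nearest grid point) vs. ~0.3 (true maximizer)
```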
One common workaround is to discretize the continuous action space. For instance, instead of allowing any torque value between -1.0 and 1.0, we might only allow actions like {-1.0, -0.5, 0.0, 0.5, 1.0}. While this makes the argmax operation feasible again, it introduces its own set of problems: a coarse grid sacrifices the fine-grained control the task may require, while a finer grid makes the number of discrete actions explode, especially when the action has several dimensions (the curse of dimensionality).
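A quick back-of-the-envelope calculation shows that combinatorial growth; the bin count and the seven-joint arm are purely illustrative numbers.

```python
# Discretizing each action dimension into a handful of bins seems harmless,
# but the number of joint actions grows exponentially with the action dimension.
bins_per_dim = 5
for action_dims in (1, 3, 7):  # e.g., 7 torques for a 7-joint robot arm
    print(action_dims, bins_per_dim ** action_dims)
# 1 -> 5, 3 -> 125, 7 -> 78125 discrete actions for the argmax to scan
```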
Policy gradient methods, as we will see, handle continuous action spaces more naturally by directly outputting parameters for a probability distribution over actions (e.g., the mean and standard deviation of a Gaussian distribution) or directly outputting the continuous action itself.
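For intuition, here is a minimal PyTorch sketch of such a policy head. The layer sizes and the choice of a state-independent log standard deviation are illustrative design assumptions, not a prescribed implementation.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Minimal sketch: map a state to the mean and std of a Gaussian over actions."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh())
        self.mean_head = nn.Linear(hidden, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))  # state-independent log std

    def forward(self, state):
        mean = self.mean_head(self.body(state))
        return torch.distributions.Normal(mean, self.log_std.exp())

policy = GaussianPolicy(state_dim=4, action_dim=1)
dist = policy(torch.randn(4))
action = dist.sample()            # a continuous action, no argmax required
log_prob = dist.log_prob(action)  # quantity the policy gradient update will use later
```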
Value-based methods often lead to deterministic policies. Once the Q-values have converged, the greedy policy selects the single action with the highest Q-value in a given state. While epsilon-greedy introduces some randomness during exploration, the learned policy derived from the Q-function is typically deterministic.
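The familiar epsilon-greedy rule makes this concrete: apart from occasional exploration noise, the action chosen in a given state is always the same greedy one. A small sketch, with names chosen here for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon=0.1):
    """Sketch: explore with probability epsilon, otherwise act greedily on Q-values."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # occasional random exploratory action
    return int(np.argmax(q_values))              # otherwise always the same greedy action
```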
However, in some environments, the optimal policy is inherently stochastic. A classic example is the game Rock-Paper-Scissors. Playing a deterministic strategy (e.g., always playing Rock) makes you predictable and easily exploitable. The optimal strategy is probabilistic: play Rock, Paper, and Scissors with equal probability (1/3 each).
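A tiny calculation illustrates why the uniform mixture is safe: against every possible opponent move, its expected payoff is zero, so there is nothing for an adversary to exploit.

```python
import numpy as np

# Payoff matrix from the row player's perspective, ordered (Rock, Paper, Scissors):
# +1 = win, -1 = loss, 0 = draw.
payoff = np.array([[ 0, -1,  1],
                   [ 1,  0, -1],
                   [-1,  1,  0]])

uniform = np.array([1/3, 1/3, 1/3])
print(uniform @ payoff)  # [0. 0. 0.]: no pure counter-strategy gains anything
```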
Consider a simpler grid world scenario where two states look identical to the agent (aliased states), but require different actions to reach the goal efficiently. If the agent learns a deterministic policy, it might get stuck or perform poorly because it always takes the same action in both visually identical states. An optimal stochastic policy might involve randomly choosing between two actions in these aliased states, increasing the chance of success over time.
Value-based methods struggle to learn such stochastic policies directly. While techniques exist to derive stochastic policies from Q-values, it's often less natural than methods that explicitly parameterize and optimize a stochastic policy π(a∣s;θ).
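As a point of comparison, explicitly parameterizing π(a∣s;θ) can be as simple as a linear layer feeding a categorical distribution; the dimensions below are arbitrary, and this is only a sketch of the idea.

```python
import torch
import torch.nn as nn

class CategoricalPolicy(nn.Module):
    """Sketch of an explicitly parameterized stochastic policy pi(a|s; theta)."""
    def __init__(self, state_dim, num_actions):
        super().__init__()
        self.logits = nn.Linear(state_dim, num_actions)

    def forward(self, state):
        return torch.distributions.Categorical(logits=self.logits(state))

policy = CategoricalPolicy(state_dim=2, num_actions=2)
dist = policy(torch.tensor([1.0, 0.0]))
action = dist.sample()  # in an aliased state the learned probabilities can settle near 50/50
```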
In certain problems, the optimal policy itself might be much simpler to represent than its corresponding value function. Imagine an agent needing to balance a pole. The policy might be relatively simple: "if the pole tilts left, push left; if it tilts right, push right." However, the value of being in a specific state (pole angle, angular velocity, cart position, cart velocity) could be quite complex to estimate accurately, depending heavily on future states and actions under this policy.
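That rule fits in a few lines of code, whereas an accurate value function for the same task typically needs a trained function approximator. The sketch below assumes the Gymnasium CartPole-style observation and action conventions purely for illustration.

```python
def simple_pole_policy(observation):
    """Illustrative hand-written policy for a CartPole-style task.
    Assumed observation layout: [cart position, cart velocity, pole angle, pole angular velocity]."""
    _, _, pole_angle, _ = observation
    return 0 if pole_angle < 0 else 1  # 0 = push left, 1 = push right
```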
Learning the intricate details of the value function V(s) or Q(s,a) might require a very complex function approximator (a large neural network) and extensive training, even if the optimal behavior is governed by simpler rules. In such cases, directly learning the parameters θ of the policy π(a∣s;θ) might be more efficient and require a simpler model. Policy gradient methods offer a way to optimize the policy parameters directly, potentially bypassing the need to learn a complex value function first.
These limitations highlight situations where value-based methods may not be the most effective or efficient approach. This motivates the study of policy gradient methods, which optimize the policy directly, offering a powerful alternative for tackling a wider range of reinforcement learning problems, particularly those involving continuous actions or requiring stochastic policies. We will now proceed to understand the theoretical foundations and practical implementation of these techniques.