Previous chapters focused on value-based methods such as Deep Q-Networks (DQN), which learn the value Q(s,a) of taking each action in each state. While effective, these approaches can struggle in certain settings, for example environments with continuous action spaces or problems where a stochastic policy is inherently required.
This chapter introduces Policy Gradient methods, a fundamentally different strategy. Here, we directly learn a parameterized policy π(a∣s;θ) that selects actions, without relying on an intermediate value function estimate for action selection. We will begin by discussing the limitations of value-based methods that motivate this alternative approach.
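As a concrete illustration of direct parameterization (one common choice among several; the chapter does not commit to a specific form here), a policy over a discrete action set can be written as a softmax over action preferences h(s, a; θ):

$$\pi(a \mid s; \theta) = \frac{\exp\big(h(s, a; \theta)\big)}{\sum_{a'} \exp\big(h(s, a'; \theta)\big)},$$

where h might, for instance, be the output of a neural network with parameters θ.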
We will then examine the core idea behind policy gradients: adjusting the policy parameters θ to maximize expected return. This involves understanding the Policy Gradient Theorem, the theoretical basis for these methods. You will learn to implement the REINFORCE algorithm, a foundational Monte Carlo policy gradient technique. We will also address the challenge of high variance often encountered with REINFORCE and introduce the concept of using baselines to improve stability and convergence speed. The chapter concludes with a practical implementation exercise for the REINFORCE algorithm.
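For orientation, the objective being maximized and the gradient estimate at the heart of REINFORCE can be stated compactly (a standard episodic formulation; the full derivation appears in Section 4.3):

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[ G_0 \big], \qquad G_t = \sum_{k=t}^{T-1} \gamma^{\,k-t}\, r_{k+1},$$

$$\nabla_\theta J(\theta) \propto \mathbb{E}_{\pi_\theta}\big[ G_t \,\nabla_\theta \ln \pi(A_t \mid S_t; \theta) \big].$$

Subtracting a baseline b(S_t) from G_t leaves this expectation unchanged while reducing its variance, which is the idea developed in Section 4.6.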
4.1 Limitations of Value-Based Methods
4.2 Direct Policy Parameterization
4.3 The Policy Gradient Theorem
4.4 REINFORCE Algorithm
4.5 Understanding Variance in Policy Gradients
4.6 Baselines for Variance Reduction
4.7 Hands-on Practical: Implementing REINFORCE