So far, our focus has been on value-based reinforcement learning methods. We learned to estimate the value of states (V(s)) or state-action pairs (Q(s,a)) and derived policies based on these values. This chapter introduces a distinct approach: policy gradient methods.
With policy gradient methods, we learn a parameterized policy, denoted πθ(a∣s), directly. Instead of estimating value functions first, we aim to optimize the policy parameters θ to maximize the expected return. This approach is particularly useful in environments with continuous action spaces or when we want to learn stochastic policies.
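To make the idea of a parameterized policy concrete, here is a minimal sketch (not the implementation used later in this chapter) of a stochastic policy πθ(a∣s) for a discrete action space, using a softmax over linear scores. The names softmax_policy and theta, and the choice of a linear feature representation, are illustrative assumptions.

```python
import numpy as np

# Sketch of a parameterized stochastic policy pi_theta(a|s) for a
# discrete action space. theta is a (num_features, num_actions) weight
# matrix, and the state is represented by a feature vector.

def softmax_policy(theta, state_features):
    """Return action probabilities pi_theta(. | s) as a softmax over linear scores."""
    scores = state_features @ theta        # one score per action
    scores -= scores.max()                 # subtract max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

# Example: 4 state features, 2 actions, randomly initialized parameters
rng = np.random.default_rng(0)
theta = rng.normal(size=(4, 2))
state_features = np.array([0.5, -0.2, 1.0, 0.3])

probs = softmax_policy(theta, state_features)
action = rng.choice(len(probs), p=probs)   # sample a ~ pi_theta(.|s)
print(probs, action)
```

Adjusting theta changes the action probabilities directly, which is exactly what policy gradient methods exploit: they nudge θ in the direction that increases the expected return.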
In this chapter, you will learn how to represent and optimize policies directly, the intuition behind the policy gradient theorem, how the REINFORCE algorithm uses sampled returns to update policy parameters, how baselines reduce the variance of gradient estimates, how actor-critic methods combine policy learning with value estimation, and how policy-based methods compare with the value-based methods covered so far.
We will cover the algorithms conceptually and guide you through implementing a basic REINFORCE agent.
8.1 Learning Policies Directly
8.2 Policy Gradient Theorem (Concept)
8.3 REINFORCE Algorithm
8.4 Baselines for Variance Reduction
8.5 Actor-Critic Methods Overview
8.6 Comparing Value-Based and Policy-Based Methods
8.7 Practice: Implementing REINFORCE