So far, our focus has been on value-based reinforcement learning methods. We learned to estimate the value of states, V(s), or of state-action pairs, Q(s, a), and derived policies from these estimates. This chapter introduces a distinct approach: policy gradient methods.
With policy gradient methods, we learn a parameterized policy, denoted π(a | s, θ), directly. Instead of estimating value functions first, we optimize the policy parameters θ to maximize the expected return. This approach is particularly useful in environments with continuous action spaces or when we want to learn stochastic policies.
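To make the idea of a parameterized policy concrete, the following is a minimal sketch of a softmax policy whose action probabilities come from a linear function of state features. The feature size, action count, and helper names (`policy_probs`, `sample_action`) are illustrative assumptions, not code from this chapter; the point is only that θ fully determines how actions are chosen, so improving the policy means adjusting θ.

```python
import numpy as np

rng = np.random.default_rng(0)

n_features = 4   # assumed size of the state-feature vector (illustrative)
n_actions = 2    # assumed number of discrete actions (illustrative)

# Policy parameters theta: one column of weights per action.
theta = rng.normal(scale=0.01, size=(n_features, n_actions))

def policy_probs(state_features, theta):
    """Return pi(a | s, theta) as a probability vector over actions."""
    preferences = state_features @ theta      # one preference score per action
    preferences -= preferences.max()          # subtract max for numerical stability
    exp_prefs = np.exp(preferences)
    return exp_prefs / exp_prefs.sum()

def sample_action(state_features, theta):
    """Sample an action directly from the parameterized (stochastic) policy."""
    probs = policy_probs(state_features, theta)
    return rng.choice(len(probs), p=probs)

# Example: sample an action for a made-up state feature vector.
s = np.array([0.5, -0.2, 1.0, 0.3])
a = sample_action(s, theta)
print("action:", a, "probabilities:", policy_probs(s, theta))
```

Policy gradient methods such as REINFORCE, covered later in this chapter, update θ by gradient ascent on the expected return rather than by deriving the policy from a value function.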
In this chapter, we cover these algorithms conceptually and guide you through implementing a basic REINFORCE agent. The chapter is organized as follows:
8.1 Learning Policies Directly
8.2 Policy Gradient Theorem (Concept)
8.3 REINFORCE Algorithm
8.4 Baselines for Variance Reduction
8.5 Actor-Critic Methods Overview
8.6 Comparing Value-Based and Policy-Based Methods
8.7 Practice: Implementing REINFORCE