Basic policy gradient methods, like REINFORCE, often suffer from high variance in their gradient estimates, leading to slow or unstable learning. This chapter introduces Actor-Critic methods, a family of algorithms designed to address this limitation. The core idea is to maintain two components: an actor that learns the policy π(a∣s) and a critic that learns a value function (like V(s) or Q(s,a)) to evaluate the actor's actions and provide lower-variance gradient signals.
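To make the actor/critic split concrete before diving into the sections below, here is a minimal sketch of how the two components are commonly structured in PyTorch. The ActorCritic class, its layer sizes, and the update helper are illustrative assumptions for a small discrete-action task, not a reference implementation from this chapter.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class ActorCritic(nn.Module):
    """Illustrative actor-critic network: a shared trunk with two heads."""
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        # Shared feature extractor used by both heads (an assumed design choice).
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        # Actor head: logits that define the policy pi(a|s).
        self.actor = nn.Linear(hidden, n_actions)
        # Critic head: a scalar estimate of the state value V(s).
        self.critic = nn.Linear(hidden, 1)

    def forward(self, obs):
        h = self.body(obs)
        return Categorical(logits=self.actor(h)), self.critic(h).squeeze(-1)

def update(model, optimizer, obs, action, ret):
    """One update on a batch: obs [B, obs_dim], action [B], ret [B] (observed returns)."""
    dist, value = model(obs)
    # The critic's estimate acts as a baseline, so the actor is trained on the
    # lower-variance advantage (return - V(s)) rather than the raw return.
    advantage = ret - value.detach()
    actor_loss = -(dist.log_prob(action) * advantage).mean()
    # The critic regresses its value estimate toward the observed return.
    critic_loss = (ret - value).pow(2).mean()
    loss = actor_loss + 0.5 * critic_loss  # 0.5 is an assumed weighting
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Later sections refine exactly how the critic's signal is formed (baselines, advantages, GAE) and how the actor's update is constrained (trust regions, clipping), but this two-head structure is the common starting point.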
You will study several key advancements building on this framework, covered in the sections that follow: variance reduction with baselines, Advantage Actor-Critic (A2C) and its asynchronous variant A3C, Generalized Advantage Estimation (GAE), Deep Deterministic Policy Gradient (DDPG), Trust Region Policy Optimization (TRPO), Proximal Policy Optimization (PPO), and Soft Actor-Critic (SAC).
By the end of this chapter, you will understand the theory behind these advanced algorithms and be prepared to implement them to solve more complex reinforcement learning problems.
3.1 Challenges in Basic Policy Gradients
3.2 Actor-Critic Architecture Fundamentals
3.3 Baselines for Variance Reduction
3.4 Advantage Actor-Critic (A2C) and A3C
3.5 Generalized Advantage Estimation (GAE)
3.6 Deep Deterministic Policy Gradient (DDPG)
3.7 Trust Region Policy Optimization (TRPO)
3.8 Proximal Policy Optimization (PPO)
3.9 Soft Actor-Critic (SAC)
3.10 Actor-Critic Methods Implementation Practice