In previous chapters, we examined value-based methods like Deep Q-Networks (DQN), which learn action values, and policy gradient methods like REINFORCE, which directly optimize a policy. Both approaches have strengths and weaknesses. Value-based methods can be sample-efficient but struggle with continuous action spaces. Policy gradient methods handle continuous actions naturally but often suffer from high variance in their gradient estimates.
This chapter introduces Actor-Critic methods, a family of algorithms that combine aspects of both approaches. You will learn how these methods use two components: an actor, which learns a parameterized policy for selecting actions, and a critic, which learns a value function to evaluate the states the actor visits.
We will examine how the critic's evaluations provide lower-variance learning signals for the actor, aiming for more stable and efficient training compared to pure policy gradient methods. We will study specific implementations like Advantage Actor-Critic (A2C) and its asynchronous variant (A3C), focusing on their architecture, update rules, and practical considerations. By the end of this chapter, you will understand the rationale behind Actor-Critic methods and how they address some limitations of earlier techniques.
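To make these ideas concrete before the detailed sections, here is a minimal sketch of a single A2C-style update on one transition. It is illustrative only: the PyTorch networks, their sizes, the learning rate, and the dummy transition values are assumptions for this example, not prescriptions from the chapter. The key pattern to notice is that the critic's temporal-difference error (the advantage) weights the actor's log-probability term, which is the lower-variance signal discussed above.

```python
import torch
import torch.nn as nn

# Illustrative dimensions and discount factor (assumed values).
obs_dim, n_actions, gamma = 4, 2, 0.99

# Actor outputs action logits; critic outputs a scalar state value.
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(
    list(actor.parameters()) + list(critic.parameters()), lr=3e-4
)

# One hypothetical transition (s, a, r, s', done) with dummy values.
state = torch.randn(1, obs_dim)
next_state = torch.randn(1, obs_dim)
reward, done = 1.0, False

# The actor samples an action from its policy distribution.
dist = torch.distributions.Categorical(logits=actor(state))
action = dist.sample()

# The critic estimates V(s); the TD target bootstraps from V(s').
value = critic(state).squeeze(-1)
with torch.no_grad():
    next_value = critic(next_state).squeeze(-1)
    td_target = reward + gamma * next_value * (0.0 if done else 1.0)

# Advantage = TD target - V(s): the critic's evaluation of the action.
advantage = td_target - value

# Actor loss weights the log-probability by the (detached) advantage;
# critic loss regresses V(s) toward the bootstrapped target.
actor_loss = -dist.log_prob(action) * advantage.detach()
critic_loss = advantage.pow(2)

optimizer.zero_grad()
(actor_loss + 0.5 * critic_loss).mean().backward()
optimizer.step()
```

In practice A2C collects short rollouts across parallel environments and averages these losses over a batch, but the structure of the update, an actor term scaled by an advantage plus a critic regression term, is the same one developed in the sections that follow.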
5.1 Combining Policy and Value Estimation
5.2 Actor-Critic Architecture Overview
5.3 Advantage Actor-Critic (A2C)
5.4 Asynchronous Advantage Actor-Critic (A3C)
5.5 Implementation Considerations for Actor-Critic
5.6 Comparison: REINFORCE vs A2C/A3C
5.7 Practice: Conceptualizing an A2C Implementation