To build a solid foundation for advanced reinforcement learning methods, this chapter provides a focused review of core concepts. We begin by recapping the Markov Decision Process (MDP) framework. Key elements such as the Bellman equations for value functions ($V^{\pi}(s)$) and action-value functions ($Q^{\pi}(s, a)$) are revisited, along with the dynamic programming solutions (Value Iteration and Policy Iteration). We then summarize essential Temporal Difference (TD) learning algorithms, including Q-Learning and SARSA, and the fundamentals of Policy Gradient methods via the REINFORCE algorithm. Importantly, this chapter introduces the concept of function approximation, explaining why it is necessary for handling large state spaces and highlighting the potential instability, known as the 'Deadly Triad', that can arise when off-policy learning, function approximation, and bootstrapping are combined. This prepares you for the deep learning approaches used throughout the remainder of the course.
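As a quick refresher before moving on, the sketch below shows a single tabular Q-Learning update in plain Python. It is only an illustration of the TD and bootstrapping ideas mentioned above; the table shape, learning rate `alpha`, and discount factor `gamma` are assumed values, not part of this course's code.

```python
import numpy as np

def q_learning_update(Q, state, action, reward, next_state, alpha=0.1, gamma=0.99):
    """One tabular Q-Learning step: move Q(s, a) toward the TD target
    r + gamma * max_a' Q(s', a'). The update bootstraps from the current
    estimate of the next state's value, the same mechanism involved in
    the Deadly Triad discussion."""
    td_target = reward + gamma * np.max(Q[next_state])
    td_error = td_target - Q[state, action]
    Q[state, action] += alpha * td_error
    return Q

# Illustrative usage with a hypothetical 5-state, 2-action Q-table.
Q = np.zeros((5, 2))
Q = q_learning_update(Q, state=0, action=1, reward=1.0, next_state=2)
```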