In the previous chapter, we formalized sequential decision-making problems using Markov Decision Processes (MDPs). We defined states, actions, rewards, and the goal of maximizing cumulative future rewards (return). A central idea was the concept of value functions, which quantify how good it is for an agent to be in a particular state (Vπ(s)) or to take a specific action in a state (Qπ(s,a)) while following a policy π.
This chapter focuses on methods for calculating these value functions. We will introduce the Bellman equations, which express the value of a state or state-action pair in terms of the expected values of successor states. These equations form the basis for many RL algorithms.
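As a brief preview (Section 3.1 develops this step by step), the Bellman expectation equation for the state-value function can be written as follows, assuming the common convention of a reward function R(s, a, s'), a discount factor γ, and transition probabilities P(s' | s, a) from the MDP definition:

\[
V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\,\bigl[\, R(s, a, s') + \gamma\, V^{\pi}(s') \,\bigr]
\]

In words, the value of a state under π is the expected immediate reward plus the discounted value of the next state, averaged over the policy's action choices and the environment's transitions. An analogous equation holds for the action-value function Qπ(s, a).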
Specifically, you will learn about:

3.1 The Bellman Expectation Equation
3.2 The Bellman Optimality Equation
3.3 Solving Bellman Equations (Overview)
3.4 Dynamic Programming: Policy Iteration
3.5 Dynamic Programming: Value Iteration
3.6 Limitations of Dynamic Programming

By the end of this chapter, you will understand how optimal value functions and policies can, in principle, be computed when the environment's dynamics are fully known.