In the previous sections, we introduced the Bellman expectation and optimality equations. These equations provide powerful recursive relationships that characterize the value functions $(V_\pi, Q_\pi)$ for a given policy $\pi$ and the optimal value functions $(V^*, Q^*)$ for the best possible policy $\pi^*$. For example, the Bellman expectation equation for $V_\pi$ is:
$$V_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma V_\pi(s')\big]$$

And the Bellman optimality equation for $V^*$ is:
$$V^*(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma V^*(s')\big]$$

While these equations define the value functions, they don't immediately give us the values themselves. Our goal now is to find methods to solve these equations, that is, to compute the actual numerical values of $V_\pi(s)$ or $V^*(s)$ for all states $s$.
Think about the Bellman expectation equation for $V_\pi$. If we know the policy $\pi$ and the environment dynamics $p(s', r \mid s, a)$, this equation defines a system of linear equations: one equation for each state, with the $|S|$ values $V_\pi(s)$ for all $s \in S$ as the unknowns. In principle, if $|S|$ is small, we could solve this system directly using standard linear algebra techniques such as matrix inversion. However, in many interesting problems the number of states is enormous (millions or even infinite), making a direct solution computationally infeasible; matrix inversion alone typically costs $O(|S|^3)$.
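As a concrete illustration, here is a minimal sketch of the direct approach for a tiny, made-up MDP. The transition tensor `P`, reward table `R`, policy `pi`, and discount `gamma` below are illustrative assumptions, not taken from any particular environment. In matrix form the Bellman expectation equation reads $V_\pi = r_\pi + \gamma P_\pi V_\pi$, so the sketch collapses the MDP under the policy and hands the linear system $(I - \gamma P_\pi)V_\pi = r_\pi$ to a standard solver.

```python
import numpy as np

# Illustrative 2-state, 2-action MDP; all numbers here are assumptions for the example.
gamma = 0.9
# P[a, s, s'] = probability of landing in s' after taking action a in state s
P = np.array([[[0.8, 0.2],
               [0.1, 0.9]],
              [[0.5, 0.5],
               [0.3, 0.7]]])
# R[a, s] = expected immediate reward for taking action a in state s
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
# pi[s, a] = probability the (fixed) policy picks action a in state s
pi = np.array([[0.6, 0.4],
               [0.3, 0.7]])

# Collapse the MDP under the policy:
# P_pi[s, s'] = sum_a pi[s, a] * P[a, s, s']   and   r_pi[s] = sum_a pi[s, a] * R[a, s]
P_pi = np.einsum('sa,ast->st', pi, P)
r_pi = np.einsum('sa,as->s', pi, R)

# The Bellman expectation equation in matrix form, V = r_pi + gamma * P_pi @ V,
# becomes the linear system (I - gamma * P_pi) V = r_pi, solved here directly.
V_pi = np.linalg.solve(np.eye(len(r_pi)) - gamma * P_pi, r_pi)
print(V_pi)  # exact state values for this policy
```

With only two states this is instant, but the same solve scales cubically with the number of states, which is exactly the limitation noted above.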
The Bellman optimality equation introduces a further challenge: the $\max$ operator makes the system non-linear. Standard linear equation solvers cannot be applied directly to find $V^*$.
Therefore, we typically turn to iterative methods. These approaches start with an initial guess for the value function and then repeatedly apply updates based on the Bellman equations, progressively refining the estimates until they converge to the true values.
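To make this concrete, the sketch below performs iterative policy evaluation, using the Bellman expectation equation as an update rule. It assumes the same array shapes as the illustrative `P`, `R`, `pi`, and `gamma` from the previous snippet.

```python
import numpy as np

def evaluate_policy(P, R, pi, gamma, tol=1e-8):
    """Repeatedly apply the Bellman expectation update until the values stop changing.

    P[a, s, s'], R[a, s], and pi[s, a] follow the shapes used in the previous sketch.
    """
    n_states = P.shape[1]
    V = np.zeros(n_states)                   # initial guess: all zeros
    while True:
        # One-step look-ahead for every (action, state) pair:
        # q[a, s] = R[a, s] + gamma * sum_s' P[a, s, s'] * V[s']
        q = R + gamma * (P @ V)
        # Average over actions according to the policy to get the new state values
        V_new = np.einsum('sa,as->s', pi, q)
        if np.max(np.abs(V_new - V)) < tol:  # stop once the estimates have converged
            return V_new
        V = V_new

# Usage with the illustrative arrays defined earlier:
# V_pi_iterative = evaluate_policy(P, R, pi, gamma)   # matches the direct solution above
```

Each pass over the states is only $O(|S|^2)$ per action, and the estimates converge geometrically toward the fixed point of the Bellman expectation equation, which is why iteration is preferred over direct inversion for large state spaces.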
The primary family of algorithms for solving Bellman equations when a complete model of the environment is known falls under the umbrella of Dynamic Programming (DP). DP methods use the structure of the Bellman equations to find exact solutions, assuming perfect knowledge of the dynamics $p(s', r \mid s, a)$, which specify both the transition probabilities and the rewards.
There are two main DP approaches we will examine:
Policy Iteration: This method alternates between two steps: policy evaluation, which computes $V_\pi$ for the current policy (for example, with the iterative procedure sketched above), and policy improvement, which updates the policy to act greedily with respect to the newly computed value function. Repeating these two steps converges to the optimal policy $\pi^*$.
Value Iteration: This method directly finds the optimal value function $V^*$ by iteratively applying the Bellman optimality equation as an update rule across all states. It combines aspects of policy evaluation and improvement into a single update step. Once the value function converges to $V^*$, the optimal policy $\pi^*$ can be extracted by acting greedily with respect to it; a minimal sketch follows the figure note below.
Figure: Iterative nature of Dynamic Programming methods for solving Bellman equations, requiring the MDP model as input.
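As a rough sketch of the second approach, the loop below repeatedly applies the Bellman optimality update (a max over actions instead of an average under a policy) and then reads off a greedy policy once the values have converged. It assumes the same illustrative `P`, `R`, and `gamma` conventions as the earlier snippets; policy iteration would instead alternate `evaluate_policy` from above with the same greedy step.

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Apply the Bellman optimality update until convergence, then extract a greedy policy.

    P[a, s, s'] and R[a, s] follow the shapes used in the earlier sketches.
    """
    n_states = P.shape[1]
    V = np.zeros(n_states)
    while True:
        # q[a, s] = R[a, s] + gamma * sum_s' P[a, s, s'] * V[s']
        q = R + gamma * (P @ V)
        V_new = q.max(axis=0)                # Bellman optimality update: best action per state
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    greedy_policy = q.argmax(axis=0)         # optimal action index for each state
    return V_new, greedy_policy

# Usage with the illustrative arrays defined earlier:
# V_star, pi_star = value_iteration(P, R, gamma)
```

Note that both routines need the full model `P` and `R` as input, which is exactly the requirement discussed next.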
These DP methods form the theoretical foundation for solving MDPs. The following sections will provide detailed explanations and algorithms for Policy Iteration and Value Iteration. Keep in mind their significant requirement: they need a complete and accurate model of the environment's dynamics. This assumption often doesn't hold in practice, which motivates the model-free Reinforcement Learning techniques discussed in later chapters.