Bellman equations provide a recursive relationship for value functions. The Bellman optimality equations, in particular, define the best possible value function ($v_*$ or $q_*$). The central challenge is how to find that optimal value function and its corresponding optimal policy $\pi_*$.
Dynamic Programming (DP) offers a class of algorithms that can compute optimal policies, provided we have a complete model of the environment. This means we need to know the transition probabilities $p(s', r \mid s, a)$ for all states $s$, actions $a$, next states $s'$, and rewards $r$. This is a strong assumption, often not met in practice (which is why we'll study model-free methods later!), but DP provides a solid theoretical foundation.
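To make the model requirement concrete, here is one minimal way to represent $p(s', r \mid s, a)$ in Python. The variable names (`model`, `states`, `actions`) and the tiny three-state MDP are purely illustrative assumptions, not part of any standard library:

```python
# One possible in-memory representation of a known MDP model (illustrative).
# model[(s, a)] is a list of (probability, next_state, reward) tuples,
# i.e. an explicit enumeration of p(s', r | s, a).
model = {
    (0, "left"):  [(1.0, 0, 0.0)],                   # stay in state 0
    (0, "right"): [(0.9, 1, 0.0), (0.1, 0, 0.0)],    # usually reach state 1
    (1, "left"):  [(1.0, 0, 0.0)],
    (1, "right"): [(1.0, 2, 1.0)],                   # reaching state 2 pays +1
    (2, "left"):  [(1.0, 2, 0.0)],                   # state 2 is absorbing
    (2, "right"): [(1.0, 2, 0.0)],
}
states = [0, 1, 2]
actions = ["left", "right"]
```

The later sketches in this section reuse this layout.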
One fundamental DP method is Policy Iteration. It finds the optimal policy by alternating between two main steps: evaluating the current policy and then improving it. Let's break it down.
Imagine you start with some arbitrary policy, let's call it $\pi_0$. It might be completely random, or it might be based on some simple heuristic. Policy Iteration aims to iteratively refine this policy until it converges to the optimal one, $\pi_*$. The process looks like this:

$$\pi_0 \xrightarrow{\text{E}} v_{\pi_0} \xrightarrow{\text{I}} \pi_1 \xrightarrow{\text{E}} v_{\pi_1} \xrightarrow{\text{I}} \pi_2 \xrightarrow{\text{E}} \cdots \xrightarrow{\text{I}} \pi_* \xrightarrow{\text{E}} v_*$$

where E denotes a policy evaluation step and I denotes a policy improvement step.
This involves two core components repeated in a loop:

- Policy Evaluation: compute the state-value function $v_\pi$ for the current policy $\pi$.
- Policy Improvement: use $v_\pi$ to construct a better policy, typically by acting greedily with respect to it.
This cycle continues until the policy no longer changes between iterations, indicating that we've found the optimal policy.
In this step, we take the current policy $\pi$ and compute the state-value function $v_\pi(s)$ for all states $s$. Remember the Bellman expectation equation for $v_\pi$:

$$v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_\pi(s') \right]$$
For a finite MDP, this equation gives us a system of linear equations, one for each state. If there are $|\mathcal{S}|$ states, we have $|\mathcal{S}|$ equations and $|\mathcal{S}|$ unknowns ($v_\pi(s)$ for each $s$). We could solve this system directly.
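For a fixed policy, the system can be written in matrix form as $v_\pi = r_\pi + \gamma P_\pi v_\pi$, which a linear solver handles directly. The sketch below assumes a hypothetical three-state MDP with hand-written $P_\pi$ and $r_\pi$; the numbers are illustrative only (they happen to correspond to always choosing "right" in the toy model above):

```python
import numpy as np

gamma = 0.9

# P_pi[s, s'] : transition probabilities under the fixed policy pi (assumed values).
# r_pi[s]     : expected immediate reward in state s under pi (assumed values).
P_pi = np.array([[0.1, 0.9, 0.0],
                 [0.0, 0.0, 1.0],
                 [0.0, 0.0, 1.0]])
r_pi = np.array([0.0, 1.0, 0.0])

# Solve (I - gamma * P_pi) v = r_pi rather than inverting the matrix explicitly.
v_pi = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)
print(v_pi)  # exact v_pi for this small example
```

Direct solution costs roughly $O(|\mathcal{S}|^3)$, which is one reason the iterative approach below is usually preferred for larger state spaces.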
However, an often simpler and more generally applicable approach, especially within the iterative nature of policy iteration, is to compute $v_\pi$ iteratively. We start with an arbitrary value function $v_0$ (e.g., all zeros) and repeatedly apply the Bellman expectation equation as an update rule:

$$v_{k+1}(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_k(s') \right]$$
We sweep through all states, updating their values based on the values from the previous iteration ($v_k$). This process is guaranteed to converge to the true $v_\pi$ as $k \to \infty$. In practice, we stop when the changes in the value function become very small between iterations (i.e., when $\max_s |v_{k+1}(s) - v_k(s)| < \theta$ for some small threshold $\theta$). This iterative approach is essentially applying the Bellman operator repeatedly.
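Here is a short sketch of that iterative procedure in Python, using the illustrative model layout from earlier (the function and variable names are assumptions for this sketch, not from any particular library):

```python
def evaluate_policy(policy, model, states, gamma=0.9, theta=1e-6):
    """Iterative policy evaluation sketch for a deterministic policy.

    policy[s] is the action taken in state s, and model[(s, a)] is a list of
    (probability, next_state, reward) tuples, as in the earlier sketch.
    """
    V = {s: 0.0 for s in states}            # arbitrary initial values v_0
    while True:
        V_new = {}
        for s in states:
            a = policy[s]
            # Bellman expectation backup using the previous iterate v_k
            V_new[s] = sum(p * (r + gamma * V[s_next])
                           for p, s_next, r in model[(s, a)])
        # stop once the largest change across all states falls below theta
        if max(abs(V_new[s] - V[s]) for s in states) < theta:
            return V_new
        V = V_new
```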
Once we have a reasonably accurate estimate of $v_\pi$ for our current policy $\pi$, we ask: "Can we improve this policy?" The policy improvement theorem gives us a way to do this.
For a given state $s$, we can look ahead one step and consider the expected return if we choose an action $a$ (possibly different from $\pi(s)$) and then follow the existing policy $\pi$ thereafter. This is precisely the definition of the action-value function $q_\pi(s, a)$:

$$q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_\pi(s') \right]$$
Now, suppose for some state $s$, we find an action $a$ such that $q_\pi(s, a) > v_\pi(s)$. This means that taking action $a$ once in state $s$ and then following $\pi$ is better than just following $\pi$ from the start. The policy improvement theorem states that if we construct a new policy $\pi'$ that chooses this better action in state $s$ (and follows $\pi$ elsewhere, or better yet, acts greedily everywhere), the new policy will be at least as good as $\pi$ everywhere, and strictly better in state $s$.
The most common way to improve the policy is to make it greedy with respect to $v_\pi$. That is, for each state $s$, the new policy $\pi'$ deterministically chooses the action that maximizes $q_\pi(s, a)$:

$$\pi'(s) = \arg\max_{a} q_\pi(s, a) = \arg\max_{a} \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_\pi(s') \right]$$
This new greedy policy $\pi'$ is guaranteed to be at least as good as $\pi$. If $v_{\pi'} = v_\pi$, then we have reached the optimal value function $v_*$ and the optimal policy $\pi_*$, because we satisfy the Bellman optimality equation. If $v_{\pi'} \neq v_\pi$, then $\pi'$ is strictly better, and we loop back to evaluate this improved policy.
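A greedy improvement step can be sketched in the same style (again, the names and the model layout are the illustrative assumptions used above):

```python
def improve_policy(V, model, states, actions, gamma=0.9):
    """Return a deterministic policy that is greedy with respect to V."""
    new_policy = {}
    for s in states:
        # one-step lookahead: q(s, a) = sum_{s', r} p(s', r | s, a) [r + gamma V(s')]
        q_values = {a: sum(p * (r + gamma * V[s_next])
                           for p, s_next, r in model[(s, a)])
                    for a in actions}
        new_policy[s] = max(q_values, key=q_values.get)   # argmax over actions
    return new_policy
```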
Putting it together, the Policy Iteration algorithm proceeds as follows:
1. Initialization: choose an arbitrary value function $V(s)$ and an arbitrary policy $\pi(s)$ for all states $s$.
2. Policy Evaluation: repeatedly apply the Bellman expectation update until $V$ is a sufficiently accurate estimate of $v_\pi$ (changes fall below a small threshold $\theta$).
3. Policy Improvement:
   - Set policy_stable ← true.
   - For each state $s$: set old_action ← $\pi(s)$ (assuming a deterministic policy for simplicity), then set $\pi(s) \leftarrow \arg\max_a \sum_{s', r} p(s', r \mid s, a)\left[ r + \gamma V(s') \right]$. If old_action ≠ $\pi(s)$: set policy_stable ← false.
   - If policy_stable is true: Stop. The current policy and value function are optimal ($\pi \approx \pi_*$ and $V \approx v_*$). Otherwise, return to step 2.

Figure: The cycle of Policy Evaluation and Policy Improvement in Policy Iteration. The process continues until the policy no longer changes after the improvement step.
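Tying the two sketches together gives the full loop. This assumes the `evaluate_policy` and `improve_policy` helpers defined above, plus the illustrative `model`, `states`, and `actions` from the beginning of the section:

```python
def policy_iteration(model, states, actions, gamma=0.9):
    """Policy iteration sketch: alternate evaluation and greedy improvement."""
    policy = {s: actions[0] for s in states}        # arbitrary initial policy
    while True:
        V = evaluate_policy(policy, model, states, gamma)              # evaluation
        new_policy = improve_policy(V, model, states, actions, gamma)  # improvement
        if new_policy == policy:                    # policy is stable: done
            return policy, V
        policy = new_policy

# Example usage with the toy model:
# pi_star, v_star = policy_iteration(model, states, actions)
```

Because greedy tie-breaking here is deterministic (ties go to the first action in `actions`), checking whether the policy is unchanged is a safe stopping test for this sketch.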
Policy Iteration is guaranteed to converge to the optimal policy in a finite number of iterations for a finite MDP. This is because there are only a finite number of possible policies, and each policy improvement step results in a strictly better policy (unless it's already optimal). Since the value function associated with each policy increases or stays the same, and values are bounded, the process must terminate.
"The main drawback? Each step, especially policy evaluation, can be computationally expensive, requiring potentially many sweeps through the entire state space. Furthermore, it critically depends on having the complete MDP model (), which is often unavailable in practical problems. This limitation motivates the model-free methods we will explore in subsequent chapters."