Okay, we've seen how Policy Iteration finds the optimal policy π∗ by iterating between evaluating the current policy and improving it. Now, let's look at another dynamic programming algorithm called Value Iteration (VI). Value Iteration provides a different, often more computationally efficient, way to find the optimal value function V∗ directly, bypassing the need for explicit policy evaluation steps within the main loop.
Recall the Bellman Optimality Equation for V∗(s):
$$
V^*(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma V^*(s') \right]
$$

This equation states that the optimal value of a state s is the expected return achieved by taking the best possible action a from that state and then following the optimal policy thereafter.
Value Iteration leverages this equation directly as an update rule. It starts with an initial guess for the value function, V0(s), for all states (often initialized to zero). Then, it iteratively refines this estimate by applying the Bellman optimality update across all states until the value function converges.
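Concretely, turning the Bellman Optimality Equation into an assignment gives the update applied at iteration k, for every state s:

$$
V_{k+1}(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma V_k(s') \right]
$$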
Here's the process step-by-step:

1. Initialize V0(s) for all states s, commonly to zero.
2. Sweep through the state space: for each state s, apply the Bellman optimality update above, using the current estimates Vk(s′) for the successor states.
3. Track the largest change Δ made to any state's value during the sweep.
4. Repeat the sweep until Δ falls below a small threshold θ, at which point the value function has effectively converged.
Essentially, each iteration involves a sweep through the state space, updating each state's value based on the current values of its potential successor states, using the maximization over actions from the Bellman Optimality Equation. This update is sometimes called a "Bellman backup" or "optimality backup".
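To make the sweep concrete, here is a minimal Python sketch. It assumes a tabular MDP whose model is supplied as a dictionary P mapping each (state, action) pair to a list of (probability, next_state, reward) tuples; this representation, the function name, and the default values for gamma and theta are illustrative choices rather than a fixed API.

```python
def value_iteration(P, states, actions, gamma=0.9, theta=1e-6):
    """Sketch of Value Iteration for a finite MDP.

    P[(s, a)] is assumed to be a list of (prob, next_state, reward)
    tuples representing the model p(s', r | s, a).
    """
    V = {s: 0.0 for s in states}           # V_0: initialize all values to zero
    while True:
        V_new = {}
        delta = 0.0                        # largest change seen in this sweep
        for s in states:
            # Bellman optimality backup: best expected return over actions,
            # using the previous sweep's values V_k for successor states.
            V_new[s] = max(
                sum(prob * (reward + gamma * V[s_next])
                    for prob, s_next, reward in P[(s, a)])
                for a in actions
            )
            delta = max(delta, abs(V_new[s] - V[s]))
        V = V_new
        if delta < theta:                  # converged: changes are negligible
            return V
```

A common variant updates V in place during the sweep instead of building V_new; this also converges and often does so in fewer sweeps.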
The backup diagram for this update shows how the value V_{k+1}(s) is calculated from the maximum expected future return obtainable from state s, using the values V_k(s′) from the previous iteration for the successor states s′.
Why does this work? The Bellman optimality operator used in the update step is a contraction mapping. This mathematical property guarantees that applying the update repeatedly will cause the value function Vk to converge to a unique fixed point, which is precisely the optimal value function V∗. We won't go into the formal proof here, but the intuition is that each iteration brings the estimated values closer to the true optimal values.
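For readers who want a bit more precision: writing T for the Bellman optimality operator, so that (TV)(s) denotes the right-hand side of the update above, the contraction property is the standard inequality

$$
\lVert TV - TU \rVert_{\infty} \le \gamma \, \lVert V - U \rVert_{\infty}
$$

for any two value functions V and U. Since TV∗ = V∗, each sweep shrinks the worst-case error max_s |V_k(s) − V∗(s)| by at least a factor of γ (for γ < 1), which is why the iterates converge to V∗.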
Once Value Iteration converges (i.e., the largest change Δ in the value function during a sweep becomes very small), we have a good approximation of V∗. How do we get the optimal policy π∗ from V∗? We can extract it directly by choosing, in each state, the action that maximizes the expected return, looking one step ahead using the converged V∗:
For each state s:
$$
\pi^*(s) = \arg\max_{a} \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma V^*(s') \right]
$$

This step involves computing the expected value of each action in each state using the final value function V∗ and selecting the action that yields the highest value. Note that this policy extraction step is performed only once, after the value function has converged.
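Continuing the earlier sketch (same assumed model format P, state and action sets, and discount gamma), extraction is a single one-step-lookahead pass over the states; the function name extract_policy is again just an illustrative choice.

```python
def extract_policy(P, states, actions, V, gamma=0.9):
    """Greedy policy extraction from a converged value function V."""
    policy = {}
    for s in states:
        # One-step lookahead: choose the action with the highest expected
        # return r + gamma * V(s'), averaged over the model's transitions.
        policy[s] = max(
            actions,
            key=lambda a: sum(prob * (reward + gamma * V[s_next])
                              for prob, s_next, reward in P[(s, a)])
        )
    return policy
```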
In practice, Value Iteration can be faster overall than Policy Iteration, particularly when the policy evaluation step in PI would itself need many sweeps to converge: VI effectively truncates that evaluation to a single sweep combined with the improvement (maximization) step. However, each VI sweep requires computing a maximum over actions for every state, which can be computationally intensive when the action space is large.
Like Policy Iteration, Value Iteration falls under the umbrella of Dynamic Programming methods. It finds the optimal solution but relies heavily on having a complete and accurate model of the environment's dynamics (p(s′,r∣s,a)). This requirement is a significant limitation, which we'll address when we move to model-free learning methods in the upcoming chapters.