Okay, we've seen the Bellman equations, which give us a recursive relationship for value functions. The Bellman optimality equations, in particular, define the best possible value function (V∗ or Q∗). But how do we find that optimal value function and its corresponding optimal policy π∗?
Dynamic Programming (DP) offers a class of algorithms that can compute optimal policies, provided we have a complete model of the environment. This means we need to know the transition probabilities p(s′,r∣s,a) for all states s, actions a, next states s′, and rewards r. This is a strong assumption, often not met in practice (which is why we'll study model-free methods later!), but DP provides a solid theoretical foundation.
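To make the "complete model" requirement concrete, here is a minimal sketch of how such a model could be stored for a tiny, made-up two-state MDP. The nested-dict layout P[s][a] → list of (probability, next_state, reward) tuples is just one convenient convention (similar in spirit to the tabular models exposed by classic toy-text environments), and the later code sketches in this section assume it.

```python
# Hypothetical tabular model of a tiny 2-state, 2-action MDP.
# P[s][a] enumerates p(s', r | s, a) as (probability, next_state, reward) tuples.
P = {
    0: {
        0: [(0.8, 0, 0.0), (0.2, 1, 1.0)],  # action 0: mostly stay, sometimes move and get reward 1
        1: [(1.0, 1, 0.5)],                 # action 1: always move to state 1 with reward 0.5
    },
    1: {
        0: [(1.0, 1, 0.0)],                 # state 1 is absorbing with zero reward
        1: [(1.0, 1, 0.0)],
    },
}
num_states, num_actions = 2, 2
```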
One fundamental DP method is Policy Iteration. It finds the optimal policy by alternating between two main steps: evaluating the current policy and then improving it. Let's break it down.
Imagine you start with some arbitrary policy, let's call it π0. It might be completely random, or it might be based on some simple heuristic. Policy Iteration aims to iteratively refine this policy until it converges to the optimal one, π∗. The process looks like this:
$$\pi_0 \xrightarrow{\text{Evaluate}} V^{\pi_0} \xrightarrow{\text{Improve}} \pi_1 \xrightarrow{\text{Evaluate}} V^{\pi_1} \xrightarrow{\text{Improve}} \pi_2 \xrightarrow{\text{Evaluate}} \cdots \xrightarrow{\text{Improve}} \pi^* \xrightarrow{\text{Evaluate}} V^*$$
This involves two core components repeated in a loop:

1. Policy Evaluation: compute the value function Vπ for the current policy π.
2. Policy Improvement: use Vπ to construct a better policy, typically by acting greedily with respect to it.
This cycle continues until the policy no longer changes between iterations, indicating that we've found the optimal policy.
In the policy evaluation step, we take the current policy π and compute the state-value function Vπ(s) for all states s. Recall the Bellman expectation equation for Vπ:
$$V^\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma V^\pi(s')\bigr]$$

For a finite MDP, this equation gives us a system of linear equations, one for each state. If there are ∣S∣ states, we have ∣S∣ equations and ∣S∣ unknowns (Vπ(s) for each s). We could solve this system directly.
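Because the unknowns Vπ(s) appear linearly, the system can be solved in one shot with a standard linear-algebra routine. The sketch below is one way to do this with NumPy, assuming the hypothetical P model layout introduced earlier and a policy stored as an array pi[s][a] of action probabilities; neither representation is mandated by the math itself.

```python
import numpy as np

def evaluate_policy_exact(P, pi, num_states, gamma=0.9):
    """Solve the linear system (I - gamma * P_pi) V = r_pi for V^pi.

    P  : assumed model layout, P[s][a] -> list of (prob, next_state, reward)
    pi : policy array, pi[s][a] = probability of taking action a in state s
    """
    P_pi = np.zeros((num_states, num_states))  # state-transition matrix under pi
    r_pi = np.zeros(num_states)                # expected one-step reward under pi
    for s in range(num_states):
        for a, outcomes in P[s].items():
            for prob, s_next, reward in outcomes:
                P_pi[s, s_next] += pi[s][a] * prob
                r_pi[s] += pi[s][a] * prob * reward
    # V^pi = r_pi + gamma * P_pi * V^pi  =>  (I - gamma * P_pi) V^pi = r_pi
    return np.linalg.solve(np.eye(num_states) - gamma * P_pi, r_pi)
```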
However, an often simpler and more generally applicable approach, especially within the iterative nature of policy iteration, is to compute Vπ iteratively. We start with an arbitrary value function V0 (e.g., all zeros) and repeatedly apply the Bellman expectation equation as an update rule:
$$V_{k+1}(s) \leftarrow \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma V_k(s')\bigr] \quad \text{for all } s \in \mathcal{S}$$

We sweep through all states, updating their values based on the values from the previous iteration (Vk). This process is guaranteed to converge to the true Vπ as k→∞. In practice, we stop when the changes in the value function become very small between iterations (i.e., when $\lVert V_{k+1} - V_k \rVert_\infty < \theta$ for some small threshold θ). This iterative approach is essentially applying the Bellman operator repeatedly.
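Here is a minimal sketch of that iterative sweep, again assuming the hypothetical P layout and the pi[s][a] policy array used above. It performs synchronous backups, building Vk+1 from Vk on each sweep, and stops when the largest per-state change falls below theta.

```python
import numpy as np

def iterative_policy_evaluation(P, pi, num_states, gamma=0.9, theta=1e-8):
    """Repeated synchronous Bellman expectation backups until convergence."""
    V = np.zeros(num_states)             # V_0: arbitrary start (here all zeros)
    while True:
        V_new = np.zeros(num_states)     # V_{k+1}, built entirely from V_k
        for s in range(num_states):
            for a, outcomes in P[s].items():
                for prob, s_next, reward in outcomes:
                    V_new[s] += pi[s][a] * prob * (reward + gamma * V[s_next])
        # Stop once ||V_{k+1} - V_k||_inf < theta.
        if np.max(np.abs(V_new - V)) < theta:
            return V_new
        V = V_new
```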
Once we have a reasonably accurate estimate of Vπ for our current policy π, we ask: "Can we improve this policy?" The policy improvement theorem gives us a way to do this.
For a given state s, we can look ahead one step and consider the expected return if we choose an action a (possibly different from the one π would select) and then follow the existing policy π thereafter. This is precisely the definition of the action-value function Qπ(s,a):
$$Q^\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma V^\pi(s')\bigr]$$

Now, suppose for some state s we find an action a such that Qπ(s,a)>Vπ(s). This means that taking action a once in state s and then following π is better than just following π from the start. The policy improvement theorem states that if we construct a new policy π′ that chooses this "better" action a in state s (and follows π elsewhere, or better yet, acts greedily everywhere), the new policy π′ will be at least as good as π.
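As a small illustration, this hypothetical helper performs the one-step lookahead for a single state, returning Qπ(s,a) for every action given an estimate of Vπ (same assumed model layout as before).

```python
import numpy as np

def q_from_v(P, V, s, num_actions, gamma=0.9):
    """One-step lookahead: Q^pi(s, a) for every action a, given V^pi."""
    q = np.zeros(num_actions)
    for a, outcomes in P[s].items():
        for prob, s_next, reward in outcomes:
            q[a] += prob * (reward + gamma * V[s_next])
    return q
```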
The most common way to improve the policy is to make it greedy with respect to Vπ. That is, for each state s, the new policy π′(s) deterministically chooses the action a that maximizes Qπ(s,a):
$$\pi'(s) \leftarrow \arg\max_{a} Q^\pi(s, a) = \arg\max_{a} \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma V^\pi(s')\bigr]$$

This new greedy policy π′ is guaranteed to be at least as good as π. If $V^{\pi'} = V^\pi$, then we have reached the optimal value function V∗ and the optimal policy π∗, because the Bellman optimality equation is satisfied. If $V^{\pi'} \neq V^\pi$, then π′ is strictly better, and we loop back to evaluate this improved policy.
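The greedy improvement step then just applies that one-step lookahead in every state and picks the maximizing action. In this sketch the new policy is returned as a one-hot probability array so it can be fed straight back into the evaluation routine above; a plain array of chosen actions would work equally well.

```python
import numpy as np

def greedy_improvement(P, V, num_states, num_actions, gamma=0.9):
    """Build pi'(s) = argmax_a Q^pi(s, a), encoded as a one-hot policy array.

    Reuses the q_from_v helper sketched earlier.
    """
    pi_new = np.zeros((num_states, num_actions))
    for s in range(num_states):
        best_a = np.argmax(q_from_v(P, V, s, num_actions, gamma))
        pi_new[s][best_a] = 1.0
    return pi_new
```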
Putting it together, the Policy Iteration algorithm proceeds as follows:
1. Initialization: Initialize V(s) arbitrarily (e.g., to zero) and choose an arbitrary deterministic policy π(s) for all s∈S.
2. Policy Evaluation: Using the current policy π, repeatedly sweep through the states applying the Bellman expectation update until the largest change in V across a sweep is smaller than a small threshold θ.
3. Policy Improvement: Set policy_stable ← true. Then, for each state s:
   - old_action ← π(s) (assuming a deterministic policy for simplicity)
   - π(s) ← $\arg\max_a \sum_{s',r} p(s',r \mid s,a)\,[r + \gamma V(s')]$
   - If old_action ≠ π(s), set policy_stable ← false
4. If policy_stable is true: Stop. The current policy π and value function V are optimal (π∗ and V∗). Otherwise, return to step 2 and evaluate the improved policy.

The cycle of Policy Evaluation and Policy Improvement in Policy Iteration. The process continues until the policy no longer changes after the improvement step.
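Putting the pieces together in code, here is a sketch of the full loop built from the hypothetical helpers defined above. It starts from the uniform random policy and stops as soon as the greedy improvement leaves the policy unchanged.

```python
import numpy as np

def policy_iteration(P, num_states, num_actions, gamma=0.9, theta=1e-8):
    """Alternate policy evaluation and greedy improvement until the policy is stable."""
    # Start from the uniform random policy pi_0.
    pi = np.full((num_states, num_actions), 1.0 / num_actions)
    while True:
        V = iterative_policy_evaluation(P, pi, num_states, gamma, theta)
        pi_new = greedy_improvement(P, V, num_states, num_actions, gamma)
        if np.array_equal(pi_new, pi):   # policy_stable: no action changed in any state
            return pi, V                 # pi is (approximately) pi*, V is (approximately) V*
        pi = pi_new

# Usage with the tiny hypothetical model from earlier:
# pi_star, V_star = policy_iteration(P, num_states, num_actions)
```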
Policy Iteration is guaranteed to converge to the optimal policy π∗ in a finite number of iterations for a finite MDP. This is because there are only a finite number of possible policies, and each policy improvement step results in a strictly better policy (unless it's already optimal). Since the value function associated with each policy increases or stays the same, and values are bounded, the process must terminate.
The main drawback? Each step, especially policy evaluation, can be computationally expensive, requiring potentially many sweeps through the entire state space. Furthermore, it critically depends on having the complete MDP model (p(s′,r∣s,a)), which is often unavailable in real-world problems. This limitation motivates the model-free methods we will explore in subsequent chapters.