We've established how to model sequential decision problems using MDPs and how to evaluate a given policy π using its value functions, Vπ(s) and Qπ(s,a). Remember, Vπ(s) represents the expected return starting from state s and following policy π, while Qπ(s,a) is the expected return starting from state s, taking action a, and subsequently following policy π.
The ultimate objective in solving an MDP is not just to evaluate any policy, but to find the best policy. What does "best" mean? In the context of RL and MDPs, it means maximizing the expected cumulative discounted reward. We call such a policy an optimal policy, typically denoted as π∗.
Formally, a policy π is considered optimal if its expected return is greater than or equal to the expected return of any other policy π′ for all states s∈S.
Vπ(s)≥Vπ′(s)  for all s∈S and all policies π′
It's important to note that there might be more than one optimal policy, but they all achieve the maximum possible expected return. All optimal policies share the same optimal state-value function, denoted V∗, and the same optimal action-value function, denoted Q∗.
These optimal value functions are defined as:
V∗(s)=maxπVπ(s)  for all s∈S
Q∗(s,a)=maxπQπ(s,a)  for all s∈S, a∈A
V∗(s) gives the largest expected return achievable starting from state s, regardless of the specific optimal policy followed. Similarly, Q∗(s,a) gives the largest expected return achievable starting from state s, taking action a, and then following an optimal policy thereafter.
There's a direct and powerful relationship between the optimal action-value function Q∗ and an optimal policy π∗. If you knew the true Q∗ function, finding an optimal policy would be straightforward. You would simply choose the action that maximizes Q∗(s,a) in each state s. A policy that acts greedily with respect to the optimal action-value function Q∗ is guaranteed to be an optimal policy.
So, an optimal policy π∗ can be defined as:
π∗(s)=argmaxa∈AQ∗(s,a)
This means that in state s, an optimal policy selects the action a that yields the highest expected return, according to Q∗. If multiple actions share the maximum value, any of them can be chosen by the optimal policy.
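For instance, if Q∗ were available as a table, extracting a greedy policy is a single operation. The sketch below uses a small, hypothetical Q-table (the array q_star, its shape, and its values are purely illustrative, not from this chapter):

```python
import numpy as np

# A minimal sketch: extracting a greedy (optimal) policy from a known Q* table.
# q_star is a hypothetical array of shape (num_states, num_actions), where
# q_star[s, a] stores Q*(s, a).
q_star = np.array([
    [1.0, 2.5, 0.3],   # Q*(s=0, ·)
    [0.0, 0.1, 4.2],   # Q*(s=1, ·)
    [3.3, 3.3, 1.0],   # Q*(s=2, ·) -- ties may be broken arbitrarily
])

# pi*(s) = argmax_a Q*(s, a); np.argmax resolves ties by taking the first maximum.
greedy_policy = np.argmax(q_star, axis=1)
print(greedy_policy)  # [1 2 0]
```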
Just as the value functions for an arbitrary policy π satisfy the Bellman expectation equations, the optimal value functions V∗ and Q∗ satisfy their own specific Bellman equations, known as the Bellman Optimality Equations. These equations provide a recursive definition for the optimal values.
The Bellman Optimality Equation for V∗ is:
V∗(s)=maxa∈AQ∗(s,a)
V∗(s)=maxa∈A∑s′∈SP(s′∣s,a)[R(s,a,s′)+γV∗(s′)]
This equation expresses that the optimal value of a state s must equal the expected return obtained by taking the best possible action a from that state and then following an optimal policy thereafter. The max operator signifies that the agent chooses the action maximizing the expected immediate reward plus the discounted value of the successor state.
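To make this backup concrete, here is a minimal NumPy sketch of a single Bellman optimality update for V∗ on a small tabular MDP. The arrays P and R, the discount gamma, and the random model are hypothetical placeholders rather than an environment defined in this chapter:

```python
import numpy as np

# A minimal sketch of one Bellman optimality backup for V*, assuming a tabular
# model: P[s, a, s'] holds P(s' | s, a) and R[s, a, s'] holds R(s, a, s').
num_states, num_actions = 3, 2
rng = np.random.default_rng(0)

P = rng.random((num_states, num_actions, num_states))
P /= P.sum(axis=2, keepdims=True)   # normalize so each P[s, a, :] is a distribution
R = rng.random((num_states, num_actions, num_states))
gamma = 0.9

V = np.zeros(num_states)            # current estimate of V*

# V(s) <- max_a sum_{s'} P(s'|s,a) [ R(s,a,s') + gamma * V(s') ]
q_values = np.einsum('sat,sat->sa', P, R + gamma * V)  # expected return of each (s, a)
V_new = q_values.max(axis=1)
print(V_new)
```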
Similarly, the Bellman Optimality Equation for Q∗ is:
Q∗(s,a)=∑s′∈SP(s′∣s,a)[R(s,a,s′)+γV∗(s′)]
Since V∗(s′)=maxa′∈AQ∗(s′,a′), we can write:
Q∗(s,a)=∑s′∈SP(s′∣s,a)[R(s,a,s′)+γmaxa′∈AQ∗(s′,a′)]
This equation states that the optimal value of taking action a in state s is the expectation, over next states s′, of the immediate reward R(s,a,s′) plus the discounted optimal value of s′, assuming the agent acts optimally (maxa′Q∗(s′,a′)) from that next state onwards.
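The corresponding backup for Q∗ looks almost identical; the only difference is that the maximization over next actions sits inside the expectation. Again, the tabular model below is a hypothetical placeholder with the same setup as the previous sketch:

```python
import numpy as np

# A minimal sketch of one Bellman optimality backup for Q* on a small,
# randomly generated tabular MDP (P, R, and gamma are illustrative only).
num_states, num_actions = 3, 2
rng = np.random.default_rng(0)

P = rng.random((num_states, num_actions, num_states))
P /= P.sum(axis=2, keepdims=True)   # each P[s, a, :] sums to 1
R = rng.random((num_states, num_actions, num_states))
gamma = 0.9

Q = np.zeros((num_states, num_actions))   # current estimate of Q*

# Q(s,a) <- sum_{s'} P(s'|s,a) [ R(s,a,s') + gamma * max_{a'} Q(s', a') ]
V_next = Q.max(axis=1)                    # max_{a'} Q(s', a') for every next state s'
Q_new = np.einsum('sat,sat->sa', P, R + gamma * V_next)
print(Q_new)
```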
These optimality equations are fundamental. If we can find a value function V or Q that satisfies the corresponding Bellman optimality equation, then that value function is the optimal value function (V∗ or Q∗), and we can derive an optimal policy from it.
Solving an MDP essentially means finding V∗ or Q∗. While the Bellman optimality equations define these functions, they don't directly provide a solution method, especially since they often form a system of non-linear equations. The next chapter will introduce algorithms based on Dynamic Programming, like Value Iteration and Policy Iteration, which are designed to compute these optimal value functions, assuming we have a complete model of the MDP (i.e., we know the transition probabilities P and reward function R). Later chapters will explore how to find optimal policies even when such a model is unavailable.