In the previous section, we examined the Bellman expectation equation, which provides a recursive relationship for the value function (Vπ) and action-value function (Qπ) under a specific, given policy π. It tells us the expected return if we follow that particular policy. However, the ultimate goal in many reinforcement learning problems isn't just to evaluate a policy, but to find the best policy: the one that achieves the maximum possible expected cumulative reward.
To find this optimal policy, we need a way to describe the value functions associated with it. This leads us to the Bellman optimality equations. These equations define the value functions not for an arbitrary policy π, but for the optimal policy π∗.
First, let's define what we mean by optimal value functions. The optimal state-value function, denoted V∗(s), represents the maximum expected return achievable starting from state s and following any policy thereafter. Similarly, the optimal action-value function, Q∗(s,a), is the maximum expected return achievable starting from state s, taking action a, and then following the optimal policy.
Mathematically, they are defined as:
$$
V^*(s) = \max_{\pi} V_\pi(s) \qquad\qquad Q^*(s, a) = \max_{\pi} Q_\pi(s, a)
$$
These functions represent the upper bound on performance for the given MDP.
The Bellman optimality equation for V∗ expresses the idea that the optimal value of a state s must equal the expected return obtained by taking the best possible action a in that state, and then continuing optimally from the resulting state s′. It incorporates a maximization step over actions, unlike the expectation equation which averages according to the policy π.
The equation is:
$$
V^*(s) = \max_{a \in A(s)} \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma V^*(s') \right]
$$
Let's break this down:

- The max over actions a∈A(s) selects the best action available in state s, rather than averaging over actions according to a policy.
- The sum over s′ and r, weighted by the dynamics p(s′,r∣s,a), takes the expectation over the possible next states and rewards when action a is taken.
- Inside the brackets, the immediate reward r is added to the discounted optimal value γV∗(s′) of the next state.

In short, this equation states that the value of a state under an optimal policy is the expected return for the best action from that state.
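To make the backup concrete, here is a minimal Python sketch of a single Bellman optimality backup for V∗ on a tiny, made-up MDP. The model P, the states, the actions, and the discount GAMMA are illustrative assumptions, not part of any particular library; P[(s, a)] lists (probability, next state, reward) triples, playing the role of p(s′,r∣s,a).

```python
GAMMA = 0.9

# Hypothetical toy model: P[(s, a)] lists (probability, next_state, reward)
# triples, i.e. the dynamics p(s', r | s, a).
P = {
    ("s0", "a0"): [(1.0, "s0", 0.0)],
    ("s0", "a1"): [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
    ("s1", "a0"): [(1.0, "s0", 2.0)],
    ("s1", "a1"): [(1.0, "s1", 0.0)],
}
states = ["s0", "s1"]
actions = ["a0", "a1"]

def optimality_backup_v(V, s):
    """Right-hand side of the Bellman optimality equation for V*(s):
    the best action's expected reward plus discounted next-state value."""
    return max(
        sum(prob * (r + GAMMA * V[s_next]) for prob, s_next, r in P[(s, a)])
        for a in actions
    )

# One backup applied to an all-zero initial guess.
V = {s: 0.0 for s in states}
print({s: optimality_backup_v(V, s) for s in states})
```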
Similarly, we can define the Bellman optimality equation for the optimal action-value function Q∗. The optimal value of taking action a in state s is the expected immediate reward r plus the discounted expected value of the best possible action a′ taken in the next state s′.
The equation is:
$$
Q^*(s, a) = \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \max_{a' \in A(s')} Q^*(s', a') \right]
$$
Here's the breakdown:

- The sum over s′ and r, weighted by p(s′,r∣s,a), takes the expectation over next states and rewards given that action a is taken in state s.
- Inside the brackets, the immediate reward r is added to the discounted value of the best action a′ available in the next state s′.

Note that the maximization now appears inside the expectation and applies to the next action a′, whereas in the V∗ equation it applied to the current action a and sat outside the expectation.
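A similar sketch applies to the Q∗ backup. The function below is an assumption-laden illustration in the same toy representation as the previous sketch: a model P of (probability, next state, reward) triples and a nested dictionary Q[s][a]. It simply evaluates the right-hand side of the equation above.

```python
def optimality_backup_q(Q, s, a, P, actions, gamma=0.9):
    """Right-hand side of the Bellman optimality equation for Q*(s, a):
    expected immediate reward plus the discounted value of the best
    action available in the next state."""
    return sum(
        prob * (r + gamma * max(Q[s_next][a_next] for a_next in actions))
        for prob, s_next, r in P[(s, a)]
    )

# Example usage with the toy model from the previous sketch:
# Q = {s: {a: 0.0 for a in actions} for s in states}
# Q["s0"]["a1"] = optimality_backup_q(Q, "s0", "a1", P, actions, GAMMA)
```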
Notice the relationship between V∗ and Q∗. The optimal value of a state is simply the value of the best action taken from that state:
$$
V^*(s) = \max_{a} Q^*(s, a)
$$
Substituting this into the Bellman optimality equation for Q∗ gives an alternative form:
$$
Q^*(s, a) = \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma V^*(s') \right]
$$
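In code, the relationship between the two optimal value functions is a one-liner. Assuming Q is stored as a nested dictionary Q[s][a], as in the sketches above:

```python
def v_from_q(Q):
    """V*(s) = max_a Q*(s, a) for a nested dictionary Q[s][a]."""
    return {s: max(action_values.values()) for s, action_values in Q.items()}
```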
The Bellman optimality equations are significant because if we can solve them to find V∗ or Q∗, we can easily determine the optimal policy π∗. Specifically, if we have Q∗(s,a), the optimal policy is to act greedily with respect to Q∗ in every state:
$$
\pi^*(s) = \arg\max_{a \in A(s)} Q^*(s, a)
$$
This means that in state s, the optimal policy π∗ chooses the action a that maximizes the optimal action-value function Q∗(s,a). If we only have V∗, we can still find the optimal action by looking one step ahead, using the dynamics p(s′,r∣s,a):
$$
\pi^*(s) = \arg\max_{a \in A(s)} \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma V^*(s') \right]
$$
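Both extraction rules translate directly into code. The sketch below again assumes the toy representation used earlier (a nested dictionary Q_star[s][a], a value table V_star[s], and a model P of (probability, next state, reward) triples); each helper simply acts greedily with respect to the values it is given.

```python
def greedy_policy_from_q(Q_star):
    """pi*(s) = argmax_a Q*(s, a) for a nested dictionary Q_star[s][a]."""
    return {s: max(action_values, key=action_values.get)
            for s, action_values in Q_star.items()}

def greedy_policy_from_v(V_star, P, states, actions, gamma=0.9):
    """pi*(s) = argmax_a sum_{s', r} p(s', r | s, a) [r + gamma * V*(s')],
    i.e. a one-step lookahead through the model P."""
    policy = {}
    for s in states:
        policy[s] = max(
            actions,
            key=lambda a: sum(prob * (r + gamma * V_star[s_next])
                              for prob, s_next, r in P[(s, a)]),
        )
    return policy
```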
It's important to distinguish the Bellman optimality equations from the Bellman expectation equations: the expectation equations average over actions according to a fixed policy π, whereas the optimality equations replace that average with a maximization over actions.
The optimality equations define a system of non-linear equations; the max operator is what introduces the non-linearity. Unlike the expectation equations, which are linear for a fixed policy and can be solved with standard linear-algebra techniques, the optimality equations are more complex to solve directly. However, they form the foundation for algorithms such as Value Iteration, discussed next, which compute these optimal values when a model of the environment is available. These equations capture the essence of finding the best way to behave in an MDP.
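As a preview, here is a minimal sketch of that idea: repeatedly apply the Bellman optimality backup to every state until the values stop changing. It again assumes the toy model format from the earlier sketches and is meant only to illustrate how the optimality equation turns into an iterative update, not to stand in for the full treatment in the next section.

```python
def value_iteration(P, states, actions, gamma=0.9, tol=1e-8):
    """Iteratively apply the Bellman optimality backup until the value
    table stops changing (within tol), yielding an estimate of V*."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = max(
                sum(prob * (r + gamma * V[s_next]) for prob, s_next, r in P[(s, a)])
                for a in actions
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V

# With the toy model from the first sketch:
# V_star = value_iteration(P, states, actions, GAMMA)
```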