Having defined the components of a Markov Decision Process (MDP) in the previous section, we now turn our attention to evaluating the "goodness" of different strategies, or policies, within that framework. This evaluation is central to reinforcement learning, as our ultimate goal is often to find the best possible policy. The Bellman equations provide the mathematical foundation for relating the value of a state (or state-action pair) to the values of subsequent states, forming the bedrock upon which many RL algorithms are built.
Let's assume we have a fixed policy $\pi$, which is a mapping from states to probabilities of taking each possible action, $\pi(a \mid s)$. How valuable is it to be in a particular state $s$ while following this policy? We quantify this using the state-value function, $V_\pi(s)$, defined as the expected total discounted reward starting from state $s$ and subsequently following policy $\pi$:

$$V_\pi(s) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right]$$

Here, $\mathbb{E}_\pi[\cdot]$ denotes the expected value given that the agent follows policy $\pi$, $R_{t+k+1}$ is the reward received at timestep $t+k+1$, and $\gamma \in [0, 1]$ is the discount factor that trades off immediate versus future rewards.
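To make the return inside the expectation concrete, here is a small illustrative snippet (the function name and reward values are made up) that computes a truncated discounted sum $\sum_k \gamma^k R_{t+k+1}$ for a finite list of observed rewards:

```python
def discounted_return(rewards, gamma=0.9):
    """Compute sum_k gamma^k * rewards[k] for a finite reward sequence."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Example: rewards observed after time t while following some policy.
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
```

The expectation in $V_\pi(s)$ averages such returns over all trajectories that the policy and the environment dynamics can generate from state $s$.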
The Bellman expectation equation provides a recursive definition for $V_\pi(s)$. It breaks down the value of a state into the expected immediate reward and the expected discounted value of the next state:

$$V_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma V_\pi(s')\right]$$
Let's dissect this equation:

- $\pi(a \mid s)$ is the probability that the policy selects action $a$ in state $s$; the outer sum averages over the policy's action choices.
- $p(s', r \mid s, a)$ is the environment's dynamics: the probability of landing in state $s'$ with reward $r$ after taking action $a$ in state $s$; the inner sum averages over these outcomes.
- $r + \gamma V_\pi(s')$ is the immediate reward plus the discounted value of the next state, i.e., a one-step lookahead estimate of the return.
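As a sketch of how this backup looks in code, assume a small tabular MDP stored in plain dictionaries: `policy[s][a]` for $\pi(a \mid s)$, `dynamics[(s, a)]` for a list of $(p, s', r)$ outcomes, and `V[s']` for the current value estimates. These names are illustrative, not a standard API:

```python
def bellman_expectation_v(s, policy, dynamics, V, gamma=0.9):
    """One Bellman expectation backup for state s under a fixed policy."""
    value = 0.0
    for a, pi_a in policy[s].items():             # outer sum over actions, weighted by pi(a|s)
        for prob, s_next, r in dynamics[(s, a)]:  # inner sum over (s', r) outcomes
            value += pi_a * prob * (r + gamma * V[s_next])
    return value
```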
Similarly, we can define the action-value function, $Q_\pi(s, a)$, which represents the expected return starting from state $s$, taking action $a$, and then following policy $\pi$:

$$Q_\pi(s, a) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s, A_t = a\right]$$
The Bellman expectation equation for $Q_\pi(s, a)$ is:

$$Q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma \sum_{a'} \pi(a' \mid s') Q_\pi(s', a')\right]$$

This equation states that the value of taking action $a$ in state $s$ is the expected immediate reward plus the expected discounted value of the next state-action pair, averaged over all possible next actions $a'$ chosen according to policy $\pi$ in the next state $s'$.
Notice the direct relationship between $V_\pi$ and $Q_\pi$: the value of a state is the policy-weighted average of its action values,

$$V_\pi(s) = \sum_a \pi(a \mid s)\, Q_\pi(s, a)$$
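Continuing the same illustrative dictionary conventions (`policy`, `dynamics`, and now `Q[(s, a)]` for the action values), a sketch of the $Q_\pi$ backup and the $V_\pi$-from-$Q_\pi$ identity might look like:

```python
def bellman_expectation_q(s, a, policy, dynamics, Q, gamma=0.9):
    """One Bellman expectation backup for the state-action pair (s, a)."""
    value = 0.0
    for prob, s_next, r in dynamics[(s, a)]:  # sum over (s', r) outcomes
        # Average Q over the policy's action choices in the next state.
        next_value = sum(policy[s_next][a2] * Q[(s_next, a2)] for a2 in policy[s_next])
        value += prob * (r + gamma * next_value)
    return value

def v_from_q(s, policy, Q):
    """V_pi(s) = sum_a pi(a|s) * Q_pi(s, a)."""
    return sum(policy[s][a] * Q[(s, a)] for a in policy[s])
```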
For a given policy $\pi$ and known dynamics $p(s', r \mid s, a)$, the Bellman expectation equations define a system of linear equations. If the state space is small enough, we could solve this system directly to find the value function $V_\pi$ or $Q_\pi$ corresponding to the policy $\pi$.
*Recursive calculation for $Q_\pi(s, a)$: the value depends on the expected immediate reward $r$ and the discounted expected value of the next state-action pair $Q_\pi(s', a')$, averaged over the environment dynamics and the policy's choices in the next state.*
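For example, under a fixed policy the state values satisfy $V_\pi = r_\pi + \gamma P_\pi V_\pi$, where $P_\pi(s, s') = \sum_a \pi(a \mid s)\, p(s' \mid s, a)$ and $r_\pi(s)$ is the expected one-step reward under $\pi$. A minimal sketch for a hypothetical two-state MDP (the numbers below are made up) solves this linear system with NumPy:

```python
import numpy as np

gamma = 0.9
P_pi = np.array([[0.7, 0.3],      # P_pi[s, s'] = sum_a pi(a|s) * p(s'|s, a)
                 [0.4, 0.6]])
r_pi = np.array([1.0, 0.5])       # r_pi[s] = expected immediate reward under pi

# V_pi = (I - gamma * P_pi)^{-1} r_pi
V = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
print(V)  # exact V_pi for this policy
```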
While evaluating a given policy is useful, the ultimate objective in many RL settings is to find an optimal policy, denoted $\pi_*$, that achieves the highest possible expected return from any starting state. We define the optimal state-value function $V_*(s)$ and the optimal action-value function $Q_*(s, a)$ as:

$$V_*(s) = \max_\pi V_\pi(s) \qquad Q_*(s, a) = \max_\pi Q_\pi(s, a)$$
The Bellman optimality equations define the value of the optimal policy. Unlike the expectation equations, they incorporate a maximization step, reflecting the fact that an optimal policy must choose the best action in each state.
The Bellman optimality equation for $V_*(s)$ is:

$$V_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma V_*(s')\right]$$

This equation states that the optimal value of a state $s$ is obtained by choosing the action $a$ that maximizes the expected sum of the immediate reward $r$ and the discounted optimal value $\gamma V_*(s')$ of the resulting next state $s'$.
The Bellman optimality equation for $Q_*(s, a)$ is often more convenient for decision-making:

$$Q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma \max_{a'} Q_*(s', a')\right]$$

Here, the optimal value of taking action $a$ in state $s$ is the expected immediate reward plus the expected discounted value of the best possible action $a'$ available in the next state $s'$. The term $\max_{a'} Q_*(s', a')$ is equivalent to $V_*(s')$, highlighting the connection:

$$V_*(s) = \max_a Q_*(s, a)$$
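As a small sketch using the same illustrative dictionary layout as before, with `actions[s]` listing the actions available in $s$, a single Bellman optimality backup for $V_*$ replaces the policy-weighted average with a maximization over actions:

```python
def bellman_optimality_v(s, dynamics, actions, V, gamma=0.9):
    """One Bellman optimality backup: V(s) <- max_a sum_{s',r} p(s',r|s,a) [r + gamma V(s')]."""
    return max(
        sum(prob * (r + gamma * V[s_next]) for prob, s_next, r in dynamics[(s, a)])
        for a in actions[s]
    )
```

Repeatedly sweeping this backup over all states is the core idea behind Value Iteration, covered next.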
Crucially, the Bellman optimality equations form a system of non-linear equations due to the $\max$ operator. Solving this system yields the optimal value functions $V_*$ and $Q_*$. Once we have $Q_*$, finding the optimal policy $\pi_*$ is straightforward: in any state $s$, act greedily with respect to $Q_*$. That is, choose the action $a$ that maximizes $Q_*(s, a)$:

$$\pi_*(s) = \arg\max_a Q_*(s, a)$$

(If multiple actions share the maximum value, any of them can be chosen.)
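A sketch of this greedy extraction, assuming a tabular `Q` dictionary keyed by `(s, a)` pairs and an `actions[s]` list (illustrative names again):

```python
def greedy_policy(Q, states, actions):
    """Extract pi*(s) = argmax_a Q*(s, a); ties go to the first maximizing action."""
    return {s: max(actions[s], key=lambda a: Q[(s, a)]) for s in states}
```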
These optimality equations don't directly tell us how to find $V_*$ or $Q_*$, but they define the conditions that the optimal values must satisfy. Algorithms like Value Iteration and Policy Iteration (discussed next) leverage these equations to compute optimal policies in settings where the model $p(s', r \mid s, a)$ is known. For model-free settings, methods like Q-Learning approximate $Q_*$ iteratively from experience. Understanding the Bellman equations is therefore essential for grasping nearly all fundamental reinforcement learning algorithms.