In the previous chapter, we defined the state-value function $V_\pi(s)$ and the action-value function $Q_\pi(s,a)$, which represent the expected return when starting from state $s$, or starting from state $s$ and taking action $a$, respectively, and following policy $\pi$ thereafter. Now we need a way to compute these values. The Bellman equations provide this mechanism by relating the value of a state or state-action pair to the values of its potential successor states. They establish a recursive relationship that is fundamental to understanding and solving reinforcement learning problems.
The Bellman expectation equation describes this relationship for a given policy $\pi$. It expresses the value function in terms of the expected immediate reward plus the discounted expected value of the next state. Let's examine this for both $V_\pi$ and $Q_\pi$.
Recall the definition of the state-value function:

$$V_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$$

where $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots$ is the total discounted return starting from time $t$. We can rewrite the return recursively:

$$G_t = R_{t+1} + \gamma G_{t+1}$$

Substituting this into the definition of $V_\pi(s)$:

$$V_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s]$$

Using the linearity of expectation, we separate this into two parts:

$$V_\pi(s) = \mathbb{E}_\pi[R_{t+1} \mid S_t = s] + \gamma\, \mathbb{E}_\pi[G_{t+1} \mid S_t = s]$$

To evaluate these expectations, we need to consider the actions the agent might take according to its policy $\pi(a \mid s)$, and the possible resulting next states $s'$ and rewards $r$ determined by the environment's dynamics $p(s', r \mid s, a)$.
The first term, $\mathbb{E}_\pi[R_{t+1} \mid S_t = s]$, is the expected immediate reward. The agent first chooses an action $a$ based on $\pi(a \mid s)$, and then the environment responds with a next state $s'$ and reward $r$ based on $p(s', r \mid s, a)$. Summing over all possibilities:

$$\mathbb{E}_\pi[R_{t+1} \mid S_t = s] = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\, r$$

The second term, $\gamma\, \mathbb{E}_\pi[G_{t+1} \mid S_t = s]$, involves the expected discounted return from the next state $S_{t+1}$. The expectation $\mathbb{E}_\pi[G_{t+1} \mid S_t = s]$ averages over the actions $a$ taken from state $s$ and the resulting next states $s'$. The expected return starting from a next state $s'$ is simply $V_\pi(s')$. Therefore:

$$\mathbb{E}_\pi[G_{t+1} \mid S_t = s] = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\, V_\pi(s')$$

Note that the reward $r$ in $p(s', r \mid s, a)$ doesn't affect the value $V_\pi(s')$, so we could equivalently sum over $p(s' \mid s, a) = \sum_r p(s', r \mid s, a)$.
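To see how these two terms unfold as nested sums, here is a small Python sketch. The single-state policy `policy_s`, the transition list `p_s`, the value estimates `V`, and the discount $\gamma = 0.9$ are all invented purely for illustration; the code simply evaluates the two expectations above for one state.

```python
# Hypothetical policy, dynamics, and value estimates for a single state s,
# used only to illustrate the two expectation terms as nested sums.
policy_s = [0.5, 0.5]                      # pi(a | s) for actions a = 0, 1
p_s = {                                    # p(s', r | s, a) as (prob, next_state, reward)
    0: [(1.0, 0, 0.0)],
    1: [(0.8, 1, 1.0), (0.2, 0, 0.0)],
}
V = [0.0, 2.0]                             # current estimate of V_pi(s') for s' = 0, 1
gamma = 0.9

# First term: E[R_{t+1} | S_t = s] = sum_a pi(a|s) sum_{s',r} p(s', r | s, a) * r
expected_reward = sum(pi_a * prob * r
                      for a, pi_a in enumerate(policy_s)
                      for prob, _, r in p_s[a])

# Second term: E[G_{t+1} | S_t = s] = sum_a pi(a|s) sum_{s',r} p(s', r | s, a) * V_pi(s')
expected_next_value = sum(pi_a * prob * V[s_next]
                          for a, pi_a in enumerate(policy_s)
                          for prob, s_next, _ in p_s[a])

print(expected_reward + gamma * expected_next_value)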
Putting both terms back together gives the Bellman expectation equation for $V_\pi$:

$$V_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma V_\pi(s')\big]$$
This equation states that the value of being in state $s$ under policy $\pi$ is the average over all actions $a$ (weighted by the probability $\pi(a \mid s)$ of taking them) of the expected immediate reward plus the discounted expected value of the subsequent state $s'$ (averaged over all possible $s'$ and $r$, weighted by $p(s', r \mid s, a)$).
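The following sketch collapses the two terms into a single backup function over a tiny, made-up MDP. The dictionary `P`, the `policy` array, and the discount factor are hypothetical; `bellman_backup_v` evaluates the right-hand side of the Bellman expectation equation for one state, given current estimates of $V_\pi$.

```python
import numpy as np

# Hypothetical two-state, two-action MDP for illustration only.
# P[(s, a)] is a list of (prob, next_state, reward) entries, i.e. p(s', r | s, a).
P = {
    (0, 0): [(1.0, 0, 0.0)],
    (0, 1): [(0.8, 1, 1.0), (0.2, 0, 0.0)],
    (1, 0): [(1.0, 0, 2.0)],
    (1, 1): [(1.0, 1, 0.5)],
}
policy = np.array([[0.5, 0.5],   # pi(a | s=0)
                   [0.9, 0.1]])  # pi(a | s=1)
gamma = 0.9

def bellman_backup_v(s, V):
    """Right-hand side of the Bellman expectation equation for V_pi at state s."""
    value = 0.0
    for a, pi_sa in enumerate(policy[s]):
        for prob, s_next, r in P[(s, a)]:
            value += pi_sa * prob * (r + gamma * V[s_next])
    return value

V = np.zeros(2)                  # current estimate of V_pi
print(bellman_backup_v(0, V))    # with V = 0, this is just the expected immediate reward
```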
It's often helpful to think of this relationship in terms of the action-value function $Q_\pi(s,a)$. The value $V_\pi(s)$ is simply the expected value of $Q_\pi(s,a)$ over the actions $a$ that policy $\pi$ might choose in state $s$:

$$V_\pi(s) = \sum_a \pi(a \mid s)\, Q_\pi(s,a)$$
The value of a state, $V_\pi(s)$, is the expected value of the action values $Q_\pi(s,a)$, averaged over the actions $a$ chosen by the policy $\pi$.
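With $Q_\pi$ and $\pi$ stored as arrays, this relationship is a one-line computation. The numbers in `Q` and `policy` below are made up purely to show the shapes involved.

```python
import numpy as np

# Hypothetical action values Q_pi(s, a) and policy pi(a | s) for 2 states, 2 actions.
Q = np.array([[1.0, 3.0],
              [0.5, 2.5]])
policy = np.array([[0.5, 0.5],
                   [0.9, 0.1]])

# V_pi(s) = sum_a pi(a | s) * Q_pi(s, a), computed for every state at once.
V = np.sum(policy * Q, axis=1)
print(V)   # [2.0, 0.7]
```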
We can derive a similar equation for the action-value function $Q_\pi(s,a)$. Recall its definition:

$$Q_\pi(s,a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$$

Again, substitute $G_t = R_{t+1} + \gamma G_{t+1}$:

$$Q_\pi(s,a) = \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a]$$

$$Q_\pi(s,a) = \mathbb{E}_\pi[R_{t+1} \mid S_t = s, A_t = a] + \gamma\, \mathbb{E}_\pi[G_{t+1} \mid S_t = s, A_t = a]$$

Now the expectation is conditioned on having already taken action $a$ in state $s$. The environment then determines the next state $s'$ and reward $r$ according to $p(s', r \mid s, a)$.
The first term is the expected immediate reward after taking action $a$ in state $s$:

$$\mathbb{E}_\pi[R_{t+1} \mid S_t = s, A_t = a] = \sum_{s', r} p(s', r \mid s, a)\, r$$

The second term involves the expected discounted return starting from the next state $S_{t+1} = s'$. Since the agent follows policy $\pi$ from state $s'$ onwards, this expected return is $\mathbb{E}_\pi[G_{t+1} \mid S_{t+1} = s'] = V_\pi(s')$. Averaging over all possible next states $s'$ and rewards $r$:

$$\gamma\, \mathbb{E}_\pi[G_{t+1} \mid S_t = s, A_t = a] = \gamma \sum_{s', r} p(s', r \mid s, a)\, V_\pi(s')$$

Combining these gives the Bellman expectation equation for $Q_\pi$ in terms of $V_\pi$:

$$Q_\pi(s,a) = \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma V_\pi(s')\big]$$
This tells us that the value of taking action $a$ in state $s$ is the expected immediate reward plus the discounted expected value of the next state, averaged over all possible next states $s'$ and rewards $r$.
The value of a state-action pair, $Q_\pi(s,a)$, is an expectation over the possible next states $s'$ and rewards $r$. Each outcome $(s', r)$ occurs with probability $p(s', r \mid s, a)$ and contributes its immediate reward $r$ plus the discounted value of the next state, $\gamma V_\pi(s')$, to the expectation.
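A sketch of this backup, mirroring the earlier one for $V_\pi$: the dynamics dictionary `P`, the discount factor, and the value estimates are again invented for illustration, and `bellman_backup_q` evaluates the right-hand side of the equation for a single state-action pair.

```python
import numpy as np

# Hypothetical dynamics p(s', r | s, a) as lists of (prob, next_state, reward).
P = {
    (0, 0): [(1.0, 0, 0.0)],
    (0, 1): [(0.8, 1, 1.0), (0.2, 0, 0.0)],
    (1, 0): [(1.0, 0, 2.0)],
    (1, 1): [(1.0, 1, 0.5)],
}
gamma = 0.9

def bellman_backup_q(s, a, V):
    """Q_pi(s, a) expressed through V_pi: the expectation of r + gamma * V_pi(s')."""
    return sum(prob * (r + gamma * V[s_next]) for prob, s_next, r in P[(s, a)])

V = np.array([2.0, 0.7])             # some estimate of V_pi
print(bellman_backup_q(0, 1, V))     # 0.8*(1.0 + 0.9*0.7) + 0.2*(0.0 + 0.9*2.0)
```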
We can also express $Q_\pi(s,a)$ fully in terms of $Q_\pi$ by substituting $V_\pi(s') = \sum_{a'} \pi(a' \mid s')\, Q_\pi(s', a')$ into the equation:

$$Q_\pi(s,a) = \sum_{s', r} p(s', r \mid s, a)\Big[r + \gamma \sum_{a'} \pi(a' \mid s')\, Q_\pi(s', a')\Big]$$
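The fully expanded backup can be coded the same way, with the inner sum over $a'$ made explicit. As before, the dynamics, policy, and current estimates of $Q_\pi$ are hypothetical placeholders.

```python
# Hypothetical dynamics and policy, for illustration only.
P = {
    (0, 0): [(1.0, 0, 0.0)],
    (0, 1): [(0.8, 1, 1.0), (0.2, 0, 0.0)],
    (1, 0): [(1.0, 0, 2.0)],
    (1, 1): [(1.0, 1, 0.5)],
}
policy = [[0.5, 0.5], [0.9, 0.1]]      # pi(a | s)
gamma = 0.9

def bellman_backup_q_from_q(s, a, Q):
    """Q_pi(s, a) written entirely in terms of Q_pi at the successor state."""
    total = 0.0
    for prob, s_next, r in P[(s, a)]:
        # Inner sum over a': pi(a' | s') * Q_pi(s', a')
        v_next = sum(pi_next * Q[s_next][a_next]
                     for a_next, pi_next in enumerate(policy[s_next]))
        total += prob * (r + gamma * v_next)
    return total

Q = [[0.0, 0.0], [0.0, 0.0]]           # current estimate of Q_pi
print(bellman_backup_q_from_q(0, 1, Q))
```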
For a finite MDP, the Bellman expectation equation for $V_\pi$ gives a system of $|\mathcal{S}|$ linear equations in $|\mathcal{S}|$ unknowns (the values $V_\pi(s)$ for all $s \in \mathcal{S}$). Similarly, the equation for $Q_\pi$ gives a system of $|\mathcal{S}| \times |\mathcal{A}|$ linear equations in $|\mathcal{S}| \times |\mathcal{A}|$ unknowns (the values $Q_\pi(s,a)$ for all $s \in \mathcal{S}$, $a \in \mathcal{A}$).
These equations are fundamental because they define a consistency condition: the value of a state (or state-action pair) must be consistent with the expected value derived from its possible successors under the policy $\pi$. If we know the policy $\pi$ and the environment dynamics (the MDP), we can, in principle, solve this system of equations to find the exact value function $V_\pi$ or $Q_\pi$. This process is known as policy evaluation.
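Because the system for $V_\pi$ is linear, a small MDP can be evaluated exactly with a standard linear solver. The sketch below, using the same invented two-state MDP as the earlier examples, builds the policy-averaged transition matrix $P_\pi$ and expected reward vector $r_\pi$, then solves $(I - \gamma P_\pi)\, V_\pi = r_\pi$.

```python
import numpy as np

# Hypothetical two-state, two-action MDP: p(s', r | s, a) as (prob, next_state, reward).
P = {
    (0, 0): [(1.0, 0, 0.0)],
    (0, 1): [(0.8, 1, 1.0), (0.2, 0, 0.0)],
    (1, 0): [(1.0, 0, 2.0)],
    (1, 1): [(1.0, 1, 0.5)],
}
policy = np.array([[0.5, 0.5],
                   [0.9, 0.1]])
gamma = 0.9
n_states = 2

# Build the policy-averaged transition matrix P_pi and expected reward vector r_pi.
P_pi = np.zeros((n_states, n_states))
r_pi = np.zeros(n_states)
for s in range(n_states):
    for a, pi_sa in enumerate(policy[s]):
        for prob, s_next, r in P[(s, a)]:
            P_pi[s, s_next] += pi_sa * prob
            r_pi[s] += pi_sa * prob * r

# Solve (I - gamma * P_pi) V = r_pi exactly: this is policy evaluation for V_pi.
V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
print(V)
```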
The Bellman expectation equations are the foundation upon which methods like Dynamic Programming (covered next) build to evaluate policies and find optimal ones, assuming a perfect model of the environment is available. They also inspire model-free methods like Temporal-Difference learning (discussed in later chapters) that estimate these values from experience.