The state-value function $V^\pi(s)$ and the action-value function $Q^\pi(s, a)$ represent the expected return when starting from state $s$, or starting from state $s$ and taking action $a$, respectively, and following policy $\pi$ thereafter. A method is needed to compute these values. The Bellman equations provide this mechanism by relating the value of a state or state-action pair to the values of its potential successor states. They establish a recursive relationship that is fundamental to understanding and solving reinforcement learning problems.
The Bellman expectation equation specifically describes this relationship for a given policy $\pi$. It expresses the value function in terms of the expected immediate reward plus the discounted expected value of the next state. Let's examine this for both $V^\pi$ and $Q^\pi$.
Recall the definition of the state-value function:

$$V^\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$$

where $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots$ is the total discounted return starting from time $t$. We can rewrite the return recursively:

$$G_t = R_{t+1} + \gamma G_{t+1}$$

Substituting this into the definition of $V^\pi(s)$:

$$V^\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s]$$

Using the linearity of expectation, we separate this into two parts:

$$V^\pi(s) = \mathbb{E}_\pi[R_{t+1} \mid S_t = s] + \gamma \mathbb{E}_\pi[G_{t+1} \mid S_t = s]$$

To evaluate these expectations, we need to consider the actions the agent might take according to its policy $\pi(a \mid s)$, and the possible resulting next states and rewards determined by the environment's dynamics $p(s', r \mid s, a)$.
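As a quick numerical check of the recursive identity, the short script below (with a hypothetical reward sequence and $\gamma = 0.9$, neither taken from this section) computes the return directly and via the recursion:

```python
# Sanity check of G_t = R_{t+1} + gamma * G_{t+1} on a made-up reward sequence.
gamma = 0.9
rewards = [1.0, 0.0, 2.0, 5.0]  # hypothetical rewards R_{t+1}, R_{t+2}, R_{t+3}, R_{t+4}

def discounted_return(rewards, gamma):
    """Total discounted return: sum_k gamma^k * rewards[k]."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

G_t = discounted_return(rewards, gamma)             # return from time t
G_t_plus_1 = discounted_return(rewards[1:], gamma)  # return from time t+1

# Both sides agree (up to floating-point rounding).
print(G_t, rewards[0] + gamma * G_t_plus_1)
```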
The first term, $\mathbb{E}_\pi[R_{t+1} \mid S_t = s]$, is the expected immediate reward. The agent first chooses an action $a$ based on $\pi(a \mid s)$, and then the environment responds with a next state $s'$ and reward $r$ based on $p(s', r \mid s, a)$. Summing over all possibilities:

$$\mathbb{E}_\pi[R_{t+1} \mid S_t = s] = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\, r$$

The second term, $\gamma \mathbb{E}_\pi[G_{t+1} \mid S_t = s]$, involves the expected discounted return from the next state $S_{t+1}$. The expectation averages over the actions $a$ taken from state $s$ and the resulting next states $s'$. The expected return starting from a next state $s'$ is simply $V^\pi(s')$. Therefore:

$$\gamma \mathbb{E}_\pi[G_{t+1} \mid S_t = s] = \gamma \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\, V^\pi(s')$$

Note that the reward $r$ in $p(s', r \mid s, a)$ doesn't affect the value of $V^\pi(s')$, so we could also sum over the marginal $p(s' \mid s, a) = \sum_r p(s', r \mid s, a)$.
Putting both terms back together gives the Bellman expectation equation for $V^\pi$:

$$V^\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma V^\pi(s') \right]$$
This equation states that the value of being in state $s$ under policy $\pi$ is the average over all actions $a$ (weighted by the probability $\pi(a \mid s)$ of taking them) of the expected immediate reward $r$ plus the discounted expected value $\gamma V^\pi(s')$ of the subsequent state (averaged over all possible $s'$ and $r$, weighted by $p(s', r \mid s, a)$).
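To make the backup concrete, here is a minimal Python sketch of the right-hand side of this equation for a single state. The data structures are assumptions for illustration, not defined in this section: `policy[s][a]` holds $\pi(a \mid s)$, `dynamics[(s, a)]` is a list of `(prob, next_state, reward)` tuples representing $p(s', r \mid s, a)$, and `V` maps states to current value estimates.

```python
def bellman_backup_v(s, V, policy, dynamics, gamma=0.9):
    """Right-hand side of the Bellman expectation equation for V_pi(s):
    sum_a pi(a|s) sum_{s',r} p(s',r|s,a) [r + gamma * V(s')]."""
    value = 0.0
    for a, action_prob in policy[s].items():          # average over actions under pi
        for prob, next_s, reward in dynamics[(s, a)]:  # average over (s', r) outcomes
            value += action_prob * prob * (reward + gamma * V[next_s])
    return value
```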
It's often helpful to think of this relationship in terms of the action-value function $Q^\pi(s, a)$. The value $V^\pi(s)$ is simply the expected value of $Q^\pi(s, a)$ over the actions $a$ that policy $\pi$ might choose in state $s$:

$$V^\pi(s) = \sum_a \pi(a \mid s)\, Q^\pi(s, a)$$
The value of a state $s$ is the expected value of the action values $Q^\pi(s, a)$, averaged over the actions $a$ chosen by the policy $\pi$.
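In code, with the same assumed `policy` dictionary as above and a nested dictionary `Q[s][a]` of action-value estimates (also an assumption for illustration), this is just a policy-weighted sum:

```python
def v_from_q(s, Q, policy):
    """V_pi(s) = sum_a pi(a|s) * Q_pi(s, a)."""
    return sum(policy[s][a] * Q[s][a] for a in policy[s])
```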
We can derive a similar equation for the action-value function $Q^\pi(s, a)$. Recall its definition:

$$Q^\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$$

Again, substitute $G_t = R_{t+1} + \gamma G_{t+1}$:

$$Q^\pi(s, a) = \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a]$$

Now, the expectation is conditioned on having already taken action $a$ in state $s$. The environment then determines the next state $s'$ and reward $r$ according to $p(s', r \mid s, a)$.
The first term is the expected immediate reward after taking action $a$ in state $s$:

$$\mathbb{E}[R_{t+1} \mid S_t = s, A_t = a] = \sum_{s', r} p(s', r \mid s, a)\, r$$

The second term involves the expected discounted return starting from the next state $S_{t+1} = s'$. Since the agent follows policy $\pi$ from state $s'$ onwards, the expected return is $V^\pi(s')$. Averaging over all possible next states $s'$ and rewards $r$:

$$\gamma \mathbb{E}_\pi[G_{t+1} \mid S_t = s, A_t = a] = \gamma \sum_{s', r} p(s', r \mid s, a)\, V^\pi(s')$$

Combining these gives the Bellman expectation equation for $Q^\pi$ in terms of $V^\pi$:

$$Q^\pi(s, a) = \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma V^\pi(s') \right]$$
This tells us that the value of taking action $a$ in state $s$ is the expected immediate reward $r$ plus the discounted expected value $\gamma V^\pi(s')$ of the next state, averaged over all possible next states $s'$ and rewards $r$.
The value of a state-action pair $Q^\pi(s, a)$ is the expected value over possible next states $s'$ and rewards $r$. Each outcome occurs with probability $p(s', r \mid s, a)$ and contributes its immediate reward plus the discounted value of the next state, $r + \gamma V^\pi(s')$, to the expectation.
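Using the same assumed `dynamics` structure as earlier (tuples of `(prob, next_state, reward)` representing $p(s', r \mid s, a)$), a sketch of this backup is:

```python
def bellman_backup_q_from_v(s, a, V, dynamics, gamma=0.9):
    """Q_pi(s, a) = sum_{s',r} p(s',r|s,a) [r + gamma * V_pi(s')]."""
    return sum(prob * (reward + gamma * V[next_s])
               for prob, next_s, reward in dynamics[(s, a)])
```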
We can also express $Q^\pi$ fully in terms of $Q^\pi$ by substituting $V^\pi(s') = \sum_{a'} \pi(a' \mid s')\, Q^\pi(s', a')$ into the equation:

$$Q^\pi(s, a) = \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \sum_{a'} \pi(a' \mid s')\, Q^\pi(s', a') \right]$$
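A corresponding sketch, again with the assumed `policy`, `dynamics`, and `Q` structures from above, replaces $V^\pi(s')$ with the policy-weighted average of $Q^\pi(s', \cdot)$:

```python
def bellman_backup_q(s, a, Q, policy, dynamics, gamma=0.9):
    """Q_pi(s,a) = sum_{s',r} p(s',r|s,a) [r + gamma * sum_{a'} pi(a'|s') Q_pi(s',a')]."""
    value = 0.0
    for prob, next_s, reward in dynamics[(s, a)]:
        # Expected action value in the next state under the policy.
        expected_next_q = sum(policy[next_s][a_next] * Q[next_s][a_next]
                              for a_next in policy[next_s])
        value += prob * (reward + gamma * expected_next_q)
    return value
```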
For a finite MDP, the Bellman expectation equation for $V^\pi$ gives a system of $|\mathcal{S}|$ linear equations in $|\mathcal{S}|$ unknowns (the values $V^\pi(s)$ for all $s \in \mathcal{S}$). Similarly, the equation for $Q^\pi$ gives a system of $|\mathcal{S}| \times |\mathcal{A}|$ linear equations in $|\mathcal{S}| \times |\mathcal{A}|$ unknowns (the values $Q^\pi(s, a)$ for all $s \in \mathcal{S}$ and $a \in \mathcal{A}$).
These equations are fundamental because they define a consistency condition. The value of a state (or state-action pair) must be consistent with the expected value derived from its possible successors under the policy $\pi$. If we know the policy $\pi$ and the environment dynamics $p(s', r \mid s, a)$ (the MDP), we can, in principle, solve this system of equations to find the exact value function $V^\pi$ or $Q^\pi$. This process is known as policy evaluation.
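Since the system for $V^\pi$ is linear, it can be solved directly with standard linear algebra. The sketch below assumes the MDP has already been collapsed into a state-to-state transition matrix `P_pi` (entries $\sum_a \pi(a \mid s) \sum_r p(s', r \mid s, a)$) and an expected-reward vector `r_pi`; both are hypothetical inputs built from the policy and dynamics, not objects defined in this section.

```python
import numpy as np

def evaluate_policy_exact(P_pi, r_pi, gamma=0.9):
    """Solve the Bellman expectation system V = r_pi + gamma * P_pi @ V,
    i.e. (I - gamma * P_pi) V = r_pi."""
    n = P_pi.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

# Tiny two-state example with made-up numbers.
P_pi = np.array([[0.5, 0.5],
                 [0.2, 0.8]])
r_pi = np.array([1.0, 0.0])
print(evaluate_policy_exact(P_pi, r_pi))  # exact V_pi for this toy MDP
```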
The Bellman expectation equations are the foundation upon which methods like Dynamic Programming (covered next) build to evaluate policies and find optimal ones, assuming a perfect model of the environment is available. They also inspire model-free methods like Temporal-Difference learning (discussed in later chapters) that estimate these values from experience.