To evaluate the effectiveness of different strategies, or policies, within a Markov Decision Process (MDP) framework, and ultimately to find the optimal policy, a mathematical foundation is essential. This evaluation is central to reinforcement learning. The Bellman equations provide this foundation, relating the value of a state (or state-action pair) to the values of subsequent states, and serving as the basis upon which many reinforcement learning algorithms are built.
Let's assume we have a fixed policy $\pi$, which is a mapping from states to probabilities of taking each possible action, $\pi(a|s)$. How valuable is it to be in a particular state $s$ while following this policy? We quantify this using the state-value function, $V^\pi(s)$, defined as the expected total discounted reward starting from state $s$ and subsequently following policy $\pi$:

$$V^\pi(s) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\Big|\, S_t = s\right]$$
Here, $\mathbb{E}_\pi[\cdot]$ denotes the expected value given that the agent follows policy $\pi$, $R_{t+k+1}$ is the reward received at timestep $t+k+1$, and $\gamma \in [0, 1]$ is the discount factor that trades off immediate versus future rewards.
The Bellman expectation equation provides a recursive definition for $V^\pi(s)$. It breaks down the value of a state into the expected immediate reward and the expected discounted value of the next state:

$$V^\pi(s) = \sum_{a} \pi(a|s) \sum_{s', r} p(s', r \mid s, a)\left[\, r + \gamma V^\pi(s') \,\right]$$
Let's dissect this equation: the outer sum weights each action $a$ by the probability $\pi(a|s)$ that the policy selects it in state $s$; the inner sum weights each possible next state $s'$ and reward $r$ by the environment dynamics $p(s', r \mid s, a)$; and the bracketed term $r + \gamma V^\pi(s')$ is the immediate reward plus the discounted value of the state the agent lands in.
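To make the recursion concrete, here is a minimal sketch of this backup on a tiny tabular MDP. Everything in it is an assumption made for illustration: the two states, two actions, the arrays `P` (transition probabilities), `R` (expected rewards), `pi` (the policy), and the discount factor.

```python
import numpy as np

# Illustrative 2-state, 2-action MDP (all values assumed for this sketch).
P = np.array([[[0.8, 0.2], [0.1, 0.9]],    # P[s, a, s'] = Pr(s' | s, a)
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],                  # R[s, a] = expected immediate reward
              [0.5, 2.0]])
pi = np.array([[0.5, 0.5],                 # pi[s, a] = Pr(a | s)
               [0.9, 0.1]])
gamma = 0.9

def bellman_expectation_backup_v(V):
    """Apply the Bellman expectation equation once to every state."""
    q = R + gamma * P @ V          # expected reward + discounted next-state value, per (s, a)
    return (pi * q).sum(axis=1)    # average over actions according to the policy

V = np.zeros(2)
for _ in range(100):               # repeated backups drive the estimate toward V^pi
    V = bellman_expectation_backup_v(V)
print(V)
```

Repeating the backup like this moves the estimate toward the fixed point $V^\pi$ of the equation, which is the idea behind iterative policy evaluation.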
Similarly, we can define the action-value function, $Q^\pi(s, a)$, which represents the expected return starting from state $s$, taking action $a$, and then following policy $\pi$:

$$Q^\pi(s, a) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\Big|\, S_t = s, A_t = a\right]$$
The Bellman expectation equation for $Q^\pi(s, a)$ is:

$$Q^\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)\left[\, r + \gamma \sum_{a'} \pi(a'|s')\, Q^\pi(s', a') \,\right]$$
This equation states that the value of taking action $a$ in state $s$ is the expected immediate reward plus the expected discounted value of the next state-action pair $(s', a')$, averaged over all possible next actions $a'$ chosen according to policy $\pi$ in the next state $s'$.
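The same kind of sketch applies to $Q^\pi$. Using the same assumed `P`, `R`, `pi`, and `gamma` as in the previous example (again purely illustrative), one backup averages the next state's action values under the policy.

```python
import numpy as np

# Same illustrative MDP as in the previous sketch (assumed values).
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
pi = np.array([[0.5, 0.5],
               [0.9, 0.1]])
gamma = 0.9

def bellman_expectation_backup_q(Q):
    """Q(s,a) <- R(s,a) + gamma * sum_{s'} P(s'|s,a) * sum_{a'} pi(a'|s') Q(s',a')."""
    next_v = (pi * Q).sum(axis=1)   # policy-averaged value of each next state
    return R + gamma * P @ next_v

Q = np.zeros((2, 2))
for _ in range(100):                # repeated backups drive the estimate toward Q^pi
    Q = bellman_expectation_backup_q(Q)
print(Q)
```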
Notice the direct relationship between $V^\pi$ and $Q^\pi$:

$$V^\pi(s) = \sum_{a} \pi(a|s)\, Q^\pi(s, a)$$
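As a quick numerical check of this identity, here is a two-line computation on assumed $Q^\pi$ and $\pi$ tables (the numbers are made up for illustration).

```python
import numpy as np

Q_pi = np.array([[3.0, 1.0],        # assumed Q^pi(s, a) values
                 [0.5, 4.0]])
pi = np.array([[0.5, 0.5],          # assumed policy pi(a | s)
               [0.9, 0.1]])

# V^pi(s) = sum_a pi(a|s) Q^pi(s, a): a policy-weighted average of action values.
V_pi = (pi * Q_pi).sum(axis=1)
print(V_pi)                         # [2.0, 0.85]
```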
For a given policy $\pi$ and known dynamics $p(s', r \mid s, a)$, the Bellman expectation equations define a system of linear equations. If the state space is small enough, we could solve this system directly to find the value function $V^\pi$ or $Q^\pi$ corresponding to the policy $\pi$.
Recursive calculation for $Q^\pi(s, a)$. The value depends on the expected immediate reward and the discounted expected value of the next state-action pair $(s', a')$, averaged over the environment dynamics and the policy's choices in the next state.
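For a small tabular MDP, the direct solve mentioned above is a single linear-algebra call. The sketch below reuses the assumed `P`, `R`, `pi`, and `gamma` from the earlier examples; `P_pi` and `r_pi` are hypothetical names for the policy-averaged dynamics and rewards.

```python
import numpy as np

# Same illustrative 2-state, 2-action MDP as in the earlier sketches (assumed values).
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
pi = np.array([[0.5, 0.5],
               [0.9, 0.1]])
gamma = 0.9

# Policy-averaged dynamics and rewards:
# P_pi[s, s'] = sum_a pi(a|s) P[s, a, s'],   r_pi[s] = sum_a pi(a|s) R[s, a]
P_pi = np.einsum('sa,sax->sx', pi, P)
r_pi = (pi * R).sum(axis=1)

# Bellman expectation equation in matrix form: V = r_pi + gamma * P_pi @ V,
# i.e. (I - gamma * P_pi) V = r_pi, a linear system with one exact solution.
V_pi = np.linalg.solve(np.eye(len(r_pi)) - gamma * P_pi, r_pi)
print(V_pi)
```

The exact solve and the repeated backups from the first sketch converge to the same $V^\pi$; the linear system just gets there in one step, at the cost of a matrix solve that is only practical for small state spaces.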
While evaluating a given policy is useful, the ultimate objective in many RL settings is to find an optimal policy, denoted $\pi^*$, that achieves the highest possible expected return from any starting state. We define the optimal state-value function $V^*(s)$ and the optimal action-value function $Q^*(s, a)$ as:

$$V^*(s) = \max_{\pi} V^\pi(s) \qquad\qquad Q^*(s, a) = \max_{\pi} Q^\pi(s, a)$$
The Bellman optimality equations define the value of the optimal policy. Unlike the expectation equations, they incorporate a maximization step, reflecting the fact that an optimal policy must choose the best action in each state.
The Bellman optimality equation for $V^*(s)$ is:

$$V^*(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a)\left[\, r + \gamma V^*(s') \,\right]$$
This equation states that the optimal value of a state $s$ is obtained by choosing the action $a$ that maximizes the expected sum of the immediate reward and the discounted optimal value of the resulting next state $s'$.
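Turning the equation into an update rule gives one optimality backup per state; repeating it is essentially the idea behind Value Iteration, mentioned later. As before, the MDP arrays below are assumptions for illustration, and note that no policy is needed here.

```python
import numpy as np

# Same illustrative MDP as earlier (assumed values).
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma = 0.9

def bellman_optimality_backup_v(V):
    """V(s) <- max_a sum_{s', r} p(s', r | s, a) [r + gamma V(s')]."""
    q = R + gamma * P @ V      # value of every (s, a) under the current estimate
    return q.max(axis=1)       # greedy maximization replaces the policy average

V = np.zeros(2)
for _ in range(200):           # repeated backups drive the estimate toward V*
    V = bellman_optimality_backup_v(V)
print(V)
```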
The Bellman optimality equation for $Q^*(s, a)$ is often more convenient for decision-making:

$$Q^*(s, a) = \sum_{s', r} p(s', r \mid s, a)\left[\, r + \gamma \max_{a'} Q^*(s', a') \,\right]$$
Here, the optimal value of taking action $a$ in state $s$ is the expected immediate reward plus the expected discounted value of the best possible action $a'$ available in the next state $s'$. The term $\max_{a'} Q^*(s', a')$ is equivalent to $V^*(s')$, highlighting the connection:

$$V^*(s) = \max_{a} Q^*(s, a)$$
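The corresponding backup for $Q^*$ looks like this; the final lines also show the $V^*(s) = \max_a Q^*(s, a)$ connection numerically. Again, every array and number is an assumption made for the sketch.

```python
import numpy as np

# Same illustrative MDP as earlier (assumed values).
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma = 0.9

def bellman_optimality_backup_q(Q):
    """Q(s,a) <- R(s,a) + gamma * sum_{s'} P(s'|s,a) * max_{a'} Q(s',a')."""
    next_v = Q.max(axis=1)      # max_{a'} Q(s', a'), i.e. the current estimate of V*(s')
    return R + gamma * P @ next_v

Q = np.zeros((2, 2))
for _ in range(200):            # repeated backups drive the estimate toward Q*
    Q = bellman_optimality_backup_q(Q)
print(Q)                        # estimate of Q*(s, a)
print(Q.max(axis=1))            # V*(s) = max_a Q*(s, a)
```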
Crucially, the Bellman optimality equations form a system of non-linear equations due to the $\max$ operator. Solving this system yields the optimal value functions $V^*$ and $Q^*$. Once we have $Q^*$, finding the optimal policy is straightforward: in any state $s$, act greedily with respect to $Q^*$. That is, choose the action $a$ that maximizes $Q^*(s, a)$:

$$\pi^*(s) = \arg\max_{a} Q^*(s, a)$$
(If multiple actions share the maximum value, any of them can be chosen.)
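Extracting the greedy policy from a $Q^*$ table is a single argmax per state. The table below is assumed for illustration; the middle state shows a tie, where either action is equally optimal.

```python
import numpy as np

# Assumed optimal action values Q*[s, a] for 3 states and 2 actions (illustrative only).
Q_star = np.array([[1.2, 3.4],
                   [2.0, 2.0],    # tie: both actions are optimal
                   [5.1, 0.3]])

# Act greedily with respect to Q*: pi*(s) = argmax_a Q*(s, a).
# np.argmax breaks ties by returning the first maximizing index, which is fine
# because any maximizing action is optimal.
pi_star = Q_star.argmax(axis=1)
print(pi_star)                    # [1 0 0]
```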
These optimality equations don't directly tell us how to find $V^*$ or $Q^*$, but they define the conditions that the optimal values must satisfy. Algorithms like Value Iteration and Policy Iteration (discussed next) leverage these equations to compute optimal policies in settings where the model is known. For model-free settings, methods like Q-Learning approximate $Q^*$ iteratively based on experience. Understanding the Bellman equations is therefore essential for grasping nearly all fundamental reinforcement learning algorithms.