To act effectively, a reinforcement learning agent needs a way to evaluate how good a particular situation is. This is where value functions come in. They quantify the expected long-term return (cumulative discounted reward) an agent can achieve. As briefly mentioned, there are two primary types of value functions we rely on: the state-value function $V(s)$ and the action-value function $Q(s,a)$.
The state-value function, denoted $V^\pi(s)$, represents the expected return if the agent starts in state $s$ and follows a specific policy $\pi$ thereafter. Think of it as answering the question: "How good is it to be in state $s$ if I follow policy $\pi$?"
Mathematically, it is defined as:

$$V^\pi(s) = \mathbb{E}_\pi\left[G_t \mid S_t = s\right] = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right]$$

Here:
- $\mathbb{E}_\pi[\cdot]$ denotes the expected value when the agent selects actions according to policy $\pi$.
- $G_t$ is the return: the cumulative discounted reward from time step $t$ onward.
- $\gamma$ is the discount factor and $R_{t+k+1}$ is the reward received $k+1$ steps after time $t$.
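To make the return concrete, here is a minimal Python sketch (the reward sequences and discount factor are made up for illustration) that computes the discounted return of a finite episode and averages the returns of several rollouts starting from $s$, which is a simple Monte Carlo estimate of the expectation above:

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Compute G_t = sum_k gamma^k * R_{t+k+1} for a finite reward sequence."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

# Averaging returns of episodes that all started in state s gives a
# Monte Carlo estimate of V^pi(s) = E_pi[G_t | S_t = s].
episode_rewards = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]  # hypothetical rollouts from s
v_estimate = np.mean([discounted_return(r) for r in episode_rewards])
print(v_estimate)
```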
The action-value function, $Q^\pi(s,a)$, takes things a step further. It represents the expected return if the agent starts in state $s$, takes a specific action $a$, and then follows policy $\pi$ for all subsequent steps. It answers the question: "How good is it to take action $a$ from state $s$, and then follow policy $\pi$?"
Its definition is:

$$Q^\pi(s,a) = \mathbb{E}_\pi\left[G_t \mid S_t = s, A_t = a\right] = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s, A_t = a\right]$$

The Q-function is particularly useful for decision-making. If the agent knows the Q-value of every action available in a given state $s$, it can simply choose the action with the highest Q-value to act optimally (or at least, greedily with respect to its current estimates).
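For instance, with a tabular Q-function stored as a NumPy array (the values below are made up), acting greedily in a state is just an argmax over the action dimension:

```python
import numpy as np

# Hypothetical Q-table: rows index states, columns index actions.
Q = np.array([
    [0.1, 0.5, 0.3],   # Q(s=0, a=0..2)
    [0.7, 0.2, 0.4],   # Q(s=1, a=0..2)
])

state = 1
greedy_action = int(np.argmax(Q[state]))  # pick the action with the highest estimate
print(greedy_action)  # -> 0
```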
Value functions adhere to recursive relationships known as the Bellman equations. These equations decompose the value of a state or state-action pair into the immediate reward received plus the discounted value of the successor state(s).
The Bellman expectation equation for $V^\pi(s)$ relates the value of state $s$ to the expected values of the next states, given the policy $\pi$:

$$V^\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma V^\pi(s')\right]$$

Let's break this down:
- $\pi(a \mid s)$ is the probability of selecting action $a$ in state $s$ under policy $\pi$.
- $p(s', r \mid s, a)$ is the environment's dynamics: the probability of transitioning to state $s'$ and receiving reward $r$ after taking action $a$ in state $s$.
- $r + \gamma V^\pi(s')$ is the immediate reward plus the discounted value of the successor state.
The equation essentially says that the value of state s under policy π is the average (over actions taken according to π, and transitions according to the environment) of the immediate reward plus the discounted value of whatever state comes next.
Similarly, the Bellman expectation equation for $Q^\pi(s,a)$ relates the value of taking action $a$ in state $s$ to the expected values of the next state-action pairs:

$$Q^\pi(s,a) = \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma \sum_{a'} \pi(a' \mid s') Q^\pi(s', a')\right]$$

This can be written more compactly by recognizing the inner sum as the definition of $V^\pi(s')$:

$$Q^\pi(s,a) = \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma V^\pi(s')\right]$$

This equation states that the value of taking action $a$ in state $s$ is the expected immediate reward plus the expected discounted value of the next state, assuming the agent continues to follow policy $\pi$ from that next state onward.
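To see how these sums translate into code, here is a minimal sketch of one expectation backup on a tiny two-state MDP; the dynamics dictionary p, the policy pi, and all the numbers are made up for illustration, and the two functions mirror the two equations above:

```python
import numpy as np

# Tiny made-up MDP: p[(s, a)] is a list of (probability, next_state, reward) tuples.
p = {
    (0, 0): [(1.0, 1, 0.0)],
    (0, 1): [(0.5, 1, 1.0), (0.5, 0, 0.0)],
    (1, 0): [(1.0, 1, 0.0)],
    (1, 1): [(1.0, 0, 2.0)],
}
pi = {0: [0.5, 0.5], 1: [0.9, 0.1]}  # pi[s][a] = probability of action a in state s
gamma = 0.9
V = np.zeros(2)                      # current estimate of V^pi

def q_backup(s, a):
    """Bellman expectation backup for Q^pi(s, a): E[r + gamma * V^pi(s')]."""
    return sum(prob * (r + gamma * V[s_next]) for prob, s_next, r in p[(s, a)])

def v_backup(s):
    """Bellman expectation backup for V^pi(s): average of q_backup over pi(.|s)."""
    return sum(pi[s][a] * q_backup(s, a) for a in range(2))

print(v_backup(0), q_backup(0, 1))
```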
The following diagram illustrates this recursive relationship described by the Bellman expectation equations:
This diagram shows how the value of the current state $V^\pi(s)$ depends on the action-values $Q^\pi(s,a)$ according to the policy $\pi$. Each action-value, in turn, depends on the expected value of the next state $V^\pi(s')$ reached after taking the action and receiving a reward, weighted by the environment's transition probabilities.
These expectation equations are fundamental for policy evaluation, which is the process of finding the value functions for a given policy.
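As an illustration, iterative policy evaluation simply sweeps the Bellman expectation backup over all states until the values stop changing. The sketch below redefines the same made-up dynamics and policy used in the previous example:

```python
import numpy as np

# Same tiny made-up MDP and policy as in the previous sketch.
p = {
    (0, 0): [(1.0, 1, 0.0)], (0, 1): [(0.5, 1, 1.0), (0.5, 0, 0.0)],
    (1, 0): [(1.0, 1, 0.0)], (1, 1): [(1.0, 0, 2.0)],
}
pi = {0: [0.5, 0.5], 1: [0.9, 0.1]}

def policy_evaluation(p, pi, n_states=2, n_actions=2, gamma=0.9, tol=1e-8):
    """Repeatedly apply the Bellman expectation backup until V^pi converges."""
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            v_new = sum(
                pi[s][a] * sum(prob * (r + gamma * V[s_next])
                               for prob, s_next, r in p[(s, a)])
                for a in range(n_actions)
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V

print(policy_evaluation(p, pi))
```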
While the expectation equations help evaluate a given policy, our ultimate goal in RL is usually to find the best policy, the one that maximizes the expected return from any starting state. This is the optimal policy, denoted $\pi^*$.
The value functions corresponding to this optimal policy are the optimal state-value function, $V^*(s)$, and the optimal action-value function, $Q^*(s,a)$.
These optimal value functions satisfy the Bellman optimality equations. Unlike the expectation equations, which involve averaging over the policy's actions, the optimality equations involve a maximization over actions.
The Bellman optimality equation for $V^*(s)$ is:

$$V^*(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma V^*(s')\right]$$

This equation says that the value of a state under an optimal policy must equal the expected return for the best action available from that state.
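In code, this replaces the policy-weighted average from the expectation backup with a max over actions. A small sketch, assuming the same (s, a) → list of (probability, next_state, reward) dynamics format used in the earlier examples:

```python
def v_star_backup(s, V, p, n_actions, gamma=0.9):
    """Bellman optimality backup for V*(s): max over actions of E[r + gamma * V*(s')]."""
    return max(
        sum(prob * (r + gamma * V[s_next]) for prob, s_next, r in p[(s, a)])
        for a in range(n_actions)
    )
```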
The Bellman optimality equation for $Q^*(s,a)$ is:

$$Q^*(s,a) = \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma \max_{a'} Q^*(s', a')\right]$$

This equation states that the optimal value of taking action $a$ in state $s$ is the expected immediate reward plus the discounted maximum Q-value achievable from the next state $s'$. The $\max_{a'}$ reflects the fact that, starting from state $s'$, the agent will again choose the best possible action under the optimal policy.
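The corresponding backup for a Q-table replaces the next-state value with a max over the next state's action-values. A minimal sketch, again assuming the same made-up dynamics format:

```python
def q_star_backup(s, a, Q, p, gamma=0.9):
    """Bellman optimality backup for Q*(s, a): E[r + gamma * max_a' Q*(s', a')]."""
    return sum(prob * (r + gamma * max(Q[s_next]))
               for prob, s_next, r in p[(s, a)])
```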
If we know the optimal action-value function $Q^*(s,a)$, we can easily determine the optimal policy $\pi^*$. In any state $s$, the optimal policy simply chooses the action $a$ that maximizes $Q^*(s,a)$:

$$\pi^*(s) = \arg\max_{a} Q^*(s,a)$$

This is a greedy policy with respect to the optimal Q-function.
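With a tabular estimate of $Q^*$ (the array below is hypothetical), extracting this greedy policy is a single argmax per state:

```python
import numpy as np

# Hypothetical optimal Q-table: rows are states, columns are actions.
Q_star = np.array([
    [0.2, 0.8],
    [1.5, 0.4],
])

pi_star = np.argmax(Q_star, axis=1)  # pi*(s) = argmax_a Q*(s, a)
print(pi_star)                       # -> [1 0]
```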
Solving the Bellman optimality equations is, directly or indirectly, the goal of many RL algorithms. Methods like Value Iteration work by iteratively applying the Bellman optimality update for $V^*$. Crucially for the methods we'll explore later, Q-Learning uses sampled transitions $(s, a, r, s')$ to iteratively approximate $Q^*(s,a)$ based on the Bellman optimality equation for $Q^*$.
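As a preview, a single tabular Q-learning update from one sampled transition $(s, a, r, s')$ looks like the following sketch; the table size, learning rate, and transition are made up, and the update target is the sample-based analogue of the optimality backup above:

```python
import numpy as np

n_states, n_actions = 4, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.9  # learning rate and discount factor

def q_learning_update(Q, s, a, r, s_next):
    """Move Q(s, a) toward the sampled target r + gamma * max_a' Q(s', a')."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

q_learning_update(Q, s=0, a=1, r=1.0, s_next=2)
print(Q[0, 1])  # -> 0.1
```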
Understanding these value functions and the Bellman equations that govern them is essential. They provide the theoretical foundation for many RL algorithms, including the tabular methods we'll briefly review next and the more advanced function approximation techniques that form the core of this course.