For successful Reinforcement Learning, an agent needs a way to assess how good a given situation is. Value functions serve this purpose: they quantify the expected long-term return (cumulative discounted reward) an agent can achieve. There are two primary types: the state-value function $V(s)$ and the action-value function $Q(s, a)$.

## The State-Value Function: $V^\pi(s)$

The state-value function, denoted $V^\pi(s)$, represents the expected return if the agent starts in state $s$ and follows a specific policy $\pi$ thereafter. Think of it as answering the question: "How good is it to be in state $s$ if I follow policy $\pi$?"

Mathematically, it is defined as:

$$ V^\pi(s) = \mathbb{E}_\pi [ G_t | S_t = s ] = \mathbb{E}_\pi \left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \Big| S_t = s \right] $$

Here:

- $G_t$ is the total discounted return starting from time step $t$.
- $\gamma$ is the discount factor ($0 \le \gamma \le 1$), which determines the present value of future rewards. A $\gamma$ close to 0 makes the agent "myopic" (focused on immediate rewards), while a $\gamma$ close to 1 makes it "farsighted" (valuing future rewards highly).
- $\mathbb{E}_\pi[\cdot]$ denotes the expected value, assuming the agent follows policy $\pi$. This expectation accounts for the randomness in actions chosen by the policy (if it is stochastic) and the randomness in the environment's transitions and rewards.

## The Action-Value Function: $Q^\pi(s, a)$

The action-value function, $Q^\pi(s, a)$, takes things a step further. It represents the expected return if the agent starts in state $s$, takes a specific action $a$, and then follows policy $\pi$ for all subsequent steps. It answers the question: "How good is it to take action $a$ from state $s$, and then follow policy $\pi$?"

Its definition is:

$$ Q^\pi(s, a) = \mathbb{E}_\pi [ G_t | S_t = s, A_t = a ] = \mathbb{E}_\pi \left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \Big| S_t = s, A_t = a \right] $$

The $Q$-function is particularly useful for decision-making. If the agent knows the $Q$-value for all possible actions in a given state $s$, it can simply choose the action with the highest $Q$-value to act optimally (or at least, greedily with respect to its current estimates).
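To make these definitions concrete, here is a minimal Python sketch (the reward sequence and the tiny $Q$-table are invented for illustration, not taken from the text) showing how a discounted return $G_t$ is computed from a sequence of rewards and how an agent acts greedily with respect to a $Q$-table:

```python
def discounted_return(rewards, gamma=0.99):
    """G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ... (accumulated backwards)."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def greedy_action(q_values, state):
    """Pick the action with the highest Q(s, a) estimate in the given state."""
    return max(q_values[state], key=q_values[state].get)

# Tiny illustration with made-up numbers.
print(discounted_return([1.0, 0.0, 0.0, 10.0], gamma=0.9))  # 1.0 + 0.9**3 * 10 = 8.29
q_values = {"s0": {"left": 0.2, "right": 1.3}}
print(greedy_action(q_values, "s0"))                         # "right"
```

Accumulating the return backwards avoids recomputing powers of $\gamma$ for every term.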
## Bellman Expectation Equations

Value functions satisfy recursive relationships known as the Bellman equations. These equations decompose the value of a state or state-action pair into the immediate reward received plus the discounted value of the successor state(s).

The Bellman expectation equation for $V^\pi(s)$ relates the value of state $s$ to the expected values of the next states, given the policy $\pi$:

$$ V^\pi(s) = \sum_a \pi(a|s) \sum_{s', r} p(s', r | s, a) [r + \gamma V^\pi(s')] $$

Let's break this down:

- $\pi(a|s)$: The probability of taking action $a$ in state $s$ under policy $\pi$.
- $p(s', r | s, a)$: The probability of transitioning to state $s'$ and receiving reward $r$, given state $s$ and action $a$. This defines the environment's dynamics.
- $r + \gamma V^\pi(s')$: The immediate reward $r$ plus the discounted value of the next state $s'$.

The equation essentially says that the value of state $s$ under policy $\pi$ is the average, over actions taken according to $\pi$ and transitions according to the environment, of the immediate reward plus the discounted value of whatever state comes next.

Similarly, the Bellman expectation equation for $Q^\pi(s, a)$ relates the value of taking action $a$ in state $s$ to the expected value of the next state-action pairs:

$$ Q^\pi(s, a) = \sum_{s', r} p(s', r | s, a) \left[ r + \gamma \sum_{a'} \pi(a'|s') Q^\pi(s', a') \right] $$

This can be written more compactly by recognizing that the inner sum over $a'$ is exactly $V^\pi(s')$:

$$ Q^\pi(s, a) = \sum_{s', r} p(s', r | s, a) [r + \gamma V^\pi(s')] $$

This equation states that the value of taking action $a$ in state $s$ is the expected immediate reward plus the expected discounted value of the next state, assuming the agent continues to follow policy $\pi$ from that next state onwards.

The following diagram illustrates the recursive relationship described by the Bellman expectation equations:

```dot
digraph BellmanExpectation {
    rankdir=TB;
    node [shape=record, style=filled, fillcolor="#e9ecef", fontname="Helvetica"];
    edge [fontname="Helvetica", fontsize=10];

    current_state [label="{State s | Value Vπ(s)}", fillcolor="#a5d8ff"];
    action_layer [shape=point, width=0];  // Invisible node for branching
    current_state -> action_layer [label="Follow policy π(a|s)"];

    action1 [label="{Action a1 | Value Qπ(s, a1)}"];
    action2 [label="{Action a2 | Value Qπ(s, a2)}"];
    action_dots [label="...", shape=plaintext];
    action_layer -> action1 [style=invis];  // Invisible edges for layout
    action_layer -> action2 [style=invis];
    action_layer -> action_dots [style=invis];

    next_state_layer1 [shape=point, width=0];
    next_state_layer2 [shape=point, width=0];
    action1 -> next_state_layer1 [label="Env dynamics p(s', r | s, a1)"];
    action2 -> next_state_layer2 [label="Env dynamics p(s', r | s, a2)"];

    next_state1_a1 [label="{State s' | Future: r + γVπ(s')}"];
    next_state2_a1 [label="{State s'' | Future: r + γVπ(s'')}"];
    next_state_dots_a1 [label="...", shape=plaintext];
    next_state1_a2 [label="{State s''' | Future: r + γVπ(s''')}"];
    next_state_dots_a2 [label="...", shape=plaintext];
    next_state_layer1 -> next_state1_a1 [style=invis];
    next_state_layer1 -> next_state2_a1 [style=invis];
    next_state_layer1 -> next_state_dots_a1 [style=invis];
    next_state_layer2 -> next_state1_a2 [style=invis];
    next_state_layer2 -> next_state_dots_a2 [style=invis];

    // Position hints
    { rank = same; current_state }
    { rank = same; action1; action2; action_dots }
    { rank = same; next_state1_a1; next_state2_a1; next_state_dots_a1; next_state1_a2; next_state_dots_a2 }

    // Connections showing calculation
    subgraph cluster_V {
        label = "Calculation of Vπ(s)";
        style = invis;
        V_calc [label="Vπ(s) = Σa π(a|s) Qπ(s, a)", shape=plaintext, fontsize=11];
        current_state -> V_calc [style=invis];
        V_calc -> action1 [style=dashed, arrowhead=none];
        V_calc -> action2 [style=dashed, arrowhead=none];
    }
    subgraph cluster_Q1 {
        label = "Calculation of Qπ(s, a1)";
        style = invis;
        Q1_calc [label="Qπ(s, a1) = Σs',r p(s',r|s,a1) [r + γVπ(s')]", shape=plaintext, fontsize=11];
        action1 -> Q1_calc [style=invis];
        Q1_calc -> next_state1_a1 [style=dashed, arrowhead=none];
        Q1_calc -> next_state2_a1 [style=dashed, arrowhead=none];
    }
    subgraph cluster_Q2 {
        label = "Calculation of Qπ(s, a2)";
        style = invis;
        Q2_calc [label="Qπ(s, a2) = Σs',r p(s',r|s,a2) [r + γVπ(s')]", shape=plaintext, fontsize=11];
        action2 -> Q2_calc [style=invis];
        Q2_calc -> next_state1_a2 [style=dashed, arrowhead=none];
    }
}
```

This diagram shows how the value of the current state $V^\pi(s)$ depends on the action-values $Q^\pi(s, a)$ according to the policy $\pi$. Each action-value, in turn, depends on the expected value of the next state $V^\pi(s')$ reached after taking the action and receiving a reward, weighted by the environment's transition probabilities.

These expectation equations are fundamental for policy evaluation, the process of finding the value functions for a given policy.
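As a sketch of what policy evaluation looks like in code, the snippet below repeatedly applies the Bellman expectation equation for $V^\pi$ to a small hypothetical MDP until the values stop changing. The transition table `P`, the policy `pi`, and the rewards are invented purely for illustration:

```python
# Hypothetical 2-state, 2-action MDP, purely for illustration.
# P[s][a] is a list of (prob, next_state, reward) triples; pi[s][a] is pi(a|s).
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
pi = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.1, 1: 0.9}}
gamma = 0.9

def policy_evaluation(P, pi, gamma, tol=1e-8):
    """Iteratively apply the Bellman expectation equation until V^pi converges."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # V(s) <- sum_a pi(a|s) sum_{s',r} p(s',r|s,a) [r + gamma * V(s')]
            v_new = sum(
                pi[s][a] * sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V

print(policy_evaluation(P, pi, gamma))
```

Because each sweep is a $\gamma$-contraction (for $\gamma < 1$), the values converge to the unique fixed point $V^\pi$.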
[r + γVπ(s')]", shape=plaintext, fontsize=11]; action2 -> Q2_calc [style=invis]; Q2_calc -> next_state1_a2 [style=dashed, arrowhead=none]; } }This diagram shows how the value of the current state $V^\pi(s)$ depends on the action-values $Q^\pi(s, a)$ according to the policy $\pi$. Each action-value, in turn, depends on the expected value of the next state $V^\pi(s')$ reached after taking the action and receiving a reward, weighted by the environment's transition probabilities.These expectation equations are fundamental for policy evaluation, which is the process of finding the value functions for a given policy.Bellman Optimality EquationsWhile the expectation equations help evaluate a given policy, our ultimate goal in RL is usually to find the best policy, the one that maximizes the expected return from any starting state. This is the optimal policy, denoted $\pi^*$.The value functions corresponding to this optimal policy are the optimal state-value function, $V^(s)$, and the optimal action-value function, $Q^(s, a)$.$V^*(s) = \max_{\pi} V^\pi(s)$$Q^*(s, a) = \max_{\pi} Q^\pi(s, a)$These optimal value functions satisfy the Bellman optimality equations. Unlike the expectation equations, which involve averaging over the policy's actions, the optimality equations involve a maximization over actions.The Bellman optimality equation for $V^*(s)$ is: $$ V^(s) = \max_a \sum_{s', r} p(s', r | s, a) [r + \gamma V^(s')] $$ This equation says that the value of a state under an optimal policy must equal the expected return for the best action possible from that state.The Bellman optimality equation for $Q^*(s, a)$ is: $$ Q^(s, a) = \sum_{s', r} p(s', r | s, a) [r + \gamma \max_{a'} Q^(s', a')] $$ This equation states that the optimal value of taking action $a$ in state $s$ is the expected immediate reward plus the discounted maximum Q-value achievable from the next state $s'$. The $\max_{a'}$ reflects the fact that, starting from state $s'$, the agent will again choose the best possible action according to the optimal policy.If we know the optimal action-value function $Q^(s, a)$, we can easily determine the optimal policy $\pi^$. In any state $s$, the optimal policy simply chooses the action $a$ that maximizes $Q^(s, a)$: $$ \pi^(s) = \arg\max_a Q^*(s, a) $$ This is a greedy policy with respect to the optimal Q-function.Solving the Bellman optimality equations directly is often the goal of RL algorithms. Methods like Value Iteration work by iteratively applying the Bellman optimality update for $V^$. Crucially for the methods we'll explore later, Q-Learning uses samples $(s, a, r, s')$ to iteratively approximate $Q^(s, a)$ based on the Bellman optimality equation for $Q^*$.Understanding these value functions and the Bellman equations that govern them is essential. They provide the theoretical foundation for many RL algorithms, including the tabular methods we'll briefly review next and the more advanced function approximation techniques that form the core of this course.