Okay, we've established the Markov Decision Process (MDP) as our framework, defining the environment's states (S), the agent's possible actions (A), the rules of state transitions (P), immediate rewards (R), and the discount factor (γ). Now, how does an agent actually decide what to do in this environment? And how do we measure how good those decisions are in the long run? This is where policies and value functions come into play.
A policy, denoted by π, is the agent's strategy or behavior. It dictates which action the agent will choose when it finds itself in a particular state. Think of it as the agent's controller or decision-making function.
Policies can be categorized into two main types:
Deterministic Policies: For any given state s, a deterministic policy always specifies a single action a. We write this as:
$$\pi(s) = a$$
If the agent is in state s, it will take action a.
Stochastic Policies: In contrast, a stochastic policy specifies a probability distribution over actions for each state. We write this as:
$$\pi(a \mid s) = P(A_t = a \mid S_t = s)$$
This gives the probability that the agent selects action a when in state s. The probabilities over all possible actions in a state must sum to 1: $\sum_{a \in \mathcal{A}(s)} \pi(a \mid s) = 1$.
Why use stochastic policies? They are particularly useful in several scenarios: they let the agent keep exploring by trying different actions from the same state, they can handle partially observable problems where the same observation may call for different actions, and they avoid the predictability of a deterministic strategy that an adversary could exploit.
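To make the distinction concrete, here is a minimal Python sketch (the state and action names are hypothetical) that represents each policy type as a dictionary and samples an action from it:

```python
import random

# Deterministic policy: pi(s) = a, exactly one action per state.
deterministic_policy = {
    "s1": "right",
    "s2": "left",
}

# Stochastic policy: pi(a | s), a probability distribution over actions per state.
stochastic_policy = {
    "s1": {"left": 0.2, "right": 0.8},
    "s2": {"left": 0.5, "right": 0.5},
}

def select_action(policy, state):
    """Sample an action from either policy representation."""
    rule = policy[state]
    if isinstance(rule, str):               # deterministic case: a single action
        return rule
    actions, probs = zip(*rule.items())     # stochastic case: sample by probability
    return random.choices(actions, weights=probs, k=1)[0]

print(select_action(deterministic_policy, "s1"))  # always "right"
print(select_action(stochastic_policy, "s1"))     # "right" about 80% of the time
```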
The policy is fundamental because it defines how the agent interacts with the MDP. Given a policy π, the agent's sequence of states, actions, and rewards becomes a stochastic process determined by both the environment's dynamics (P) and the agent's choices (π).
Knowing the agent's strategy (π) is one thing, but how do we know if it's a good strategy? We need a way to quantify the long-term value of following a policy. This is the role of value functions. Value functions estimate the expected cumulative future discounted reward, also known as the expected return, under a specific policy.
There are two primary types of value functions:
The state-value function, Vπ(s), measures the expected return if the agent starts in state s and follows policy π thereafter. It tells us "how good" it is to be in state s under policy π.
Formally, Vπ(s) is defined as:
$$V^\pi(s) = \mathbb{E}_\pi\left[G_t \mid S_t = s\right] = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \mid S_t = s\right]$$
Let's break this down: $\mathbb{E}_\pi[\cdot]$ denotes the expectation when actions are chosen according to π, $G_t$ is the return (the cumulative discounted reward from time step t onward), $R_{t+k+1}$ is the reward received k+1 steps into the future, and γ is the discount factor introduced earlier.
So, Vπ(s) averages the total discounted rewards over all possible trajectories that could start from state s when following policy π. A higher Vπ(s) indicates that state s is more desirable under policy π.
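One way to make this expectation concrete is to estimate it by sampling: roll out many episodes that start in state s, follow π, compute each discounted return, and average them. The sketch below illustrates the definition under assumed helpers rather than any particular library's API: `env.reset_to(state)`, `env.step(action)`, and a callable `policy` that samples an action are all hypothetical.

```python
def estimate_v(env, policy, state, gamma, num_episodes=1000, max_steps=100):
    """Monte Carlo estimate of V_pi(state): average the discounted return over rollouts."""
    total_return = 0.0
    for _ in range(num_episodes):
        s = env.reset_to(state)        # hypothetical helper: start an episode in `state`
        g, discount = 0.0, 1.0
        for _ in range(max_steps):
            a = policy(s)              # action sampled from the policy pi
            s, r, done = env.step(a)   # hypothetical transition: next state, reward, done flag
            g += discount * r          # accumulate gamma^k * R_{t+k+1}
            discount *= gamma
            if done:
                break
        total_return += g
    return total_return / num_episodes  # the sample mean approximates the expectation
```

As the number of sampled episodes grows, this average converges toward Vπ(s).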
The action-value function, Qπ(s,a), measures the expected return if the agent starts in state s, takes a specific action a (just for the first step), and then follows policy π for all subsequent decisions. It tells us "how good" it is to take action a from state s, and then continue with policy π.
Formally, Qπ(s,a) is defined as:
$$Q^\pi(s,a) = \mathbb{E}_\pi\left[G_t \mid S_t = s, A_t = a\right] = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \mid S_t = s, A_t = a\right]$$
The key difference from Vπ(s) is the additional condition $A_t = a$. We are now evaluating the consequence of taking a specific first action a before letting the policy π take over.
The action-value function Qπ(s,a) is often more directly useful for decision-making than Vπ(s). If the agent knows Qπ(s,a) for all actions a available in state s, it can simply choose the action with the highest Q-value to act greedily with respect to the current policy's valuation.
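As a small illustration of that greedy step, assuming the Q-values are stored in a hypothetical nested dictionary `q[state][action]`, the choice is just an argmax over the actions available in the state:

```python
def greedy_action(q, state):
    """Pick the action with the highest estimated Q-value in `state`."""
    # q is assumed to be a dict of dicts: q[state][action] -> estimate of Q_pi(s, a)
    return max(q[state], key=q[state].get)

q = {"s1": {"left": 0.3, "right": 1.7}}  # hypothetical Q-value estimates
print(greedy_action(q, "s1"))            # -> "right"
```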
The state-value and action-value functions are closely related.
The value of a state s under a stochastic policy π is the expected value of the Q-values for all actions possible in s, weighted by the policy's probability of choosing each action:
$$V^\pi(s) = \sum_{a \in \mathcal{A}(s)} \pi(a \mid s)\, Q^\pi(s,a)$$
This makes intuitive sense: the value of being in a state is the average value of taking each possible action from that state, according to the policy's action probabilities.
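In code, this weighted average is a short sum. The sketch below reuses the hypothetical dictionary layouts from the earlier examples: `pi[state][action]` for the policy probabilities and `q[state][action]` for the Q-value estimates.

```python
def v_from_q(pi, q, state):
    """V_pi(s) as the policy-weighted average of Q_pi(s, a) over the available actions."""
    return sum(pi[state][a] * q[state][a] for a in pi[state])

pi = {"s1": {"left": 0.2, "right": 0.8}}  # hypothetical pi(a | s)
q  = {"s1": {"left": 0.3, "right": 1.7}}  # hypothetical Q_pi(s, a) estimates
print(v_from_q(pi, q, "s1"))              # 0.2 * 0.3 + 0.8 * 1.7 = 1.42
```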
Conversely, the value of taking action a in state s and then following policy π depends on the immediate reward received and the discounted value of the next state s′ reached. We need to consider all possible next states s′ and rewards r determined by the environment's transition dynamics p(s′,r∣s,a).
$$Q^\pi(s,a) = \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma V^\pi(s')\right]$$
This equation states that the value of the state-action pair (s,a) is the expected immediate reward (r) plus the expected discounted value of the next state (γVπ(s′)), averaged over all possible outcomes (s′,r) according to the environment dynamics p(s′,r∣s,a).
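The same computation can be sketched in code, assuming the dynamics are available as a hypothetical tabular dictionary `p[(s, a)]` that maps to a list of (next_state, reward, probability) tuples:

```python
def q_from_v(p, v, state, action, gamma):
    """Q_pi(s, a) as expected immediate reward plus discounted value of the next state."""
    return sum(prob * (r + gamma * v[s_next])
               for s_next, r, prob in p[(state, action)])

# Hypothetical tabular dynamics: from s1, "right" reaches s2 with reward +1
# (probability 0.9) or stays in s1 with reward 0 (probability 0.1).
p = {("s1", "right"): [("s2", 1.0, 0.9), ("s1", 0.0, 0.1)]}
v = {"s1": 0.5, "s2": 2.0}                       # hypothetical V_pi values
print(q_from_v(p, v, "s1", "right", gamma=0.9))  # 0.9*(1 + 0.9*2.0) + 0.1*(0 + 0.9*0.5) = 2.565
```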
The following diagram helps visualize the relationship between a policy and the value functions:
A simple MDP showing states (s1, s2, s3), actions (up, right, left), transition probabilities (p), rewards (R), and a policy (π). Vπ(s) estimates the long-term value starting from state s under policy π. Qπ(s,a) estimates the value of taking action a first, then following π. For example, Qπ(s1,right) considers the immediate reward (+1) and the discounted value of the resulting state (γVπ(s2)).
These relationships between Vπ and Qπ are recursive and form the core of the Bellman equations, which we will explore in detail in the next chapter. They provide a way to break down the complex long-term value calculation into manageable, iterative steps.
Understanding policies and value functions is fundamental. Policies define what the agent does, and value functions evaluate how good those actions are in the context of achieving the overall goal: maximizing the expected cumulative reward. The ultimate objective in many RL problems is not just to evaluate a given policy, but to find an optimal policy (π∗), one that achieves the highest possible expected return from any starting state. This optimal policy will have corresponding optimal state-value (V∗) and action-value (Q∗) functions. Learning these optimal functions and policies is the central theme of the algorithms we will cover later in this course.