In the context of Markov Decision Processes (MDPs), grasping policy and value functions is pivotal for navigating the intricacies of decision-making in uncertain environments. These concepts help define how an agent behaves within an environment, aiming to optimize the rewards it accumulates over time.
A policy is a fundamental component of reinforcement learning, representing the strategy that an agent employs to decide its actions. Formally, a policy, denoted as π, is a mapping from states to probabilities of selecting each possible action. In simpler terms, given a state, the policy guides the agent on the best action to take or the probabilities of choosing among possible actions. Policies can be deterministic, where a specific action is chosen in each state, or stochastic, where actions are selected based on probability distributions.
Consider a simple example: imagine a robot navigating a grid. The robot's policy would dictate its movement, whether to move up, down, left, or right, based on its current position on the grid. A deterministic policy might always instruct the robot to move right until it reaches the goal, while a stochastic policy could assign a 70% probability to moving right and 30% to moving down, adding an element of randomness to its decision-making.
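To make the distinction concrete, here is a minimal Python sketch of both kinds of policy for the grid example above. The function names, the goal column, and the 70/30 split are illustrative assumptions, not part of any particular library.

```python
import random

# State is the robot's (row, col) position on the grid; these are the four
# moves available in each cell.
ACTIONS = ["up", "down", "left", "right"]

def deterministic_policy(state):
    """Always move right until reaching the goal column (assumed to be 3), then move down."""
    row, col = state
    return "right" if col < 3 else "down"

def stochastic_policy(state):
    """Move right with probability 0.7 and down with probability 0.3."""
    return random.choices(["right", "down"], weights=[0.7, 0.3])[0]

# Sampling actions from each policy at the same state:
state = (0, 1)
print(deterministic_policy(state))                    # always 'right'
print([stochastic_policy(state) for _ in range(5)])   # a mix of 'right' and 'down'
```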
*Figure: visualization of a grid world environment with an agent.*
The value function is another key concept that quantifies the expected cumulative reward an agent can achieve from any given state, assuming it follows a certain policy. There are two primary types of value functions: the state-value function and the action-value function.
State-Value Function (V): This function, denoted V(s), represents the expected return (the sum of discounted rewards) starting from state s and following a particular policy π thereafter. The state-value function helps the agent evaluate how good it is to be in a certain state under a given policy. It is defined as:

$$V(s) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t) \,\middle|\, s_0 = s\right]$$

where γ is the discount factor, which balances the importance of immediate versus future rewards.
Action-Value Function (Q): Also known as the Q-function, this function, denoted Q(s, a), represents the expected return of taking action a in state s and then following policy π. It is particularly useful for evaluating the quality of specific actions in given states, allowing the agent to compare candidate actions and choose the best one. It is defined as:

$$Q(s, a) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t) \,\middle|\, s_0 = s,\, a_0 = a\right]$$
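These definitions can be read directly as a simulation recipe: run many episodes under π, accumulate discounted rewards, and average. The sketch below does exactly that for a small, made-up two-state MDP; the states, rewards, and policy weights are illustrative assumptions introduced only for this example.

```python
import random

GAMMA = 0.9
STATES = ["A", "B"]
ACTIONS = ["stay", "move"]

# P[s][a] -> list of (next_state, probability); R gives R(s, a, s').
P = {
    "A": {"stay": [("A", 0.8), ("B", 0.2)], "move": [("B", 1.0)]},
    "B": {"stay": [("B", 1.0)],             "move": [("A", 0.6), ("B", 0.4)]},
}
R = {("A", "move", "B"): 5.0, ("B", "move", "A"): 1.0}  # all other rewards are 0

def policy(state):
    """A fixed stochastic policy pi(a|s): 'stay' 30% of the time, 'move' 70%."""
    return random.choices(ACTIONS, weights=[0.3, 0.7])[0]

def step(state, action):
    """Sample a next state from P(s'|s, a) and return it with the reward R(s, a, s')."""
    next_states, probs = zip(*P[state][action])
    s_next = random.choices(next_states, weights=probs)[0]
    return s_next, R.get((state, action, s_next), 0.0)

def rollout_return(state, first_action=None, horizon=200):
    """Discounted return of one episode from `state` (optionally fixing the first action)."""
    g, discount = 0.0, 1.0
    for t in range(horizon):
        action = first_action if (t == 0 and first_action) else policy(state)
        state, reward = step(state, action)
        g += discount * reward
        discount *= GAMMA
    return g

def estimate(state, action=None, episodes=5000):
    """Average the returns of many rollouts: V(s) if action is None, else Q(s, a)."""
    return sum(rollout_return(state, action) for _ in range(episodes)) / episodes

print("V(A) ~", estimate("A"))
print("Q(A, move) ~", estimate("A", "move"))
```

Fixing the first action before handing control back to π is what turns this from an estimate of V(s) into an estimate of Q(s, a).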
The Bellman equation provides a recursive decomposition of these value functions, enabling efficient computation. For the state-value function, the Bellman equation is expressed as:

$$V(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma V(s')\right]$$

This equation states that the value of a state is the expected immediate reward plus the discounted value of the next state, averaged over all possible actions and successor states.
For the action-value function, the Bellman equation is:

$$Q(s, a) = \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma \sum_{a'} \pi(a' \mid s') Q(s', a')\right]$$
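Because each value is expressed in terms of the values of successor states, the Bellman equation can be turned into a simple fixed-point iteration: repeatedly apply the right-hand side until the values stop changing. The sketch below evaluates a uniform-random policy on the same style of toy two-state MDP; all names, transition probabilities, and rewards are illustrative assumptions.

```python
GAMMA = 0.9
STATES = ["A", "B"]
ACTIONS = ["stay", "move"]

P = {  # P[s][a] -> list of (next_state, probability)
    "A": {"stay": [("A", 0.8), ("B", 0.2)], "move": [("B", 1.0)]},
    "B": {"stay": [("B", 1.0)],             "move": [("A", 0.6), ("B", 0.4)]},
}
R = {("A", "move", "B"): 5.0, ("B", "move", "A"): 1.0}  # R(s, a, s'), default 0
PI = {s: {"stay": 0.5, "move": 0.5} for s in STATES}    # pi(a|s): uniform random

def evaluate_policy(theta=1e-8):
    """Iterate V(s) <- sum_a pi(a|s) sum_s' P(s'|s,a) [R(s,a,s') + gamma V(s')] to convergence."""
    V = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            new_v = sum(
                PI[s][a] * sum(
                    p * (R.get((s, a, s2), 0.0) + GAMMA * V[s2])
                    for s2, p in P[s][a]
                )
                for a in ACTIONS
            )
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < theta:
            return V

V = evaluate_policy()

# Q(s, a) then follows directly from V via the second Bellman equation above.
Q = {
    (s, a): sum(p * (R.get((s, a, s2), 0.0) + GAMMA * V[s2]) for s2, p in P[s][a])
    for s in STATES for a in ACTIONS
}
print(V)
print(Q)
```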
These recursive relationships also lead naturally to the notion of optimality: the goal is to find a policy that maximizes the expected cumulative reward. The optimal policy π* induces the optimal state-value function V*(s) and the optimal action-value function Q*(s,a), which give the highest achievable value for each state or state-action pair.
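As a brief preview of the algorithms discussed in later chapters, replacing the average over π with a maximization over actions gives the Bellman optimality equation, V*(s) = max_a Σ_s' P(s'|s,a)[R(s,a,s') + γV*(s')], which can likewise be iterated to a fixed point; the policy that acts greedily with respect to the resulting action values is π*. The sketch below reuses the illustrative two-state MDP from the earlier examples.

```python
GAMMA = 0.9
STATES = ["A", "B"]
ACTIONS = ["stay", "move"]
P = {
    "A": {"stay": [("A", 0.8), ("B", 0.2)], "move": [("B", 1.0)]},
    "B": {"stay": [("B", 1.0)],             "move": [("A", 0.6), ("B", 0.4)]},
}
R = {("A", "move", "B"): 5.0, ("B", "move", "A"): 1.0}

def q_value(V, s, a):
    """One-step lookahead: sum_s' P(s'|s,a) [R(s,a,s') + gamma * V(s')]."""
    return sum(p * (R.get((s, a, s2), 0.0) + GAMMA * V[s2]) for s2, p in P[s][a])

# Repeatedly replace V(s) with max_a Q(s, a) until it settles (a value-iteration sweep).
V = {s: 0.0 for s in STATES}
for _ in range(1000):
    V = {s: max(q_value(V, s, a) for a in ACTIONS) for s in STATES}

# The optimal policy acts greedily with respect to the optimal action values.
pi_star = {s: max(ACTIONS, key=lambda a: q_value(V, s, a)) for s in STATES}
print(V, pi_star)
```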
Understanding and leveraging policy and value functions allow reinforcement learning agents to systematically improve their decision-making strategies. By iteratively updating policies and value functions, agents can learn to navigate complex environments, making informed decisions that maximize their long-term rewards. This iterative learning and optimization lie at the heart of many reinforcement learning algorithms, including policy iteration and value iteration, which will be explored further in subsequent chapters.