Okay, we've established that an agent interacts with an environment by observing states and taking actions to receive rewards. But how does the agent actually decide which action to take when it finds itself in a particular state? This decision-making logic is encapsulated in what we call the agent's policy. Think of the policy as the agent's strategy or its behavioral "brain."
Formally, a policy is a mapping from states to actions. It defines the agent's way of behaving at a given time. Policies can generally be categorized into two main types: deterministic and stochastic.
A deterministic policy directly specifies the action the agent will take for each state. If the agent is in state s, the policy π provides a single action a. We can write this as:
$$
a = \pi(s)
$$

For every state s in the set of all possible states S, the policy π outputs a specific action a from the set of available actions A(s). Imagine a simple robot navigating a maze. A deterministic policy might be: "If you are at position X and the path ahead is clear, move forward. If blocked, turn right." For a given state (position X, path clear), the action (move forward) is fixed.
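To make this concrete, here is a minimal sketch of a deterministic policy stored as a plain lookup table. The state and action names are hypothetical, chosen only to mirror the maze example above.

```python
# A deterministic policy as a lookup table: each state maps to exactly one action.
# The state and action names are hypothetical, for illustration only.
deterministic_policy = {
    "position_x_clear":   "move_forward",
    "position_x_blocked": "turn_right",
}

def select_action(state):
    """Return the single action the policy prescribes for this state."""
    return deterministic_policy[state]

print(select_action("position_x_clear"))    # always "move_forward"
print(select_action("position_x_blocked"))  # always "turn_right"
```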
In contrast, a stochastic policy defines a probability distribution over actions for each state. Instead of outputting a single action, it tells us the probability of taking each possible action a when in state s. We denote this as π(a∣s):
$$
\pi(a \mid s) = P[A_t = a \mid S_t = s]
$$

Here, $P[A_t = a \mid S_t = s]$ represents the probability that the action $A_t$ taken at time step $t$ is $a$, given that the state $S_t$ at time $t$ is $s$. The sum of probabilities over all possible actions in a given state must equal 1:
$$
\sum_{a \in A(s)} \pi(a \mid s) = 1 \quad \text{for all } s \in S
$$

Why use stochastic policies? They are particularly useful in several scenarios:

- **Exploration:** assigning nonzero probability to several actions lets the agent keep trying alternatives while it is still learning which actions work best.
- **Partial observability:** when different underlying situations look identical to the agent, randomizing over actions can outperform any single fixed choice.
- **Adversarial settings:** a predictable, deterministic strategy can be exploited by an opponent, as in games like rock-paper-scissors.
Here's a simple illustration contrasting the two:
A visual comparison of a deterministic policy (always choosing action A1 from state S1) versus a stochastic policy (choosing action A1 with probability 0.7 and action A2 with probability 0.3 from state S1).
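The same contrast can be expressed in code. Below is a minimal sketch of the stochastic side of that comparison, using the states, actions, and probabilities from the figure and assuming NumPy for sampling; the deterministic case from earlier would simply always return A1.

```python
import numpy as np

rng = np.random.default_rng()

# A stochastic policy: each state maps to a probability distribution over actions.
# Values mirror the figure above: from S1, choose A1 with 0.7 and A2 with 0.3.
stochastic_policy = {
    "S1": {"A1": 0.7, "A2": 0.3},
}

def sample_action(state):
    """Sample an action according to pi(a | s)."""
    actions = list(stochastic_policy[state].keys())
    probs = list(stochastic_policy[state].values())
    assert abs(sum(probs) - 1.0) < 1e-9  # probabilities must sum to 1
    return rng.choice(actions, p=probs)

# Repeated calls return "A1" about 70% of the time and "A2" about 30% of the time.
print([sample_action("S1") for _ in range(5)])
```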
The central objective in Reinforcement Learning is typically to find an optimal policy, denoted as π∗. An optimal policy is one that maximizes the expected cumulative reward the agent receives over the long run, starting from any state. Much of this course will focus on algorithms designed to learn or approximate π∗.
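Stated informally as a formula (using the discounted-return notation that later chapters define precisely), the optimal policy can be written as:

$$
\pi^* = \arg\max_{\pi} \; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} R_{t+1}\right]
$$

where $\gamma \in [0, 1)$ is a discount factor that weights near-term rewards more heavily than distant ones.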
How policies are represented depends on the complexity of the problem. For simple environments with a small number of discrete states and actions, a policy might be stored in a lookup table. However, for problems with large or continuous state spaces (like controlling a robot based on sensor readings or playing video games from pixels), we often use function approximators, such as linear functions or neural networks, to represent the policy π. These approximators take the state representation as input and output either the action (deterministic) or the probabilities of actions (stochastic). We will cover function approximation in detail in later chapters.
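As a preview of that idea, here is a minimal sketch of a parameterized stochastic policy: a single linear layer followed by a softmax, implemented with NumPy. The state dimension, number of actions, and random weights are placeholders for illustration, not a recommended architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder sizes: a 4-dimensional state vector and 2 possible actions.
STATE_DIM, NUM_ACTIONS = 4, 2

# The policy's parameters: weights and biases of one linear layer.
W = rng.normal(scale=0.1, size=(NUM_ACTIONS, STATE_DIM))
b = np.zeros(NUM_ACTIONS)

def policy(state):
    """Map a state vector to a probability distribution over actions (softmax)."""
    logits = W @ state + b
    exp_logits = np.exp(logits - logits.max())  # subtract max for numerical stability
    return exp_logits / exp_logits.sum()

state = rng.normal(size=STATE_DIM)          # a stand-in for sensor readings
action_probs = policy(state)                # probabilities of each action
action = rng.choice(NUM_ACTIONS, p=action_probs)
print(action_probs, action)
```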
In summary, the policy is the core component dictating the agent's behavior. It's the strategy the agent follows to select actions based on states, and finding the best possible strategy, the optimal policy, is the fundamental goal of most RL algorithms. Understanding the difference between deterministic and stochastic policies is essential as we move towards exploring methods for learning these policies from experience.