Having established the core components of an MDP, namely the states (S), actions (A), rewards (R), and the discount factor (γ), we now focus on the element that describes how the environment behaves: the state transition probabilities, often denoted as P. These probabilities define the dynamics of the environment, specifying how states change in response to the agent's actions.
Think of the transition probabilities as the underlying rules or physics governing the environment. When the agent is in a particular state s and chooses to take an action a, what happens next? In many situations, the outcome isn't fixed: the environment might move to any one of several possible next states s′. The transition probability function gives us the likelihood of each possible outcome.
Formally, the state transition probability is defined as:
$$P(s' \mid s, a) = \Pr\{S_{t+1} = s' \mid S_t = s, A_t = a\}$$

This equation reads: "The probability of transitioning to state s′ at the next time step (t+1), given that the current state is s at time step t and the agent takes action a at time step t."
It's important to understand that these probabilities describe the environment's dynamics, not the agent's decision-making process (which is governed by the policy π). For any given state s and action a taken from that state, the probabilities of transitioning to all possible next states s′ must sum to 1:
$$\sum_{s' \in S} P(s' \mid s, a) = 1, \quad \text{for all } s \in S,\ a \in A(s)$$

where A(s) is the set of actions available in state s.
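To make the notation concrete, here is a minimal sketch of how such a transition model might be stored and checked in code. The state and action names are purely illustrative, not part of any particular library:

```python
# A minimal sketch: one common way to store P(s'|s, a) is a nested dictionary,
# P[s][a][s_next] -> probability. States "s0", "s1", "s2" and actions "a0", "a1"
# are illustrative placeholders.
P = {
    "s0": {
        "a0": {"s0": 0.2, "s1": 0.8},   # stochastic outcome
        "a1": {"s2": 1.0},              # deterministic outcome
    },
    "s1": {
        "a0": {"s0": 0.5, "s2": 0.5},
    },
}

# The normalization constraint: for every (s, a) pair, the probabilities over
# all possible next states s' must sum to 1.
for s, actions in P.items():
    for a, next_probs in actions.items():
        total = sum(next_probs.values())
        assert abs(total - 1.0) < 1e-9, f"P(.|{s}, {a}) sums to {total}, not 1"
```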
A fundamental assumption underlying MDPs is the Markov property. This property states that the future depends only on the present, not on the past. In the context of state transitions, it means the probability of moving to the next state s′ depends only on the current state s and the current action a. The history of states visited and actions taken before arriving at state s is irrelevant for predicting the immediate future.
Mathematically, the Markov property implies:
$$\Pr\{S_{t+1} = s' \mid S_t = s, A_t = a\} = \Pr\{S_{t+1} = s' \mid S_t, A_t, S_{t-1}, A_{t-1}, \ldots, S_0, A_0\}$$

This simplification is powerful because it allows us to model complex sequential problems without needing to keep track of the entire history. If the current state s fully captures all relevant information from the past needed to predict the future, the Markov property holds. Many real-world problems can be effectively modeled or approximated as MDPs by carefully defining the state representation to satisfy this property.
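As an illustrative sketch of how the choice of state representation determines whether the Markov property holds, consider an object moving with constant velocity. The setup and names below are assumptions used only for illustration:

```python
# Sketch: an object moves with momentum. If the "state" is only its position,
# predicting the next position also requires the previous position (to infer the
# velocity), so position alone is not a Markov state.
def next_position_from_history(position_history):
    velocity = position_history[-1] - position_history[-2]  # needs the past
    return position_history[-1] + velocity

# Including the velocity in the state makes (position, velocity) sufficient:
# the next state depends only on the current state, so the Markov property holds.
def next_state(state):
    position, velocity = state
    return (position + velocity, velocity)

print(next_position_from_history([0.0, 1.0, 2.0]))  # 3.0, but needed two past positions
print(next_state((2.0, 1.0)))                        # (3.0, 1.0), from the current state alone
```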
Consider a simple 3x3 grid where the agent can move North, South, East, or West. Let the states be the coordinates (x,y), where x,y∈{0,1,2}.
Deterministic Transitions: In a perfectly predictable environment, taking action 'East' from state (0,0) would always result in state (1,0). The transition probability would be P((1,0)∣(0,0),East)=1, and the probability of ending up in any other state would be 0.
Stochastic Transitions: Now, imagine a "slippery" grid. If the agent chooses action 'East' from state (0,0), maybe there's an 80% chance it works as intended (ending in (1,0)), a 10% chance it slips and stays in (0,0), and a 10% chance it slips sideways and ends up in state (0,1) (assuming (0,1) is North of (0,0)). The transition probabilities would be:

$$P((1,0) \mid (0,0), \text{East}) = 0.8, \quad P((0,0) \mid (0,0), \text{East}) = 0.1, \quad P((0,1) \mid (0,0), \text{East}) = 0.1$$

Notice that 0.8 + 0.1 + 0.1 = 1.0. The environment introduces uncertainty.
Here's a small diagram illustrating these stochastic transitions for the action 'East' from state (0,0):
Transition probabilities from state s=(0,0) when taking action a=East in a stochastic grid environment.
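The same stochastic transition can be written down and sampled from directly in code. The following is a small sketch of the slippery-grid example; the function and variable names are illustrative:

```python
import random

# P(s' | s=(0,0), a='East') for the slippery grid described above.
slippery_east = {(1, 0): 0.8, (0, 0): 0.1, (0, 1): 0.1}

def sample_next_state(transition_probs):
    """Draw one next state s' according to the given distribution P(s'|s, a)."""
    next_states = list(transition_probs.keys())
    weights = list(transition_probs.values())
    return random.choices(next_states, weights=weights, k=1)[0]

# Sampling many times approximately recovers the underlying probabilities.
counts = {}
for _ in range(10_000):
    s_next = sample_next_state(slippery_east)
    counts[s_next] = counts.get(s_next, 0) + 1
print(counts)  # roughly {(1, 0): ~8000, (0, 0): ~1000, (0, 1): ~1000}
```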
In some problems, we might be given the transition probabilities P(s′∣s,a) explicitly. This means we have a complete model of the environment's dynamics. Techniques like Dynamic Programming (which we'll discuss later) rely on having such a model.
However, in many practical reinforcement learning scenarios, we don't know these probabilities beforehand. The agent must learn about the environment's dynamics purely through interaction, sampling transitions (s,a,r,s′) by trying actions and observing outcomes. This is the domain of model-free RL methods like Q-learning and SARSA, which are central topics in this course.
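As a sketch of what "learning about the dynamics from interaction" can look like, the snippet below estimates P(s′∣s,a) empirically by counting observed transitions. Note that model-free methods such as Q-learning and SARSA do not build this estimate explicitly; this is only meant to show how sampled transitions carry information about the unknown dynamics. All names here are illustrative:

```python
from collections import defaultdict

# Counts of observed transitions: counts[(s, a)][s_next] = number of times seen.
counts = defaultdict(lambda: defaultdict(int))

def record_transition(s, a, s_next):
    """Record one sampled transition (s, a, s')."""
    counts[(s, a)][s_next] += 1

def estimate_P(s, a):
    """Empirical estimate of P(.|s, a) from the transitions observed so far."""
    total = sum(counts[(s, a)].values())
    if total == 0:
        return {}
    return {s_next: n / total for s_next, n in counts[(s, a)].items()}

# Example: a few observed 'East' moves from (0, 0) in the slippery grid.
for s_next in [(1, 0), (1, 0), (0, 0), (1, 0), (0, 1)]:
    record_transition((0, 0), "East", s_next)
print(estimate_P((0, 0), "East"))  # {(1, 0): 0.6, (0, 0): 0.2, (0, 1): 0.2}
```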
Understanding state transition probabilities is fundamental to grasping how MDPs formalize sequential decision problems. They represent the inherent dynamics the agent must navigate, whether those dynamics are known in advance or must be learned through experience.