As introduced, Reinforcement Learning often deals with sequential decision-making problems. The standard mathematical framework used to model these problems is the Markov Decision Process (MDP). If you've encountered RL before, you'll recognize MDPs as the bedrock upon which most algorithms are built. Let's quickly refresh our understanding of its core components.
An MDP formally describes the environment an RL agent interacts with. It assumes the environment is fully observable and satisfies the Markov property: the future state depends only on the current state and action, not on the sequence of states and actions that preceded it.
An MDP is typically defined by a tuple containing five elements: (S,A,P,R,γ).
The state space S is the set of all possible situations the agent can find itself in. States encapsulate all the necessary information about the environment relevant to the decision-making process. For example, in a chess game, the state would be the configuration of all pieces on the board. In a robot navigation task, the state might be the robot's coordinates (x, y) and orientation. State spaces can be discrete, with a finite number of distinct states (as in chess), or continuous, with real-valued components (as with coordinates and orientation).
The action space A is the set of all possible actions the agent can take. The available actions may depend on the current state, sometimes denoted as A(s). Like states, action spaces can be discrete (e.g., move left or right) or continuous (e.g., apply a torque anywhere within a range).
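As a concrete illustration, here is a minimal sketch of discrete state and action spaces for a tiny, hypothetical corridor world. The state and action names below are invented for this example; they are not part of any library or of the text above.

```python
# Hypothetical discrete state and action spaces for a tiny corridor world.
# "terminal" marks the absorbing goal state that ends an episode.
states = ["s0", "s1", "terminal"]
actions = ["left", "right"]
```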
The transition probability function P defines the dynamics of the environment. It specifies the probability of transitioning to a new state s′ after the agent takes action a in state s, written as $P(s' \mid s, a) = \Pr(S_{t+1} = s' \mid S_t = s, A_t = a)$. This probability distribution captures the stochasticity, or uncertainty, inherent in the environment's response to the agent's actions. The Markov property is embedded here: the next state s′ depends only on the current state s and action a.
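To make the transition function concrete, the sketch below encodes P(s′ | s, a) for the hypothetical corridor world as a nested dictionary and samples a next state from it. The structure and the specific probabilities are illustrative assumptions, not a prescribed representation.

```python
import random

# P[s][a] maps each possible next state s' to Pr(s' | s, a).
# For every (s, a) pair the probabilities sum to 1.
P = {
    "s0": {
        "right": {"s1": 0.9, "s0": 0.1},        # moving right usually succeeds
        "left":  {"s0": 1.0},                   # bumping into the left wall
    },
    "s1": {
        "right": {"terminal": 0.8, "s1": 0.2},  # reaching the goal is stochastic
        "left":  {"s0": 1.0},
    },
}

def sample_next_state(state, action):
    """Draw s' according to P(s' | s, a)."""
    next_states = list(P[state][action])
    probs = [P[state][action][s] for s in next_states]
    return random.choices(next_states, weights=probs, k=1)[0]
```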
The reward function R defines the goal of the RL problem. It specifies the immediate numerical reward r the agent receives after taking action a in state s and transitioning to state s′. It can be defined in slightly different ways, often as the expected reward for a state-action pair, R(s, a), or as a function of the full transition, R(s, a, s′).
The agent's objective is to maximize the cumulative reward over time, not just the immediate reward. Rewards guide the learning process, indicating which actions lead to desirable outcomes.
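Continuing the toy corridor example, a reward function of the form R(s, a, s′) can be written as an ordinary function. The +1 goal reward and the small step cost are arbitrary choices made for illustration.

```python
# Hypothetical reward function R(s, a, s'): +1 for reaching the goal,
# with a small per-step cost that nudges the agent toward shorter paths.
def reward(state, action, next_state):
    if next_state == "terminal":
        return 1.0
    return -0.01
```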
The discount factor γ is a value between 0 and 1 (0 ≤ γ ≤ 1) that determines the present value of future rewards. A reward received k steps in the future is discounted by a factor of $\gamma^k$.
Provided that γ < 1 and rewards are bounded, the discount factor ensures that the total expected reward remains finite in continuing tasks (tasks without a terminal state) and allows us to handle infinite sequences of rewards mathematically.
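The effect of the discount factor is easiest to see by computing a return directly. The small helper below sums the γ^k-weighted rewards of a finite reward sequence; the specific values are just examples.

```python
def discounted_return(rewards, gamma=0.9):
    """Compute G_t = r_{t+1} + gamma * r_{t+2} + gamma^2 * r_{t+3} + ..."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# A reward far in the future contributes less than an immediate one.
print(discounted_return([1.0, 0.0, 0.0], gamma=0.9))  # 1.0
print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))  # 0.81
```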
The interaction between the agent and the environment within an MDP framework follows a cycle:
The basic interaction loop in a Markov Decision Process. The agent observes a state, takes an action, and the environment responds with a new state and a reward.
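Putting the pieces together, the loop below runs one episode of this cycle on the toy corridor world, reusing the states, actions, sample_next_state, and reward definitions from the earlier sketches and using a uniformly random policy as a stand-in for a learned one.

```python
# One episode of the agent-environment loop for the toy corridor world.
state = "s0"
episode_return, discount, gamma = 0.0, 1.0, 0.9
while state != "terminal":
    action = random.choice(actions)                # agent selects an action
    next_state = sample_next_state(state, action)  # environment transitions
    r = reward(state, action, next_state)          # environment emits a reward
    episode_return += discount * r                 # accumulate the discounted return
    discount *= gamma
    state = next_state                             # agent observes the new state
print("Discounted episode return:", episode_return)
```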
Understanding this formal structure is necessary because RL algorithms are essentially methods for finding optimal policies within MDPs. When we discuss Q-learning, DQN, Policy Gradients, and Actor-Critic methods, they are all operating under the assumption that the problem can be modeled, at least approximately, as an MDP. The limitations of tabular methods, which we'll touch on next, arise when the components of this tuple, particularly the state and action spaces, become too large or complex to handle explicitly.