The Markov Decision Process (MDP) is the mathematical foundation for understanding advanced reinforcement learning techniques. It provides a formal framework for modeling sequential decision-making problems where outcomes are partly random and partly under the control of a decision-maker, or agent.
Think of an agent interacting with an environment over a sequence of discrete time steps $t = 0, 1, 2, \dots$. At each step $t$, the agent observes the environment's state $S_t$, selects an action $A_t$, receives a scalar reward $R_{t+1}$, and transitions to a new state $S_{t+1}$. The MDP formalizes this interaction.
A standard MDP is defined by a tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$:
The state space $\mathcal{S}$ is the set of all possible states the environment can be in. A state should ideally capture all relevant information about the environment needed to make an optimal decision. States can range from simple, discrete representations (like positions on a chessboard) to complex, high-dimensional, continuous vectors (like pixel values from a camera feed or the joint angles of a robot). In this advanced course, we will frequently deal with large or continuous state spaces where function approximation becomes necessary.
The action space $\mathcal{A}$ is the set of all possible actions the agent can take. Similar to states, actions can be discrete (like 'up', 'down', 'left', 'right' in a grid) or continuous (like the amount of torque to apply to a motor). The specific set of actions available might depend on the current state $s$, denoted $\mathcal{A}(s)$.
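As a concrete illustration, here is a minimal sketch of state-dependent action sets; the state names and actions are hypothetical, chosen to mirror the grid example above:

```python
# A sketch of state-dependent action sets A(s).
# State names and actions are hypothetical.
available_actions = {
    "corner_cell": ["right", "down"],                # walls block 'up' and 'left'
    "center_cell": ["up", "down", "left", "right"],  # all moves allowed
}

def actions(state):
    """Return the set of actions A(s) available in `state`."""
    return available_actions[state]
```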
The transition probability function $P$ defines the dynamics of the environment. It specifies the probability of transitioning to a state $s'$ and receiving reward $r$ given that the agent was in state $s$ and took action $a$. This is often written as:

$$p(s', r \mid s, a) = \Pr\{S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a\}$$
Sometimes, the transition probability is defined purely over states, $p(s' \mid s, a)$, with the reward function handled separately.
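To make the dynamics concrete, here is a minimal sketch of $p(s', r \mid s, a)$ for a small, hypothetical two-state MDP, represented as a lookup table of outcomes and sampled with Python's standard library:

```python
import random

# Hypothetical dynamics p(s', r | s, a): each (state, action) pair maps to a
# list of (next_state, reward, probability) outcomes whose probabilities sum to 1.
P = {
    ("s0", "stay"): [("s0", 0.0, 0.9), ("s1", 1.0, 0.1)],
    ("s0", "move"): [("s1", 1.0, 0.8), ("s0", 0.0, 0.2)],
    ("s1", "stay"): [("s1", 0.0, 1.0)],
    ("s1", "move"): [("s0", 0.0, 0.7), ("s1", 1.0, 0.3)],
}

def step(state, action):
    """Sample (next_state, reward) from p(s', r | s, a)."""
    outcomes = P[(state, action)]
    probs = [p for (_, _, p) in outcomes]
    next_state, reward, _ = random.choices(outcomes, weights=probs, k=1)[0]
    return next_state, reward

next_state, reward = step("s0", "move")
```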
A defining characteristic of an MDP is the Markov Property. This property states that the future state $S_{t+1}$ and reward $R_{t+1}$ depend only on the current state $S_t$ and the action $A_t$, not on the entire history of previous states and actions. Mathematically:

$$\Pr\{S_{t+1} = s', R_{t+1} = r \mid S_t, A_t\} = \Pr\{S_{t+1} = s', R_{t+1} = r \mid S_0, A_0, R_1, \dots, S_{t-1}, A_{t-1}, R_t, S_t, A_t\}$$
"The current state is assumed to encapsulate all necessary information from the past. While problems might not perfectly adhere to this, the MDP framework is often a powerful and effective approximation."
The reward function $R$ specifies the immediate numerical feedback the agent receives. It can be defined in several ways, often as the expected immediate reward upon transitioning from state $s$ after taking action $a$:

$$r(s, a) = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$$
Alternatively, it might depend on the resulting state as well: $r(s, a, s') = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s']$. The reward signal is fundamental; it defines the goal of the RL problem. The agent's objective is derived from maximizing the cumulative sum of these rewards over time.
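When the joint dynamics $p(s', r \mid s, a)$ are available, the expected immediate reward can be computed by summing over the possible outcomes. The sketch below reuses the hypothetical dynamics table `P` from the earlier example:

```python
def expected_reward(P, state, action):
    """Compute r(s, a) = sum over (s', r) of r * p(s', r | s, a)."""
    return sum(r * p for (_next_state, r, p) in P[(state, action)])

# With the hypothetical P above: 0.8 * 1.0 + 0.2 * 0.0 = 0.8
print(expected_reward(P, "s0", "move"))
```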
The discount factor $\gamma$ is a scalar between 0 and 1 ($0 \le \gamma \le 1$). It determines the present value of future rewards. A reward received $k$ time steps in the future is worth only $\gamma^k$ times what it would be worth if received immediately.
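A quick numerical check of this weighting (the value of $\gamma$ here is just an example):

```python
gamma = 0.95  # example discount factor

# A reward received k steps in the future is weighted by gamma ** k.
for k in [0, 1, 10, 100]:
    print(f"weight on a reward {k} steps ahead: {gamma ** k:.4f}")
```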
An agent's behavior is defined by its policy, $\pi$. A policy is a mapping from states to probabilities of selecting each possible action. If the agent is in state $s$ at time $t$, then $\pi(a \mid s)$ is the probability that $A_t = a$. Reinforcement learning methods aim to find a policy that maximizes the expected cumulative reward.
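A stochastic policy can be represented directly as a table of action probabilities per state. The sketch below uses the same hypothetical states and actions as the dynamics example:

```python
import random

# pi(a | s) as a mapping from state to action probabilities (hypothetical values).
policy = {
    "s0": {"stay": 0.3, "move": 0.7},
    "s1": {"stay": 0.9, "move": 0.1},
}

def sample_action(policy, state):
    """Sample A_t ~ pi(. | S_t = state)."""
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

action = sample_action(policy, "s0")
```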
The agent's objective is to maximize the expected return, which is the cumulative sum of discounted rewards starting from time step $t$. The return $G_t$ is defined as:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
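In code, the return of a finite trajectory of observed rewards is a straightforward discounted sum, as in this sketch:

```python
def discounted_return(rewards, gamma):
    """Compute G_t = sum_k gamma**k * R_{t+k+1} for a finite list of rewards."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# Rewards observed after time t: 1.0 + 0.9 * 0.0 + 0.81 * 2.0 = 2.62
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))
```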
The core task in RL is to find a policy that maximizes the expected return from each state $s$. To achieve this, we often estimate value functions, which quantify the expected return from a state ($V^{\pi}(s)$) or a state-action pair ($Q^{\pi}(s, a)$) when following policy $\pi$. We will examine these value functions and the equations that govern them (the Bellman equations) in the next section.
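Although value functions are covered in the next section, the connection between policies, dynamics, and returns can already be sketched: rolling out episodes under a policy and averaging their discounted returns gives a crude Monte Carlo estimate of the expected return from a state. The sketch below assumes the hypothetical `step`, `sample_action`, and `discounted_return` helpers defined earlier:

```python
def rollout_return(start_state, policy, gamma, horizon=50):
    """Follow the policy for a fixed horizon and return the discounted return."""
    state, rewards = start_state, []
    for _ in range(horizon):
        action = sample_action(policy, state)
        state, reward = step(state, action)
        rewards.append(reward)
    return discounted_return(rewards, gamma)

# Averaging many rollouts crudely estimates the expected return from "s0".
estimate = sum(rollout_return("s0", policy, 0.95) for _ in range(1000)) / 1000
```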
Understanding this MDP formulation, its components, and the underlying Markov assumption is the starting point for developing and analyzing the advanced RL algorithms covered in this course. Even when dealing with complex deep learning models, large state/action spaces, or multi-agent scenarios, these fundamental concepts remain central.