As we prepare to study advanced reinforcement learning techniques, it's essential to revisit the mathematical foundation upon which nearly all RL algorithms are built: the Markov Decision Process (MDP). An MDP provides a formal framework for modeling sequential decision-making problems where outcomes are partly random and partly under the control of a decision-maker, or agent.
Think of an agent interacting with an environment over a sequence of discrete time steps $t = 0, 1, 2, \dots$. At each step $t$, the agent observes the environment's state $S_t$, selects an action $A_t$, receives a scalar reward $R_{t+1}$, and transitions to a new state $S_{t+1}$. The MDP formalizes this interaction.
A standard MDP is defined by a tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$:
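To make this interaction loop concrete, here is a minimal sketch in Python. The `env` object with `reset()` and `step(action)` methods and the `policy` callable are assumed interfaces for illustration (similar in spirit to common RL toolkits, but not tied to any specific library):

```python
# A minimal sketch of the agent-environment loop. The env and policy
# interfaces below are hypothetical, used only to illustrate the cycle
# S_t -> A_t -> R_{t+1}, S_{t+1}.

def run_episode(env, policy, max_steps=100):
    """Roll out one episode: observe S_t, choose A_t, receive R_{t+1} and S_{t+1}."""
    state = env.reset()                               # S_0
    trajectory = []
    for t in range(max_steps):
        action = policy(state)                        # A_t ~ pi(. | S_t)
        next_state, reward, done = env.step(action)   # R_{t+1}, S_{t+1}
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory
```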
$\mathcal{S}$ is the set of all possible states the environment can be in. A state $s \in \mathcal{S}$ should ideally capture all relevant information about the environment needed to make an optimal decision. States can range from simple, discrete representations (like positions on a chessboard) to complex, high-dimensional, continuous vectors (like pixel values from a camera feed or joint angles of a robot). In this advanced course, we will frequently deal with large or continuous state spaces where function approximation becomes necessary.
$\mathcal{A}$ is the set of all possible actions the agent can take. Similar to states, actions $a \in \mathcal{A}$ can be discrete (like 'up', 'down', 'left', 'right' in a grid world) or continuous (like the amount of torque to apply to a motor). The specific set of actions available might depend on the current state $s$, denoted as $\mathcal{A}(s)$.
The transition probability function $P$ defines the dynamics of the environment. It specifies the probability of transitioning to a state $s'$ and receiving reward $r$ given that the agent was in state $s$ and took action $a$. This is often written as $p(s', r \mid s, a)$:
$$p(s', r \mid s, a) = \Pr\{S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a\}$$
Sometimes, the transition probability is defined purely over states, $P(s' \mid s, a) = \sum_r p(s', r \mid s, a)$, with the reward function handled separately.
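As a concrete illustration, the dynamics of a small tabular MDP can be stored as a mapping from $(s, a)$ pairs to outcome distributions. The two-state example below is invented for illustration; only the marginalization $P(s' \mid s, a) = \sum_r p(s', r \mid s, a)$ comes from the definition above:

```python
from collections import defaultdict

# Hypothetical tabular dynamics: (s, a) -> list of (s', r, probability).
# The states 'A', 'B' and the rewards are made-up values for illustration.
dynamics = {
    ('A', 'left'):  [('A', 0.0, 0.9), ('B', 1.0, 0.1)],
    ('A', 'right'): [('B', 1.0, 0.8), ('A', 0.0, 0.2)],
    ('B', 'left'):  [('A', 0.0, 1.0)],
    ('B', 'right'): [('B', 5.0, 1.0)],
}

def state_transition_probs(dynamics, s, a):
    """Marginalize over rewards: P(s' | s, a) = sum_r p(s', r | s, a)."""
    probs = defaultdict(float)
    for s_next, r, p in dynamics[(s, a)]:
        probs[s_next] += p
    return dict(probs)

print(state_transition_probs(dynamics, 'A', 'left'))  # {'A': 0.9, 'B': 0.1}
```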
A defining characteristic of an MDP is the Markov Property. This property states that the future state $S_{t+1}$ and reward $R_{t+1}$ depend only on the current state $S_t$ and the action $A_t$, not on the entire history of previous states and actions. Mathematically:
$$\Pr\{S_{t+1} = s', R_{t+1} = r \mid S_t, A_t, S_{t-1}, A_{t-1}, \dots, S_0, A_0\} = \Pr\{S_{t+1} = s', R_{t+1} = r \mid S_t, A_t\}$$
The current state $S_t$ is assumed to encapsulate all necessary information from the past. While real-world problems might not perfectly adhere to this, the MDP framework is often a powerful and effective approximation.
The reward function $R$ specifies the immediate numerical feedback the agent receives. It can be defined in several ways, often as the expected immediate reward upon transitioning from state $s$ after taking action $a$:
$$r(s, a) = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a] = \sum_r r \sum_{s'} p(s', r \mid s, a)$$
Alternatively, it might depend on the resulting state $s'$ as well: $r(s, a, s') = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s']$. The reward signal $R_{t+1}$ is fundamental; it defines the goal of the RL problem. The agent's objective is derived from maximizing the cumulative sum of these rewards over time.
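Using the same kind of hypothetical $(s, a) \rightarrow [(s', r, \text{prob})]$ table as before, the expected immediate reward $r(s, a)$ can be computed directly from $p(s', r \mid s, a)$. This is only a sketch of the formula, not a required implementation:

```python
# Expected immediate reward r(s, a) = sum_r r * sum_{s'} p(s', r | s, a),
# computed over a hypothetical (s, a) -> [(s', r, probability)] table.
dynamics = {
    ('A', 'left'):  [('A', 0.0, 0.9), ('B', 1.0, 0.1)],
    ('A', 'right'): [('B', 1.0, 0.8), ('A', 0.0, 0.2)],
}

def expected_reward(dynamics, s, a):
    """r(s, a): probability-weighted average of immediate rewards."""
    return sum(p * r for _, r, p in dynamics[(s, a)])

print(expected_reward(dynamics, 'A', 'left'))   # 0.1
print(expected_reward(dynamics, 'A', 'right'))  # 0.8
```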
The discount factor $\gamma$ is a scalar between 0 and 1 ($0 \le \gamma \le 1$). It determines the present value of future rewards. A reward received $k$ time steps in the future is worth only $\gamma^{k-1}$ times what it would be worth if received immediately.
An agent's behavior is defined by its policy, $\pi$. A policy is a mapping from states to probabilities of selecting each possible action. If the agent is in state $s$ at time $t$, then $\pi(a \mid s)$ is the probability that $A_t = a$. Reinforcement learning methods aim to find a policy that maximizes the expected cumulative reward.
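For a discrete action set, a stochastic policy can be represented as a per-state probability distribution over actions. The sketch below, with made-up states and probabilities, simply samples $A_t \sim \pi(\cdot \mid S_t)$:

```python
import random

# Hypothetical stochastic policy pi(a | s) for two states and two actions.
policy_table = {
    'A': {'left': 0.3, 'right': 0.7},
    'B': {'left': 0.5, 'right': 0.5},
}

def sample_action(policy_table, state):
    """Draw A_t ~ pi(. | S_t) from the tabular policy."""
    actions, probs = zip(*policy_table[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(sample_action(policy_table, 'A'))  # 'right' with probability 0.7
```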
The agent's objective is to maximize the expected return, which is the cumulative sum of discounted rewards starting from time step $t$. The return $G_t$ is defined as:
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
The core task in RL is to find a policy $\pi$ that maximizes the expected return $\mathbb{E}[G_t]$ from each state $s$. To achieve this, we often estimate value functions, which quantify the expected return from a state ($V^\pi(s)$) or a state-action pair ($Q^\pi(s, a)$) when following policy $\pi$. We will examine these value functions and the equations that govern them (the Bellman equations) in the next section.
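For a finite reward sequence, the return can be accumulated either forward with explicit powers of $\gamma$ or backward with the recursion $G_t = R_{t+1} + \gamma G_{t+1}$. A small sketch with made-up rewards:

```python
# Discounted return G_t = sum_k gamma^k * R_{t+k+1}, computed backward via
# the recursion G_t = R_{t+1} + gamma * G_{t+1}. The rewards below are made up.
def discounted_return(rewards, gamma=0.99):
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [1.0, 0.0, 0.0, 10.0]           # R_1, R_2, R_3, R_4
print(discounted_return(rewards, 0.9))    # 1.0 + 0.9**3 * 10.0 = 8.29
```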
Understanding this MDP formulation, its components, and the underlying Markov assumption is the starting point for developing and analyzing the advanced RL algorithms covered in this course. Even when dealing with complex deep learning models, large state/action spaces, or multi-agent scenarios, these fundamental concepts remain central.