As we saw in the previous chapter, Reinforcement Learning involves an agent interacting with an environment over time, learning to make decisions that maximize some notion of cumulative reward. To analyze and solve these problems rigorously, we need a precise mathematical framework. The most common framework for sequential decision-making under uncertainty is the Markov Decision Process, or MDP.
An MDP provides a formal way to describe the environment in an RL problem. It assumes the environment satisfies the Markov property, meaning the future state and reward depend only on the current state and action, not on the entire history leading up to them. We'll revisit this property shortly.
Formally, an MDP is defined by a tuple containing five components: (S,A,P,R,γ). Let's break down each part:
The state space S is the set of all possible states the environment can be in. A state s∈S should ideally capture all relevant information about the environment needed to make an optimal decision. Think of it as a snapshot of the situation at a particular moment.
For example, in a simple grid world game, a state could be the agent's (x,y) coordinates. In a game of Go, the state would describe the configuration of stones on the board. States can be discrete (like in the grid world example, with a finite number of positions) or continuous (like the precise joint angles and velocities of a robot arm). The complexity and size of the state space significantly impact which RL algorithms are suitable.
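To make the discrete versus continuous distinction concrete, here is a minimal sketch in Python. The 4x4 grid size, the (x, y) encoding, and the robot-arm state vector are illustrative assumptions, not something fixed by the text.

```python
# Minimal sketch of a discrete state space for a 4x4 grid world.
# The grid dimensions and (x, y) encoding are illustrative assumptions.
GRID_WIDTH, GRID_HEIGHT = 4, 4

# Each state is simply the agent's (x, y) position on the grid.
states = [(x, y) for x in range(GRID_WIDTH) for y in range(GRID_HEIGHT)]
print(len(states))  # 16 discrete states

# A continuous state, by contrast, might be a vector of real numbers,
# e.g. joint angles and angular velocities of a 2-joint robot arm:
continuous_state_example = (0.35, -1.2, 0.01, 0.8)  # (theta1, theta2, omega1, omega2)
```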
A significant assumption underpinning MDPs is the Markov Property. It states that the probability of transitioning to the next state s′ depends only on the current state s and the action a taken, and is independent of all previous states and actions. Mathematically, where $S_t$ is the state and $A_t$ the action at time $t$:
$$P(S_{t+1} = s' \mid S_t = s, A_t = a, S_{t-1}, A_{t-1}, \ldots, S_0, A_0) = P(S_{t+1} = s' \mid S_t = s, A_t = a)$$

The current state $s_t$ is said to contain all necessary information from the history. This property simplifies modeling substantially, as we don't need to track the entire sequence of past events, only the present state.
The action space A is the set of all possible actions the agent can choose from. Sometimes, the set of available actions depends on the current state s, in which case we denote it as A(s). An action a∈A (or a∈A(s)) is the decision made by the agent at a given time step.
In our grid world, actions might be A={up, down, left, right}. For a thermostat controlling room temperature, actions could be A={heat, cool, off}. Actions, like states, can be discrete (a finite set of choices) or continuous (e.g., the amount of force to apply to a pedal, represented by a real number).
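To illustrate a state-dependent action set A(s), the following hypothetical sketch restricts the grid-world actions above so that moves which would leave the grid are unavailable. The grid dimensions and the `available_actions` helper are assumptions made for this example.

```python
# Hypothetical sketch: a state-dependent action set A(s) for the grid world.
# Actions that would move the agent off the 4x4 grid are excluded.
ACTIONS = ["up", "down", "left", "right"]

def available_actions(state, width=4, height=4):
    """Return the subset of ACTIONS available in the given (x, y) state."""
    x, y = state
    actions = []
    if y < height - 1:
        actions.append("up")
    if y > 0:
        actions.append("down")
    if x > 0:
        actions.append("left")
    if x < width - 1:
        actions.append("right")
    return actions

print(available_actions((0, 0)))  # ['up', 'right'] -- a corner state has only two moves
```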
The transition probability function P describes the dynamics of the environment, often called the model. It specifies the probability of moving from the current state s to a possible next state s′ given that the agent takes action a. It's formally written as a function P:S×A×S→[0,1]:
$$P(s' \mid s, a) = \Pr(S_{t+1} = s' \mid S_t = s, A_t = a)$$

This function tells us how likely each potential outcome s′ is, given where we are (s) and what we do (a). Since these are probabilities, the sum over all possible next states s′ must equal 1 for any starting state s and action a:
$$\sum_{s' \in S} P(s' \mid s, a) = 1 \quad \text{for all } s \in S,\; a \in A(s)$$

The transition function captures the underlying rules of the environment. If P(s′∣s,a)=1 for a specific s′, the transition is deterministic for that (s,a) pair. More often, environments are stochastic, meaning taking the same action in the same state might lead to different outcomes with different probabilities (e.g., a robot's movement might be affected by wheel slippage).
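One simple way to represent a tabular transition function is a dictionary mapping (s, a) pairs to distributions over next states. The sketch below continues the grid-world example; the specific "slip" probabilities are made up for illustration.

```python
import random

# Sketch of a tabular transition function P(s' | s, a) stored as a dictionary.
# The "slippery" probabilities are illustrative: from (1, 1), moving right
# succeeds 80% of the time, but the agent slips up or down 10% of the time each.
P = {
    ((1, 1), "right"): {(2, 1): 0.8, (1, 2): 0.1, (1, 0): 0.1},
    ((1, 1), "up"):    {(1, 2): 1.0},  # a deterministic transition
}

# For every (s, a) pair, the probabilities over next states must sum to 1.
for (s, a), dist in P.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-9, f"P(.|{s},{a}) does not sum to 1"

def sample_next_state(state, action):
    """Draw a next state s' according to P(s' | state, action)."""
    dist = P[(state, action)]
    next_states, probs = zip(*dist.items())
    return random.choices(next_states, weights=probs, k=1)[0]

print(sample_next_state((1, 1), "right"))  # usually (2, 1), occasionally a slip
```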
The reward function R defines the immediate feedback signal the agent receives from the environment. It quantifies the desirability of transitions or states. Rewards are scalar values, r∈R. There are a few common ways to define the reward function:
A frequently used definition gives the expected immediate reward given the current state and action:
$$R(s, a) = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a] = \sum_{s' \in S} P(s' \mid s, a) \sum_{r} r \cdot p(r \mid s, a, s')$$

where p(r∣s,a,s′) is the probability of receiving reward r upon transitioning to s′ from s via a. Simpler forms like R(s,a,s′) are also common.
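The expectation above can be computed directly from the transition model and a reward distribution. The sketch below continues the slippery grid-world example; the particular reward values and probabilities in `reward_distribution` are invented for illustration.

```python
# Sketch: computing the expected immediate reward R(s, a) from the model.
# The reward distribution p(r | s, a, s') below is purely illustrative:
# landing on (2, 1) yields +1 with probability 0.9 and 0 otherwise; any
# other (slipped) outcome yields a small penalty of -0.1.
P = {
    ((1, 1), "right"): {(2, 1): 0.8, (1, 2): 0.1, (1, 0): 0.1},
}

def reward_distribution(s, a, s_next):
    """p(r | s, a, s') as a dict mapping reward value -> probability."""
    if s_next == (2, 1):
        return {1.0: 0.9, 0.0: 0.1}
    return {-0.1: 1.0}

def expected_reward(s, a):
    """R(s, a) = sum_{s'} P(s'|s,a) * sum_r r * p(r|s,a,s')."""
    total = 0.0
    for s_next, p_transition in P[(s, a)].items():
        mean_r = sum(r * p_r for r, p_r in reward_distribution(s, a, s_next).items())
        total += p_transition * mean_r
    return total

print(expected_reward((1, 1), "right"))  # 0.8 * 0.9 + 0.2 * (-0.1) = 0.7
```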
The reward function implicitly specifies the goal of the RL agent. The agent learns to select actions that maximize the cumulative reward over time, not just the immediate reward. Designing an effective reward function is an important aspect of applying RL. For instance, in a game, giving a reward only at the end (+1 for winning, -1 for losing) results in sparse rewards, which can make learning slow. Providing intermediate rewards (e.g., for capturing opponent pieces in chess) can create denser feedback but must be done carefully to avoid encouraging undesirable short-sighted strategies.
The discount factor γ (gamma) is a value between 0 and 1 (0≤γ≤1). It controls the importance of future rewards relative to immediate rewards. A reward received k time steps into the future is only worth $\gamma^k$ times what it would be worth if received immediately.
The discount factor serves several purposes:

- Mathematical convenience: with γ<1, the sum of rewards in a continuing (infinite-horizon) task converges to a finite value.
- Preference for sooner rewards: it encodes that an immediate reward is typically worth more than the same reward received later.
- Uncertainty about the future: rewards predicted far ahead are less certain to materialize, so they are weighted less.

With γ=0 the agent is myopic, caring only about the immediate reward; as γ approaches 1, the agent becomes increasingly far-sighted. The agent's objective is typically formulated as maximizing the expected discounted return, i.e., the sum of discounted rewards obtained over the future; we will define this more formally later.
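As a quick numerical sketch of a discounted return, the snippet below sums a short, arbitrary reward sequence weighted by increasing powers of γ. Both the reward values and γ = 0.9 are assumptions chosen for illustration.

```python
# Sketch: computing a discounted return G = r_1 + gamma * r_2 + gamma^2 * r_3 + ...
# The reward sequence and gamma value are arbitrary illustrations.
gamma = 0.9
rewards = [0.0, 0.0, 1.0, 0.0, 5.0]  # rewards received over the next 5 steps

discounted_return = sum(gamma**k * r for k, r in enumerate(rewards))
print(discounted_return)  # 0.9**2 * 1.0 + 0.9**4 * 5.0 = 4.0905

# With gamma = 0, only the first reward would matter (a myopic agent);
# with gamma close to 1, distant rewards count almost as much as immediate ones.
```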
Figure: The five components that formally define a Markov Decision Process.
These five components, S, A, P, R, and γ, together provide a complete specification of the sequential decision-making problem under the MDP framework. They define the structure of the environment and the goal the Reinforcement Learning agent aims to achieve: finding a strategy (a policy) for choosing actions that maximizes the expected long-term discounted reward. Understanding these components is fundamental to grasping the algorithms we will explore later for solving MDPs.
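To tie the pieces together, one minimal way to bundle the five components in code is a small container like the sketch below. The field names and callable signatures are just one possible convention, not a standard API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Hashable, List

State = Hashable
Action = Hashable

@dataclass
class MDP:
    """A minimal container for the five MDP components (S, A, P, R, gamma).

    Field names and types are one possible convention, assumed for illustration.
    """
    states: List[State]                                         # S
    actions: Callable[[State], List[Action]]                    # A(s)
    transition: Callable[[State, Action], Dict[State, float]]   # P(s' | s, a)
    reward: Callable[[State, Action, State], float]             # R(s, a, s')
    gamma: float                                                # discount factor, 0 <= gamma <= 1
```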