Building upon the core concepts of agents, environments, and their interactions, we now require a formal framework to represent the problems Reinforcement Learning addresses. This chapter introduces Markov Decision Processes (MDPs), the standard mathematical tool for modeling sequential decision-making where outcomes are partly random and partly under the control of a decision-maker.
You will learn to define the key elements of an MDP: the set of states, the set of actions, the state transition probabilities, the reward function, and the discount factor.
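To make these elements concrete before the formal definitions, here is a minimal sketch of a hypothetical two-state MDP in Python. The state names, transition probabilities, and reward values are invented purely for illustration; the chapter's sections define each piece precisely.

```python
# A hypothetical two-state MDP, written out element by element.
states = ["s0", "s1"]
actions = ["stay", "move"]
gamma = 0.9  # discount factor

# Transition probabilities P(s' | s, a), keyed by (state, action).
transitions = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "move"): {"s1": 0.8, "s0": 0.2},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s0": 0.8, "s1": 0.2},
}

# Expected immediate reward R(s, a) for taking action a in state s.
rewards = {
    ("s0", "stay"): 0.0,
    ("s0", "move"): 1.0,
    ("s1", "stay"): 0.5,
    ("s1", "move"): 0.0,
}

# Sanity check: outgoing probabilities must sum to 1 for each (state, action) pair.
for (s, a), dist in transitions.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-9, f"P(.|{s},{a}) does not sum to 1"
```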
We will explore how an agent's behavior is defined by a policy (π) and how to evaluate the "goodness" of states and state-action pairs using value functions (V^π and Q^π). This leads to the goal of RL within the MDP framework: finding an optimal policy (π*) that maximizes the expected cumulative reward.
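As a preview of the notation developed in sections 2.5 through 2.7, one common way to write the value functions (using the convention that the reward following state $S_t$ is $R_{t+1}$) is

$$
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} R_{t+1} \,\middle|\, S_0 = s\right],
\qquad
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} R_{t+1} \,\middle|\, S_0 = s,\ A_0 = a\right].
$$

An optimal policy π* is then one satisfying $V^{\pi^*}(s) \ge V^{\pi}(s)$ for every state $s$ and every policy $\pi$.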
2.1 Modeling Sequential Decision Making
2.2 Formal Definition of an MDP
2.3 State Transition Probabilities
2.4 Reward Functions
2.5 Return: Cumulative Future Rewards
2.6 Discounting Future Rewards
2.7 Policies and Value Functions (V^π, Q^π)
2.8 Finding Optimal Policies