Okay, let's dive into the specifics of states, actions, and rewards. These three elements form the core feedback loop in nearly every Reinforcement Learning problem. Understanding them clearly is essential before we move on to how an agent actually learns. Building on the idea of an agent interacting with an environment, we now formalize what information flows between them.
A state is a snapshot of the environment at a particular moment in time. It contains the information the agent needs (or is allowed) to perceive to make a decision. Think of it as the current context or situation.
Formally, we denote the state at time step $t$ as $s_t$. The set of all possible states the environment can be in is called the state space, denoted by $S$. The nature of a state can vary widely depending on the problem.
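As a concrete but purely illustrative sketch (plain Python, with values invented for this example), here are two ways a state might be represented: a discrete grid coordinate for a simple navigation task, and a continuous vector of sensor readings for a control task.

```python
import numpy as np

# Grid-world task: the state is just the agent's cell, a discrete pair.
grid_state = (2, 3)  # (row, column) on a 5x5 grid

# Control task: the state is a continuous vector of sensor readings,
# e.g. cart position, cart velocity, pole angle, pole angular velocity.
control_state = np.array([0.02, -0.31, 0.045, 0.27], dtype=np.float32)

# In both cases the state space S is the set of all values a state can take:
# the 25 grid cells in the first case, all (bounded) real-valued 4-vectors
# in the second.
print(grid_state, control_state)
```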
An important concept related to states is observability. If the agent can perceive the complete state of the environment, the environment is fully observable; if it only receives a partial or noisy view (often called an observation), the environment is partially observable.
For much of this course, especially when introducing core concepts like Markov Decision Processes (MDPs) in the next chapter, we'll assume the environment is fully observable or that the state representation we use captures all relevant information (satisfying the Markov property).
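To make the distinction concrete, the following sketch (a hypothetical grid world invented for this example) contrasts the full environment state with a partial observation that hides anything outside the agent's field of view.

```python
# Hypothetical grid world, used only for illustration.
# The full environment state: agent position, goal position, and wall layout.
full_state = {
    "agent": (2, 3),
    "goal": (4, 4),
    "walls": {(1, 1), (1, 2), (3, 0)},
}

def partial_observation(state, radius=1):
    """A partially observable view: the agent only sees cells within
    `radius` of its own position, not the whole grid."""
    ar, ac = state["agent"]
    visible = {
        (r, c)
        for r in range(ar - radius, ar + radius + 1)
        for c in range(ac - radius, ac + radius + 1)
    }
    return {
        "agent": state["agent"],
        "goal": state["goal"] if state["goal"] in visible else None,
        "nearby_walls": state["walls"] & visible,
    }

# With the full state, the next state depends only on the current state and
# action (the Markov property). With the partial observation, information
# such as the goal location may be hidden, so a single observation is no
# longer a sufficient summary of the situation.
print(partial_observation(full_state))
```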
Based on the current state $s_t$, the agent chooses an action, denoted as $a_t$. An action is simply a decision the agent makes, one of the choices available to it.
The set of all possible actions the agent can take is the action space, denoted by $A$. Sometimes the available actions depend on the current state, in which case we might write $A(s_t)$ for the set of valid actions in state $s_t$. Like states, actions can be:
- Discrete: a finite set of distinct choices, such as `move_north`, `move_south`, `move_east`, `move_west` in a grid world, or `joystick_left`, `joystick_right`, `button_fire` in a simple game.
- Continuous: real-valued quantities, such as a steering angle or the amount of torque applied to a joint.

The nature of the action space significantly influences which RL algorithms are most suitable.
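As an illustration, assuming the Gymnasium library is available (an assumption about tooling, not something the concepts require), discrete and continuous action spaces can be declared as follows; the meaning assigned to each integer is an arbitrary choice made for this example.

```python
import numpy as np
from gymnasium import spaces  # assumes the gymnasium package is installed

# Discrete action space: four grid-world moves, encoded as integers 0..3.
discrete_actions = spaces.Discrete(4)  # e.g. 0=north, 1=south, 2=east, 3=west

# Continuous action space: a single real-valued control,
# e.g. a steering angle in [-1, 1].
continuous_actions = spaces.Box(low=-1.0, high=1.0, shape=(1,), dtype=np.float32)

print(discrete_actions.sample())    # a random integer action, e.g. 2
print(continuous_actions.sample())  # a random float action, e.g. [0.37]
```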
After the agent takes action $a_t$ in state $s_t$, the environment transitions to a new state $s_{t+1}$ and provides a numerical reward, denoted $r_{t+1}$. This reward signal is crucial; it's the primary feedback mechanism that tells the agent how well it's doing.
The Reward Hypothesis is a fundamental concept in RL: it states that all goals and purposes can be thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (the reward). Designing an effective reward function is often one of the most challenging aspects of applying RL in practice. A poorly designed reward function can lead to unintended or suboptimal agent behavior. For instance, rewarding a cleaning robot solely for collecting dust might lead it to dump the dust just to collect it again!
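Here is a tiny numerical sketch of the reward hypothesis, using made-up rewards for the cleaning-robot example: the agent's objective is the cumulative sum of the scalar rewards it receives, so behavior that earns collection rewards but also dumping penalties can still end up with a poor total.

```python
# Rewards received by a hypothetical cleaning robot over one episode:
# +1 for each piece of dust collected, -5 for dumping dust back out.
rewards = [1, 1, 1, -5, 1, 1]

# The quantity the agent ultimately tries to maximize is the cumulative
# sum of these scalar signals (the return for this episode).
episode_return = sum(rewards)
print(episode_return)  # 0: the dumping penalty wiped out the collection rewards
```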
These three components form the basis of the agent-environment interaction loop:
The agent observes the state and selects an action, which influences the environment. The environment then returns the next state and a reward to the agent, completing one step of the interaction loop.
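Below is a minimal sketch of this loop in Python. The `Environment` and `Agent` classes are hypothetical stand-ins written for this example; real environments (for instance, those following the Gymnasium reset/step convention) expose the same pattern.

```python
class Environment:
    def reset(self):
        """Return the initial state s_0."""
        self.position = 0
        return self.position

    def step(self, action):
        """Apply the action and return (next_state, reward, done)."""
        self.position += 1 if action == 1 else -1
        reward = 1.0 if self.position == 3 else 0.0
        done = self.position == 3
        return self.position, reward, done


class Agent:
    def select_action(self, state):
        """A trivial policy for illustration: always move right (action 1)."""
        return 1


env, agent = Environment(), Agent()
state = env.reset()
done = False
while not done:
    action = agent.select_action(state)          # agent acts based on s_t
    next_state, reward, done = env.step(action)  # environment returns s_{t+1}, r_{t+1}
    state = next_state
```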
Understanding states, actions, and rewards is the first step in formally defining an RL problem. In the next chapter, we will see how these elements, along with the environment's dynamics, are captured within the framework of Markov Decision Processes (MDPs).