Let's begin by revisiting the basic structure of a Reinforcement Learning problem. As mentioned in the chapter introduction, RL is fundamentally about learning through interaction. At its heart, the RL setup involves two primary components: the agent and the environment.
This interaction unfolds over a sequence of discrete time steps, t=0,1,2,.... At each time step t, the agent observes the current situation of the environment, which is represented as a state, denoted by St. A state is a description of the environment that provides the necessary information for the agent to make a decision. This could be anything from the position of pieces on a chessboard to sensor readings from a robot or pixel data from a game screen.
Based on the observed state St, the agent selects an action, At, from a set of available actions. An action is a choice the agent can make to influence the environment. Examples include moving a robot arm, making a move in a game, or adjusting a thermostat.
After the agent performs action At in state St, the environment transitions to a new state, St+1, at the next time step. Along with the new state, the environment provides a numerical reward, Rt+1, to the agent. This reward signal indicates how good or bad the action At taken in state St was in the short term. Positive rewards typically correspond to desirable outcomes, while negative rewards (or costs) signify undesirable ones.
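To make this cycle concrete, here is a minimal hand-written environment sketch; the states, actions, and reward values are invented purely for illustration:

```python
# A tiny, invented environment: states 0..3 on a line, actions "left"/"right".
# Reaching state 3 yields reward +1; every other transition costs -0.01.
def step(state, action):
    """Given S_t and A_t, return (S_{t+1}, R_{t+1})."""
    move = 1 if action == "right" else -1
    next_state = min(max(state + move, 0), 3)   # stay within states 0..3
    reward = 1.0 if next_state == 3 else -0.01
    return next_state, reward

print(step(2, "right"))  # (3, 1.0)
```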
The agent's behavior is defined by its policy, denoted by π. The policy is the agent's strategy, mapping states to actions. It dictates which action the agent chooses when it finds itself in a particular state. A policy can be deterministic, written as a=π(s), meaning it always outputs the same action for a given state. Or it can be stochastic, written as π(a∣s)=P(At=a∣St=s), specifying the probability of taking each possible action a in state s.
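As a rough sketch of this distinction (the action set and probabilities below are made up for illustration), a deterministic policy is a plain function of the state, while a stochastic policy samples from a distribution over actions:

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_ACTIONS = 4  # hypothetical discrete action set: up, right, down, left

def deterministic_policy(state: int) -> int:
    """a = pi(s): the same state always yields the same action."""
    return state % NUM_ACTIONS  # toy rule, for illustration only

def stochastic_policy(state: int) -> int:
    """Samples an action from pi(a|s), a probability distribution over actions."""
    probs = np.array([0.1, 0.6, 0.1, 0.2])  # made-up preferences for this state
    return int(rng.choice(NUM_ACTIONS, p=probs))
```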
The overarching goal of the agent is not just to maximize the immediate reward Rt+1, but to maximize the total amount of reward it accumulates over the long run. This cumulative reward, starting from time step t, is often called the return, denoted Gt. A common formulation is the discounted return:
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

Here, γ is the discount factor, a number between 0 and 1 (0 ≤ γ ≤ 1). It determines the present value of future rewards. A γ close to 0 makes the agent "myopic," focusing mainly on immediate rewards, while a γ close to 1 makes the agent "farsighted," striving for high total reward over the long term. The discount factor also keeps the return finite in tasks that might continue indefinitely, provided γ < 1 and the rewards are bounded.
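As a small sketch (the helper name and reward values are illustrative), the discounted return can be computed from a finite list of observed rewards by working backwards with the recursion G_t = R_{t+1} + γ G_{t+1}:

```python
def discounted_return(rewards, gamma):
    """Compute G_t = sum_k gamma^k * R_{t+k+1} for a finite reward sequence.

    `rewards` holds R_{t+1}, R_{t+2}, ... collected after time step t.
    """
    g = 0.0
    for r in reversed(rewards):   # apply G_t = R_{t+1} + gamma * G_{t+1}
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
```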
This continuous cycle of state-action-reward forms the core interaction loop in Reinforcement Learning.
Figure: The agent-environment interaction loop. The agent observes the state, selects an action, the environment transitions to a new state and provides a reward, which the agent then observes to inform future actions.
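In code, this loop might look like the following sketch. It assumes a Gymnasium-style reset/step interface and uses a random policy as a placeholder for the agent; the environment name is just an example:

```python
import gymnasium as gym

env = gym.make("CartPole-v1")          # example environment
state, info = env.reset(seed=0)        # observe the initial state S_0
total_reward = 0.0

for t in range(200):
    # The agent observes S_t and selects A_t (random placeholder policy).
    action = env.action_space.sample()

    # The environment transitions to S_{t+1} and emits R_{t+1}.
    next_state, reward, terminated, truncated, info = env.step(action)

    total_reward += reward
    state = next_state
    if terminated or truncated:        # the episode has ended
        break

env.close()
print("Undiscounted return for this episode:", total_reward)
```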
Understanding these fundamental components (agent, environment, state, action, reward, and policy) is essential. The algorithms we'll discuss in subsequent chapters, like Q-learning, DQN, REINFORCE, and Actor-Critic methods, all operate within this framework, providing different ways for the agent to learn an effective policy that maximizes its expected return. The limitations of representing policies or value functions directly in tables for large problems, which we'll touch upon next, motivate the function approximation techniques central to this course.