In Reinforcement Learning, an agent interacts with its environment over a sequence of time steps. At each step $t$, the agent observes a state $S_t$, chooses an action $A_t$, receives a reward $R_{t+1}$, and transitions to a new state $S_{t+1}$. We've established the immediate feedback mechanism through the reward function $R$. However, the agent's objective isn't just about maximizing the immediate reward $R_{t+1}$. Intelligent behavior often requires considering the long-term consequences of actions. An action might yield a small immediate reward but lead to states offering much larger future rewards. Conversely, a large immediate reward might lead to unfavorable future states.
To capture this notion of long-term cumulative reward, we introduce the concept of the Return. The return, denoted $G_t$, is the total reward an agent expects to accumulate starting from time step $t$ onwards. The precise way we calculate the return depends on whether the task is episodic or continuing.
Episodic tasks are those that have a natural ending point. Think of games like chess or Pac-Man, where each game eventually terminates. In these tasks, the sequence of states, actions, and rewards forms an episode: $S_0, A_0, R_1, S_1, A_1, R_2, \ldots, S_{T-1}, A_{T-1}, R_T, S_T$, where $S_T$ is a special terminal state.
For an episodic task, the return $G_t$ starting from time step $t$ is simply the sum of all rewards received from step $t+1$ until the end of the episode at step $T$:
$$G_t = R_{t+1} + R_{t+2} + R_{t+3} + \dots + R_T$$
This can be written more compactly as:
$$G_t = \sum_{k=0}^{T-t-1} R_{t+k+1}$$
Here, $T$ is the final time step of the episode. The goal of the agent in an episodic task is to choose actions that maximize the expected value of this finite sum of future rewards for each episode.
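As a concrete sketch (the reward list and the helper name `episodic_return` are illustrative, not from the text above), the following computes $G_t$ by summing the rewards from step $t+1$ through the end of the episode:

```python
def episodic_return(rewards, t):
    """Undiscounted return G_t for an episodic task.

    rewards[k] is assumed to hold R_{k+1}, the reward received after the
    action at time step k, so the episode's rewards R_1, ..., R_T sit at
    indices 0, ..., T-1.
    """
    # G_t = R_{t+1} + R_{t+2} + ... + R_T
    return sum(rewards[t:])

# Example: a 4-step episode with rewards R_1..R_4 = 1, 0, 0, 10
rewards = [1, 0, 0, 10]
print(episodic_return(rewards, 0))  # G_0 = 11
print(episodic_return(rewards, 2))  # G_2 = 10
```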
Continuing tasks, on the other hand, do not have a terminal state and could potentially go on forever. Examples include controlling a robot performing an ongoing maintenance task or managing an investment portfolio. If we simply summed the rewards as we did for episodic tasks, the return $G_t$ could easily become infinite ($T \to \infty$), making it difficult to compare different strategies. How can we meaningfully define a cumulative reward that doesn't diverge?
This is where the concept of discounting becomes essential. We introduce a discount factor, denoted by the Greek letter gamma ($\gamma$), where $0 \le \gamma \le 1$. The idea is to give less weight to rewards received further in the future compared to immediate rewards.
The discounted return $G_t$ is defined as the sum of future rewards, where each reward $R_{t+k+1}$ is multiplied by $\gamma^k$:
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
Why does this help? If every reward is bounded by some value $R_{\max}$ and $\gamma < 1$, the geometric weighting keeps the infinite sum finite: $G_t \le \sum_{k=0}^{\infty} \gamma^k R_{\max} = \frac{R_{\max}}{1-\gamma}$. The return is therefore well defined even for tasks that never end.
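As a small numerical check (the constant reward stream, the value of $\gamma$, and the truncation horizon are illustrative assumptions), the sketch below approximates the infinite discounted sum and compares it with the analytic limit $\frac{R_{\max}}{1-\gamma}$:

```python
def discounted_return(rewards, gamma):
    """G_t = sum_k gamma^k * R_{t+k+1}, with rewards[k] = R_{t+k+1}."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

gamma = 0.9
horizon = 200                     # truncate the infinite sum for the demo
rewards = [1.0] * horizon         # a constant reward of 1 at every step

print(discounted_return(rewards, gamma))  # ~10.0, already very close to the limit
print(1.0 / (1.0 - gamma))                # analytic limit R_max / (1 - gamma) = 10.0
```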
Interestingly, the discounted return formula $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$ can be seen as a general case. For episodic tasks, we can consider them to end by transitioning to an absorbing terminal state that only yields rewards of 0 from then on. The formula still applies, and if $\gamma = 1$, it simplifies back to the undiscounted sum $G_t = \sum_{k=0}^{T-t-1} R_{t+k+1}$. However, using $\gamma < 1$ is common even in episodic tasks to encourage finding the terminal state faster or to ensure convergence in certain algorithms.
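To make this unified view concrete, here is a minimal sketch (the reward list and padding length are illustrative) in which a finished episode is followed by an absorbing terminal state that produces only zero rewards; with $\gamma = 1$ the discounted formula reproduces the plain episodic sum, and with $\gamma < 1$ it gives the discounted return:

```python
def discounted_return(rewards, gamma):
    """G_t = sum_k gamma^k * R_{t+k+1}, with rewards[k] = R_{t+k+1}."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# A finished episode followed by an absorbing terminal state:
# after termination every reward is 0, so appending zeros changes nothing.
episode_rewards = [1, 0, 0, 10]
padded = episode_rewards + [0] * 100   # "absorbing state" padding

print(discounted_return(padded, gamma=1.0))  # 11, same as the plain episodic sum
print(discounted_return(padded, gamma=0.9))  # 1 + 10 * 0.9**3 = 8.29
```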
Therefore, the fundamental goal in most RL problems framed within MDPs is to find a policy (a way of choosing actions) that maximizes the expected discounted return $\mathbb{E}[G_t]$ from each state $S_t$. This quantity, the expected return under a specific policy, is precisely what we will define as the value of a state or state-action pair in subsequent sections. Understanding the return $G_t$ is the first step towards evaluating how good different situations and actions are in the long run. We will explore the role of the discount factor $\gamma$ in more detail next.