The 'Return', $G_t$, is defined as the total reward an agent accumulates starting from time step $t$. For tasks that have a definite end (episodic tasks), this is simply the sum of all rewards from step $t+1$ until the episode terminates.
However, many interesting RL problems involve ongoing interactions without a natural endpoint (continuing tasks). Imagine an agent managing a power grid or a trading bot operating continuously. If we simply sum rewards indefinitely, the return $G_t$ could easily become infinite. Comparing infinite returns doesn't help us determine which policy is better. Furthermore, even in episodic tasks, should a reward received 100 steps into the future be valued the same as a reward received right now?
This brings us to the concept of discounting. We introduce a parameter, the discount factor, denoted by the Greek letter gamma ($\gamma$), to systematically reduce the value of future rewards.
The discount factor $\gamma$ is a number between 0 and 1 ($0 \le \gamma \le 1$). The return $G_t$ is now defined as the sum of discounted future rewards:
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 R_{t+4} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
Let's break down this definition:
- The reward received one step later, $R_{t+1}$, is added directly (multiplied by $\gamma^0 = 1$).
- The reward received two steps later, $R_{t+2}$, is multiplied by $\gamma$.
- The reward received three steps later, $R_{t+3}$, is multiplied by $\gamma^2$.
- And so on: the reward received $k+1$ steps later, $R_{t+k+1}$, is multiplied by $\gamma^k$.
Because $\gamma^k$ shrinks as $k$ increases (whenever $\gamma < 1$), rewards received further in the future contribute less to the total discounted return $G_t$.
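To make the definition concrete, here is a minimal Python sketch (the reward values are chosen purely for illustration) that computes the discounted return for a finite list of future rewards:

```python
def discounted_return(rewards, gamma):
    """Compute G_t = sum_k gamma^k * R_{t+k+1} for a finite list of
    future rewards [R_{t+1}, R_{t+2}, ...]."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Example: three future rewards of 1.0 each, discounted with gamma = 0.9.
print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))  # 1.0 + 0.9 + 0.81 = 2.71
```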
Why Discount?
Discounting serves several important purposes:
- Mathematical Convenience: In continuing tasks, discounting ensures that the infinite sum defining $G_t$ converges to a finite value, provided the rewards are bounded. If the reward at every step is guaranteed to be no larger than some value $R_{\max}$, then the return is guaranteed to satisfy $G_t \le R_{\max}/(1-\gamma)$, using the formula for the sum of an infinite geometric series (see the numerical check after this list). This makes comparing different policies mathematically tractable.
- Modeling Preference for Near-Term Rewards: Often, receiving a reward sooner is preferable to receiving the same reward later. Think about financial interest rates or biological imperatives for immediate survival. Discounting naturally models this preference.
- Handling Uncertainty: The future is inherently uncertain. The agent's model of the environment might be imperfect, or the environment itself might change unexpectedly. Applying a discount factor implicitly accounts for this increasing uncertainty over longer time horizons; rewards far in the future are less certain and thus valued less.
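The geometric-series bound mentioned in the first point can be checked numerically. This sketch (with $R_{\max}$ and $\gamma$ chosen arbitrarily) truncates the infinite sum at a long horizon and compares it to $R_{\max}/(1-\gamma)$:

```python
gamma = 0.9
r_max = 1.0

# Truncate the infinite sum sum_k gamma^k * R_max at a large horizon.
truncated = sum((gamma ** k) * r_max for k in range(1000))
bound = r_max / (1 - gamma)

print(truncated)  # ~10.0, approaching the bound from below
print(bound)      # 10.0
```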
Choosing the Value of Gamma ($\gamma$)
The choice of $\gamma$ significantly impacts the agent's behavior (a small numerical comparison follows this list):
- If $\gamma = 0$: The agent becomes completely 'myopic'. It only cares about maximizing the immediate reward $R_{t+1}$ and ignores all future consequences of its actions. The return is simply $G_t = R_{t+1}$.
- If $\gamma$ is close to 1 (e.g., 0.99): The agent is 'far-sighted'. Future rewards are valued almost as highly as immediate rewards, and the agent weighs the long-term consequences of its actions more heavily.
- If $\gamma$ is between 0 and 1: The agent balances immediate and future rewards. Smaller values prioritize short-term gain, while larger values emphasize long-term accumulation.
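The contrast between myopic and far-sighted behavior can be seen directly by computing the return of the same reward sequence under different discount factors. The sequence below is chosen purely for illustration, with a small immediate reward and a large delayed one:

```python
def discounted_return(rewards, gamma):
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# A small immediate reward followed by a large delayed reward (illustrative values).
rewards = [1.0, 0.0, 0.0, 0.0, 10.0]

for gamma in (0.0, 0.5, 0.99):
    print(gamma, discounted_return(rewards, gamma))
# gamma = 0.0  -> 1.0   (only the immediate reward counts)
# gamma = 0.5  -> 1.625 (the delayed reward is heavily discounted: 10 * 0.5^4 = 0.625)
# gamma = 0.99 -> ~10.6 (the delayed reward dominates: 10 * 0.99^4 ≈ 9.61)
```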
The specific value of $\gamma$ is often treated as a hyperparameter of the problem setup, chosen based on the nature of the task. For example, in financial applications, $\gamma$ might relate to prevailing interest rates. In episodic tasks, $\gamma = 1$ (no discounting) is sometimes used if the episodes are guaranteed to terminate, although using $\gamma < 1$ can often lead to faster learning.
The following chart illustrates how the discount weight $\gamma^k$ diminishes the value of rewards over future time steps $k$ for different values of $\gamma$:
The weight given to a reward received $k$ steps in the future decreases exponentially with $k$. A higher $\gamma$ leads to slower decay, valuing future rewards more.
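In place of the chart, the same decay can be tabulated directly; this small sketch prints the weight $\gamma^k$ for a few values of $\gamma$ and $k$:

```python
gammas = (0.5, 0.9, 0.99)
steps = (0, 1, 5, 10, 50, 100)

for gamma in gammas:
    weights = [round(gamma ** k, 4) for k in steps]
    print(f"gamma={gamma}: {weights}")
# gamma=0.5:  weights fall to ~0.001 by k=10 and are effectively zero beyond that
# gamma=0.9:  ~0.35 at k=10, ~0.005 at k=50
# gamma=0.99: still ~0.37 at k=100
```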
In Reinforcement Learning, the ultimate objective is typically to find a policy $\pi$ that maximizes the expected discounted return from each state. The concept of discounting is therefore fundamental to defining the very goal we are trying to achieve, and it enables the formalisms, like the value functions we'll discuss next, that allow us to solve MDPs.
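One property worth noting as a bridge to those value functions: the definition above implies the recursion $G_t = R_{t+1} + \gamma G_{t+1}$. Here is a minimal sketch (with illustrative reward values) that computes the returns of a finite episode backwards using this recursion:

```python
def returns_from_rewards(rewards, gamma):
    """Compute G_t for every step of a finite episode via the recursion
    G_t = R_{t+1} + gamma * G_{t+1}, working backwards from the end.
    rewards[t] plays the role of R_{t+1}."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

print(returns_from_rewards([1.0, 2.0, 3.0], gamma=0.9))
# ≈ [5.23, 4.7, 3.0]: each entry is that step's reward plus gamma times the next return.
```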