In the previous section, we defined the 'Return', Gt, as the total reward an agent expects to accumulate starting from time step t. For tasks that have a definite end (episodic tasks), this is simply the sum of all rewards from t+1 until the episode terminates.
However, many interesting RL problems involve ongoing interactions without a natural endpoint (continuing tasks). Imagine an agent managing a power grid or a trading bot operating continuously. If we simply sum rewards indefinitely, the Return Gt could easily become infinite. Comparing infinite returns doesn't help us determine which policy is better. Furthermore, even in episodic tasks, should a reward received 100 steps into the future be valued the same as a reward received right now?
This brings us to the concept of discounting. We introduce a parameter, the discount factor, denoted by the Greek letter gamma (γ), to systematically reduce the value of future rewards.
The discount factor γ is a number between 0 and 1 (0≤γ≤1). The return Gt is now defined as the sum of discounted future rewards:
$$
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 R_{t+4} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}
$$

Let's break down this definition:
When γ is strictly less than 1, γ^k shrinks toward zero as k increases. This means rewards received further in the future contribute less to the total discounted return Gt.
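As a quick sketch, this sum can be computed directly in code for a finite (or truncated) reward sequence. The reward values and discount factors below are made up purely for illustration.

```python
def discounted_return(rewards, gamma):
    """Compute Gt = sum over k of gamma^k * R_{t+k+1}.

    `rewards` lists R_{t+1}, R_{t+2}, ... in the order they are received.
    """
    g = 0.0
    # Work backwards, using the recursive form Gt = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Illustrative rewards received after time step t.
rewards = [1.0, 0.0, 2.0, 3.0]
print(discounted_return(rewards, gamma=0.9))  # 1 + 0 + 0.81*2 + 0.729*3 = 4.807
print(discounted_return(rewards, gamma=0.5))  # 1 + 0 + 0.25*2 + 0.125*3 = 1.875
```

With the smaller γ, the later rewards contribute far less to Gt, even though the raw reward sequence is identical.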
Discounting serves two main purposes:

- It keeps returns bounded: with γ < 1 and bounded rewards, the infinite sum converges, so returns in continuing tasks stay finite and can be meaningfully compared across policies (a quick bound is sketched just after this list).
- It encodes time preference: a reward received now counts for more than the same reward received far in the future.
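The boundedness point follows from the geometric series: if every reward satisfies |R| ≤ Rmax and γ < 1, then

$$
|G_t| \le \sum_{k=0}^{\infty} \gamma^k R_{\max} = \frac{R_{\max}}{1-\gamma},
$$

so the discounted return remains finite no matter how long the interaction continues.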
The choice of γ significantly shapes the agent's behavior (a small numerical comparison follows this list):

- γ close to 0 makes the agent myopic: it cares almost exclusively about immediate rewards.
- γ close to 1 makes the agent farsighted: it weighs rewards many steps ahead almost as heavily as immediate ones.
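As a minimal sketch of this effect, consider two hypothetical reward streams: one pays a small reward immediately, the other pays a larger reward a few steps later. Which stream yields the higher return depends entirely on γ; all numbers here are invented for illustration.

```python
def discounted_return(rewards, gamma):
    """Gt for a finite reward sequence R_{t+1}, R_{t+2}, ..."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

# Hypothetical reward streams for two candidate behaviors.
impatient = [1.0, 0.0, 0.0, 0.0, 0.0]  # small reward now
patient   = [0.0, 0.0, 0.0, 0.0, 2.0]  # larger reward 4 steps later

for gamma in (0.1, 0.5, 0.99):
    g_imp = discounted_return(impatient, gamma)
    g_pat = discounted_return(patient, gamma)
    better = "patient" if g_pat > g_imp else "impatient"
    print(f"gamma={gamma}: impatient={g_imp:.3f}, patient={g_pat:.3f} -> {better} looks better")
```

A myopic agent (γ = 0.1 or 0.5) prefers the immediate reward, while a farsighted agent (γ = 0.99) prefers to wait for the larger one.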
The specific value of γ is often treated as a hyperparameter of the problem setup, chosen based on the nature of the task. For example, in financial applications, γ might relate to prevailing interest rates. In episodic tasks, γ=1 (no discounting) is sometimes used if the episodes are guaranteed to terminate, although using γ<1 can often lead to faster learning.
The following chart illustrates how the discount weight γ^k diminishes the value of rewards over future time steps k for different values of γ:
The weight given to a reward received k steps in the future decreases exponentially with k. A higher γ leads to slower decay, valuing future rewards more.
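The same decay can be tabulated numerically; the horizon and γ values below are arbitrary illustrative choices.

```python
# Weight gamma**k applied to a reward received k steps in the future.
gammas = (0.5, 0.9, 0.99)
steps = (0, 1, 5, 10, 50, 100)

print("    k" + "".join(f"   gamma={g:<4}" for g in gammas))
for k in steps:
    print(f"{k:5d}" + "".join(f"   {g**k:10.5f}" for g in gammas))
```

Under γ = 0.5 a reward 100 steps away is worth essentially nothing, while under γ = 0.99 it still retains roughly a third of its value (0.99^100 ≈ 0.37), which is why γ close to 1 is described as farsighted.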
In Reinforcement Learning, the ultimate objective is typically to find a policy π that maximizes the expected discounted return from each state. The concept of discounting is therefore fundamental to defining the very goal we are trying to achieve and enables the formalisms, like the value functions we'll discuss next, that allow us to solve MDPs.