As we've seen, reinforcement learning centers around an agent interacting with an environment over time, receiving observations (states) and rewards, and choosing actions based on its policy. However, not all RL problems unfold in the same way. The structure of the agent-environment interaction loop can differ, primarily falling into two categories: episodic tasks and continuing tasks. Understanding this distinction is important because it affects how we define the agent's goal and measure its success.
Many RL problems have interactions that naturally break down into subsequences or segments. Think of playing a game of chess, navigating a maze to find an exit, or a robot assembling a specific part. Each game, each maze run, or each assembly represents a self-contained unit of interaction. These units are called episodes.
An episodic task is characterized by the existence of one or more terminal states. When the agent reaches a terminal state, the current episode ends. After the episode concludes, the environment is typically reset, and a new episode begins, often starting from a standard initial state or a distribution of possible starting states.
Consider a simple grid world where an agent needs to navigate from a starting point 'S' to a goal 'G'.
The flow of an episodic task. Each interaction sequence runs until a terminal state is reached, then resets for the next episode.
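To make the episode structure concrete, here is a minimal sketch in Python of an episodic interaction loop. The `GridWorld` class, its reward values, and the random action choice are illustrative assumptions rather than part of any particular RL library; the point is the reset, act, terminate pattern.

```python
import random

# A minimal, hypothetical grid world for illustration only: the agent starts
# at cell 0 ('S') and must reach the last cell ('G'), which is terminal.
class GridWorld:
    def __init__(self, size=5):
        self.size = size
        self.state = 0

    def reset(self):
        # Each new episode starts from the standard initial state 'S'.
        self.state = 0
        return self.state

    def step(self, action):
        # action: -1 = move left, +1 = move right (clipped to the grid).
        self.state = min(max(self.state + action, 0), self.size - 1)
        done = self.state == self.size - 1   # reaching 'G' ends the episode
        reward = 1.0 if done else -0.1       # small step cost, bonus at the goal
        return self.state, reward, done

env = GridWorld()
for episode in range(3):
    state = env.reset()                      # reset marks the start of a new episode
    done = False
    while not done:                          # the episode runs until a terminal state
        action = random.choice([-1, +1])     # a random policy, purely for illustration
        state, reward, done = env.step(action)
```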
In episodic tasks, the agent's objective is typically to maximize the total reward accumulated over the course of a single episode. This sum of rewards within an episode is often called the return. Since each episode has a finite length, the return is well-defined. We might evaluate the agent's performance by averaging the return over many episodes.
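As a quick illustration of measuring performance this way, the snippet below sums the rewards from a few hypothetical episodes and averages the resulting returns; the reward values are made up for the example.

```python
# Each episode yields a finite sequence of rewards, so its (undiscounted)
# return is simply their sum; performance is the average return over episodes.
episode_rewards = [                      # hypothetical rewards from 3 episodes
    [-0.1, -0.1, 1.0],
    [-0.1, -0.1, -0.1, -0.1, 1.0],
    [-0.1, 1.0],
]

returns = [sum(rewards) for rewards in episode_rewards]
average_return = sum(returns) / len(returns)

print(returns)          # approximately [0.8, 0.6, 0.9] (up to floating-point rounding)
print(average_return)   # approximately 0.77
```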
Examples of episodic tasks include:

- Playing a game such as chess, where each game ends in a win, loss, or draw.
- Navigating a maze, where the episode ends when the agent reaches the exit.
- A robot assembling a specific part, where each completed assembly ends the episode.
In contrast to episodic tasks, some RL problems involve interactions that do not have a natural endpoint. The agent-environment interaction goes on continuously without breaking into identifiable episodes. Think of a system designed to continuously manage a chemical process, an algorithm making ongoing trading decisions in a financial market, or a robot powered by solar energy that needs to manage its power levels indefinitely.
These are continuing tasks. There are no terminal states. The interaction sequence can, in principle, continue forever.
The flow of a continuing task. The interaction proceeds indefinitely without designated terminal states or resets.
This poses a challenge: if the interaction never ends, how do we define the total accumulated reward? Summing rewards over an infinite sequence might lead to an infinite value, making it difficult to compare different policies.
To handle this, we often introduce the concept of discounting. Instead of simply summing rewards, we calculate a discounted return, where rewards received further in the future are given less weight than immediate rewards. We use a discount factor, typically denoted by the Greek letter gamma ($\gamma$), where $0 \le \gamma < 1$. The goal becomes maximizing the sum of discounted rewards:
$$
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}
$$

Here, $G_t$ is the discounted return starting from time step $t$, and $R_{t+k+1}$ is the reward received $k+1$ steps into the future. By using a discount factor $\gamma < 1$, we ensure that this sum remains finite even if the interaction continues forever (assuming rewards are bounded). Discounting also often makes sense intuitively: immediate rewards might be more valuable than rewards far off in the future. We will explore return calculations and discounting in much more detail in the next chapter on Markov Decision Processes.
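The short sketch below computes a discounted return for a finite stream of rewards (assumed to begin at $R_{t+1}$); the discount factor and constant reward stream are illustrative choices, not values from any particular problem.

```python
# Discounted return for a stream of rewards: a reward k steps ahead is
# weighted by gamma**k, so the sum stays bounded when gamma < 1 and the
# rewards themselves are bounded.
def discounted_return(rewards, gamma=0.9):
    # rewards[0] plays the role of R_{t+1}, rewards[1] of R_{t+2}, and so on.
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# With a constant reward of 1 per step, the discounted return approaches
# 1 / (1 - gamma) = 10 as the horizon grows, instead of diverging.
print(discounted_return([1.0] * 10))     # ~6.51
print(discounted_return([1.0] * 1000))   # ~10.0
```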
Examples of continuing tasks include:

- Continuously managing a chemical process to keep it operating as desired.
- Making ongoing trading decisions in a financial market.
- A solar-powered robot managing its power levels indefinitely.
Identifying whether a task is episodic or continuing is a fundamental step in setting up an RL problem.
When approaching a new problem, one of the first questions to ask is: "Does the interaction have a natural end point?" This will guide how you frame the agent's objective and select appropriate learning techniques.