Previous chapters detailed methods where agents learn through continuous interaction with their environment, collecting data and updating their strategies in a loop. We now turn our attention to a different, yet increasingly important, learning paradigm: Offline Reinforcement Learning, also commonly referred to as Batch Reinforcement Learning.
Imagine scenarios where deploying an exploratory RL agent directly into the real system is impractical, unsafe, or prohibitively expensive.
In such scenarios, and many others, we are presented with a fixed dataset of interactions collected beforehand, possibly by a different policy or even multiple policies. The objective is to leverage this static dataset to learn the best possible policy without any further interaction with the environment during the learning process. This is the core idea behind Offline RL.
Formally, the setup for Offline RL is distinct from the online setting. We are given a static dataset $D$, which consists of a collection of transitions:
$$D = \{(s_i, a_i, r_i, s'_i)\}_{i=1}^{N}$$

Here, $s_i$ is the state, $a_i$ is the action taken in that state, $r_i$ is the received reward, and $s'_i$ is the resulting next state. This dataset $D$ was collected using one or more behavior policies, collectively denoted as $\pi_b$. Importantly, we might not know the exact policy $\pi_b$ that generated the data. The crucial constraint is that our learning algorithm can only use the transitions present in $D$; it cannot query the environment for the outcome of new state-action pairs.
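To make this setup concrete, the sketch below stores such a dataset as parallel arrays and draws training minibatches from it. The class name, array layout, and discrete action type are illustrative assumptions rather than a prescribed format.

```python
import numpy as np

# A minimal sketch of a static offline dataset D = {(s_i, a_i, r_i, s'_i)},
# stored as parallel NumPy arrays. Names and shapes are assumptions for
# illustration, not a standard API.
class OfflineDataset:
    def __init__(self, states, actions, rewards, next_states):
        self.states = np.asarray(states, dtype=np.float32)
        self.actions = np.asarray(actions, dtype=np.int64)
        self.rewards = np.asarray(rewards, dtype=np.float32)
        self.next_states = np.asarray(next_states, dtype=np.float32)

    def __len__(self):
        return len(self.states)

    def sample(self, batch_size, rng=None):
        # Minibatches come only from the fixed transitions collected by the
        # behavior policy; the learner never queries the environment.
        rng = rng or np.random.default_rng()
        idx = rng.integers(0, len(self), size=batch_size)
        return (self.states[idx], self.actions[idx],
                self.rewards[idx], self.next_states[idx])
```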
The goal is to use this dataset $D$ to learn a target policy $\pi$ that maximizes the expected cumulative reward when deployed in the actual environment.
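Expressed with a discount factor $\gamma$ (assuming the standard discounted formulation), the objective can be written as

$$\pi^{*} = \arg\max_{\pi}\ \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t}\right],$$

where the expectation is over trajectories generated by running $\pi$ in the real environment, even though $\pi$ itself must be learned from $D$ alone.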
The Offline Reinforcement Learning process: Data is collected from environment interactions using behavior policies, forming a static dataset. The Offline RL algorithm learns a new policy solely from this dataset, without further environment interaction.
It's important to distinguish Offline RL from related concepts such as standard off-policy learning, which still interacts with the environment while learning from a replay buffer, and imitation learning, which aims to reproduce the behavior policy rather than improve upon it.
Why is learning purely from a static dataset so challenging? The primary obstacle is distributional shift. The fixed dataset $D$ reflects the state-action visitation frequency induced by the behavior policy $\pi_b$. If our learning algorithm tries to evaluate or improve a target policy $\pi$ that behaves differently from $\pi_b$, $\pi$ might favor actions or lead to states that are poorly represented or entirely absent in $D$.
Consider training a Q-network on the offline dataset $D$ with a standard off-policy algorithm like DQN. The Bellman update involves terms like $\max_{a'} Q(s', a')$. If the state $s'$ appears in the dataset but the maximizing action $a^{*} = \arg\max_{a'} Q(s', a')$ was never taken by $\pi_b$ in state $s'$, the value $Q(s', a^{*})$ is estimated purely by function-approximation extrapolation. Neural networks are notoriously unreliable when extrapolating outside the distribution of their training data. This can lead to wildly inaccurate Q-value estimates, often significant overestimation, causing the learning process to diverge or converge to a poor policy. This failure mode, which stems from querying the values of actions not present in the batch for particular states, is a direct consequence of distributional shift.
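The sketch below shows where this problem enters: a standard Bellman target computed from an offline minibatch. Here `q_network` is an assumed callable returning Q-values for every action; the max over actions is the step that can query out-of-distribution actions.

```python
import numpy as np

def bellman_targets(q_network, rewards, next_states, gamma=0.99):
    # q_network is an assumed function approximator: it maps a batch of
    # states to an array of Q-values with shape (batch_size, num_actions).
    q_next = q_network(next_states)

    # The max ranges over every action, including actions pi_b never took
    # in these states. For those actions, Q(s', a') is pure extrapolation
    # by the network and is frequently overestimated, which then propagates
    # into the targets and can destabilize offline training.
    max_q_next = q_next.max(axis=1)

    return rewards + gamma * max_q_next
```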
Addressing this challenge is the central theme of modern Offline RL research. The subsequent sections will examine techniques developed specifically to handle distributional shift, including methods for evaluating policies offline (Offline Policy Evaluation) and algorithms designed for robust offline policy optimization.