Previous chapters concentrated on reinforcement learning agents that improve through active interaction with an environment. This chapter shifts to Offline Reinforcement Learning, often called Batch RL. Here, the challenge is to learn effective policies solely from a static, pre-collected dataset of transitions, without any opportunity for further environment interaction.
We will begin by outlining the motivations for offline learning and highlighting its key differences from the online and off-policy settings discussed earlier. A central difficulty in Offline RL is distributional shift: the mismatch between the distribution of states and actions in the fixed dataset (generated by some behavior policy π_b) and the distributions induced by the new policies being evaluated or learned.
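To make the idea concrete before the detailed treatment in Section 7.3, here is a minimal sketch, not taken from the chapter, of a toy one-state problem: a value estimate fit to data from a narrow behavior policy is accurate where the data lives, but can be badly wrong for actions the dataset never covers. The names (true_q, q_hat) and the polynomial regressor are illustrative assumptions, not the chapter's code.

```python
import numpy as np

rng = np.random.default_rng(0)

# True (unknown) action-value function for a single fixed state;
# the best action is a = 0.5.
def true_q(a):
    return -(a - 0.5) ** 2

# Static dataset from a hypothetical behavior policy pi_b whose actions
# cluster around a = 0.2, so only a narrow slice of the action space is covered.
actions_b = rng.normal(loc=0.2, scale=0.05, size=200)
returns_b = true_q(actions_b) + rng.normal(scale=0.02, size=200)

# Fit a Q-value estimate using only the fixed dataset (a stand-in for any
# function approximator trained offline).
q_hat = np.polynomial.Polynomial.fit(actions_b, returns_b, deg=5)

# Query the estimate inside and outside the data distribution.
print(f"in-distribution     a=0.2: est={q_hat(0.2):+.3f}  true={true_q(0.2):+.3f}")
print(f"out-of-distribution a=0.9: est={q_hat(0.9):+.3f}  true={true_q(0.9):+.3f}")

# The out-of-distribution estimate can be wildly wrong, and a greedy policy
# improvement step would exploit exactly that error. This gap between where
# the data was collected and where the new policy wants to go is the
# distributional shift problem in offline RL.
```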
This chapter will cover:
7.1 Introduction to Offline RL (Batch RL)
7.2 Differences from Online and Off-Policy RL
7.3 Challenge: Distributional Shift
7.4 Off-Policy Evaluation in the Offline Setting
7.5 Importance Sampling and its Limitations
7.6 Fitted Q-Iteration (FQI) Approaches
7.7 Policy Constraint Methods
7.8 Batch-Constrained Deep Q-learning (BCQ)
7.9 Value Regularization Methods
7.10 Conservative Q-Learning (CQL)
7.11 Offline RL Implementation Considerations
7.12 Offline RL Algorithm Practice