Previous chapters concentrated on reinforcement learning agents that improve through active interaction with an environment. This chapter shifts to Offline Reinforcement Learning, often called Batch RL. Here, the challenge is to learn effective policies solely from a static, pre-collected dataset of transitions, without any opportunity for further environment interaction.
We will begin by outlining the motivations for offline learning and highlighting its key differences from the online and off-policy settings discussed earlier. A central difficulty in Offline RL is distributional shift: the mismatch between the distribution of states and actions in the fixed dataset (generated by some behavior policy πb) and the distributions induced by the new policies being evaluated or learned.
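To make distributional shift concrete before the formal treatment, here is a minimal sketch of the effect in a single-state problem. It is a hypothetical illustration, not an algorithm from this chapter: the reward numbers, the narrow behavior policy, and the use of per-action empirical means as a stand-in for a learned value function are all assumptions chosen only to show how poor data coverage makes estimates for rarely taken actions unreliable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Single-state problem with 5 actions; the true expected reward is highest
# for action 4. (Hypothetical numbers, for illustration only.)
true_reward = np.array([0.0, 0.1, 0.2, 0.3, 1.0])

# The behavior policy pi_b that generated the dataset almost never selects
# actions 3 and 4, so the static dataset barely covers them.
behavior_probs = np.array([0.45, 0.35, 0.18, 0.015, 0.005])
actions = rng.choice(5, size=200, p=behavior_probs)
rewards = true_reward[actions] + rng.normal(0.0, 0.5, size=200)

# Value estimates from the fixed dataset: empirical mean reward per action,
# standing in for a fitted Q-function.
counts = np.bincount(actions, minlength=5)
q_hat = np.array([
    rewards[actions == a].mean() if counts[a] > 0 else 0.0
    for a in range(5)
])

print("samples per action:", counts)
print("estimated Q-values:", np.round(q_hat, 2))
print("true Q-values     :", true_reward)

# A new policy acting greedily on q_hat may commit to an action whose
# estimate rests on very few (or zero) samples. Because pi_b rarely chose
# that action, the dataset cannot tell us its value: distributional shift.
print("greedy action from data:", int(q_hat.argmax()))
```

Running the sketch, actions 3 and 4 receive only a handful of samples (often none), so any policy that deviates from the behavior policy toward those actions is evaluated on estimates the data cannot support.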
This chapter will cover: