Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto, 2018 (MIT Press) - Comprehensive introduction to reinforcement learning fundamentals, including Markov Decision Processes, policy gradient methods, and value functions. Essential for foundational understanding.
Proximal Policy Optimization Algorithms, John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov, 2017arXiv preprint arXiv:1707.06347DOI: 10.48550/arXiv.1707.06347 - Introduces the Proximal Policy Optimization (PPO) algorithm, detailing its clipped surrogate objective and advantages for stable policy optimization, which is central to RLHF.
Training language models to follow instructions with human feedback, Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe, 2022arXiv preprint arXiv:2203.02155DOI: 10.48550/arXiv.2203.02155 - Describes the Reinforcement Learning from Human Feedback (RLHF) pipeline, specifically detailing the use of PPO with a KL divergence penalty for aligning language models with human preferences.