Proximal Policy Optimization Algorithms, John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov, 2017. arXiv preprint arXiv:1707.06347. DOI: 10.48550/arXiv.1707.06347 - Introduces the Proximal Policy Optimization (PPO) algorithm, a method for stable and efficient policy optimization in deep reinforcement learning, highly relevant for RLHF.
Training language models to follow instructions with human feedback, Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe, 2022. arXiv preprint arXiv:2203.02155. DOI: 10.48550/arXiv.2203.02155 - Demonstrates the application of Reinforcement Learning from Human Feedback (RLHF) with PPO to align large language models with human preferences, yielding the InstructGPT models.
Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto, 2018. MIT Press - Provides a thorough introduction to reinforcement learning concepts, essential for understanding the theoretical basis of policy optimization algorithms such as PPO.