Proximal Policy Optimization Algorithms, John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov, 2017, arXiv, DOI: 10.48550/arXiv.1707.06347 - This foundational paper introduces the Proximal Policy Optimization (PPO) algorithm, providing the theoretical basis for constraining policy updates, via a clipped surrogate objective or an adaptive KL penalty, to maintain stability during reinforcement learning.
Training language models to follow instructions with human feedback, Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe, 2022, arXiv preprint arXiv:2203.02155, DOI: 10.48550/arXiv.2203.02155 - This influential paper details the application of PPO with a KL divergence penalty for training large language models from human feedback, demonstrating its effectiveness in aligning models with human preferences.
Learning to summarize with human feedback, Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, Paul Christiano, 2020, NeurIPS 2020, DOI: 10.48550/arXiv.2009.01325 - An early application of reinforcement learning from human feedback to text summarization, this work highlights the use of a KL divergence penalty to balance reward maximization with maintaining coherence and quality in generated text.
TRL (Transformer Reinforcement Learning) Library Documentation, Hugging Face, 2023 - The official documentation for Hugging Face's TRL library, which provides practical implementations of RLHF, including the PPO trainer with KL regularization and the adaptive KL controller mentioned in the text.
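
As a companion to these references, the sketch below illustrates the two ideas they share: a per-token KL penalty subtracted from the reward-model score, and an adaptive controller that nudges the KL coefficient toward a target. It is a minimal illustration, not TRL's actual API; the class and function names, the default values, and the proportional update rule (clipped error scaled by a horizon) are assumptions modeled on common RLHF implementations.

```python
import numpy as np

# Minimal sketch of a KL-penalized reward and an adaptive KL coefficient.
# Names and default values are illustrative, not TRL's actual API.

class AdaptiveKLController:
    """Adjusts the KL coefficient so the observed KL tracks a target value."""

    def __init__(self, init_kl_coef=0.2, target_kl=6.0, horizon=10_000):
        self.kl_coef = init_kl_coef      # beta in the penalized reward
        self.target_kl = target_kl       # desired per-sequence KL
        self.horizon = horizon           # controls how quickly beta adapts

    def update(self, observed_kl, n_samples):
        # Proportional error, clipped so one batch cannot swing beta wildly.
        error = np.clip(observed_kl / self.target_kl - 1.0, -0.2, 0.2)
        self.kl_coef *= 1.0 + error * n_samples / self.horizon


def kl_penalized_rewards(scores, logprobs_policy, logprobs_ref, kl_coef):
    """Per-token reward: -kl_coef * (log pi - log pi_ref), with the scalar
    reward-model score added on the final token of each response."""
    kl = logprobs_policy - logprobs_ref        # per-token KL estimate
    rewards = -kl_coef * kl
    rewards[:, -1] += scores                   # reward-model score on last token
    return rewards, kl.sum(axis=-1).mean()     # also return mean sequence KL


# Example usage with dummy shapes: a batch of 4 responses, 16 tokens each.
controller = AdaptiveKLController()
logp_policy = np.random.randn(4, 16) * 0.1
logp_ref = np.random.randn(4, 16) * 0.1
scores = np.array([0.8, -0.2, 1.1, 0.3])      # reward-model outputs
rewards, mean_kl = kl_penalized_rewards(scores, logp_policy, logp_ref, controller.kl_coef)
controller.update(mean_kl, n_samples=4)
```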