Training Language Models to Follow Instructions with Human Feedback, Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe, 2022. arXiv preprint arXiv:2203.02155. DOI: 10.48550/arXiv.2203.02155 - Introduces the InstructGPT models and details the RLHF pipeline, including the KL divergence penalty against the supervised policy used to align language models with human preferences.
Proximal Policy Optimization Algorithms, John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov, 2017. arXiv preprint arXiv:1707.06347. DOI: 10.48550/arXiv.1707.06347 - Presents the foundational Proximal Policy Optimization (PPO) algorithm, the policy-gradient method used for the policy-update step of RLHF (its clipped objective is sketched after this list).
Learning to Summarize from Human Feedback, Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, Paul Christiano, 2020. NeurIPS 2020. DOI: 10.48550/arXiv.2009.01325 - An early and influential demonstration of RLHF for text summarization, explicitly using a KL divergence penalty against the supervised policy to maintain text quality (the penalized reward is sketched after this list).
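The KL-shaped reward described in the Ouyang et al. and Stiennon et al. entries above has the form R(x, y) = r_theta(x, y) - beta * log[pi_RL(y|x) / pi_SFT(y|x)]. The following is a minimal illustrative sketch of that formula, not code from either paper; the function name, the scalar inputs, and the beta value are assumptions made for illustration.

    def kl_penalized_reward(reward_model_score, logprob_policy, logprob_sft, beta=0.02):
        """KL-penalized RLHF reward:
        R(x, y) = r_theta(x, y) - beta * (log pi_RL(y|x) - log pi_SFT(y|x)).
        Inputs are the reward-model score for a sampled response y and the summed
        log-probabilities of y under the current policy and the frozen SFT model;
        beta is an illustrative coefficient, not a value taken from the papers.
        """
        kl_term = logprob_policy - logprob_sft
        return reward_model_score - beta * kl_term

    # If the policy assigns the response a much higher log-probability than the
    # SFT model does, the penalty lowers the reward, discouraging drift.
    print(kl_penalized_reward(reward_model_score=1.3, logprob_policy=-40.0, logprob_sft=-55.0))
    # 1.3 - 0.02 * 15.0 = 1.0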
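For the Schulman et al. entry, the clipped surrogate objective at the core of PPO can be sketched as below. This is a minimal per-batch illustration following the paper's notation, not its reference implementation; the function name and the example inputs are assumed placeholders, and clip_eps=0.2 is the value used in the paper's experiments.

    import numpy as np

    def ppo_clipped_objective(logprob_new, logprob_old, advantages, clip_eps=0.2):
        """Clipped surrogate objective L^CLIP from PPO (Schulman et al., 2017):
        L = E[min(r * A, clip(r, 1 - eps, 1 + eps) * A)], with r = pi_new / pi_old.
        Maximizing this keeps the updated policy close to the policy that
        collected the data, since the ratio r is clipped to [1 - eps, 1 + eps].
        """
        ratio = np.exp(logprob_new - logprob_old)                  # probability ratio r
        clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)  # clipped ratio
        return np.minimum(ratio * advantages, clipped * advantages).mean()

    # Illustrative per-token log-probabilities and advantage estimates.
    objective = ppo_clipped_objective(
        logprob_new=np.array([-1.1, -0.7]),
        logprob_old=np.array([-1.3, -0.9]),
        advantages=np.array([0.5, -0.2]),
    )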