Deep Reinforcement Learning from Human Preferences, Paul Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei, 2017. Advances in Neural Information Processing Systems (NeurIPS), Vol. 30 (Curran Associates, Inc.). DOI: 10.48550/arXiv.1706.03741 - This paper introduces the method of training a reward function from human preferences over pairs of trajectory segments, laying the groundwork for Reinforcement Learning from Human Feedback (RLHF) techniques.
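To make the preference-learning step concrete, here is a minimal PyTorch sketch of the segment-comparison objective this entry describes: a reward model scores each (observation, action) pair, segment scores are summed, and a Bradley-Terry style cross-entropy loss is taken against the human label. The network architecture, tensor shapes, and names are illustrative assumptions rather than the paper's exact setup.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps an (observation, action) pair to a scalar reward estimate."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

def preference_loss(reward_model, seg_a, seg_b, pref):
    """Loss on one pair of trajectory segments.

    seg_a, seg_b: tuples (obs, act) with shapes (T, obs_dim) and (T, act_dim).
    pref: 1.0 if the human preferred segment A, 0.0 if B, 0.5 if indifferent.
    """
    # Sum predicted rewards over each segment.
    r_a = reward_model(*seg_a).sum()
    r_b = reward_model(*seg_b).sum()
    # Probability that A is preferred under the current reward estimate.
    p_a = torch.sigmoid(r_a - r_b)
    # Cross-entropy against the human preference label.
    return -(pref * torch.log(p_a) + (1 - pref) * torch.log(1 - p_a))
```

Minimizing this loss over a dataset of labelled segment pairs fits the reward function, which the RL agent is then trained against.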
Training language models to follow instructions with human feedback, Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe, 2022. Advances in Neural Information Processing Systems, Vol. 35 (Curran Associates, Inc.). DOI: 10.48550/arXiv.2203.02155 - This paper details the InstructGPT model, demonstrating how Reinforcement Learning from Human Feedback (RLHF) can effectively align large language models with user instructions and preferences, highlighting the process and its human data requirements.
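As a rough illustration of the RL objective in this line of work, the sketch below combines the reward model's sequence-level score with a per-token KL penalty toward the frozen supervised (SFT) reference policy, the shaping commonly used in RLHF fine-tuning. The function and argument names are assumptions for illustration, not the paper's code.

```python
import torch

def rlhf_reward(score, logprobs_policy, logprobs_ref, kl_coef=0.1):
    """Per-token reward: preference score minus a KL penalty to the SFT reference.

    score: scalar the reward model assigns to the full sampled response.
    logprobs_policy / logprobs_ref: per-token log-probabilities of that response
        under the policy being trained and under the frozen reference model.
    """
    # Approximate per-token KL divergence from the reference policy.
    kl = logprobs_policy - logprobs_ref
    reward = -kl_coef * kl
    # Convention: the sequence-level score is added on the final token only.
    reward = reward.clone()
    reward[..., -1] = reward[..., -1] + score
    return reward
```

The resulting per-token rewards are what a policy-gradient algorithm such as PPO then optimizes; the KL term keeps the tuned model close to its supervised starting point.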
Constitutional AI: Harmlessness from AI Feedback, Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, et al., 2022. arXiv preprint arXiv:2212.08073. DOI: 10.48550/arXiv.2212.08073 - This foundational paper introduces Constitutional AI, a method that uses AI feedback (RLAIF) to train models to be harmless by aligning them with a set of written principles, directly addressing the scalability and cost challenges of collecting human feedback.
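To show how AI feedback can stand in for the human labelling step, here is a small illustrative sketch in which a feedback model, prompted with a constitutional principle, chooses between two candidate responses; the principle wording, the prompt format, and the llm_generate callable are assumptions for the sketch, not the paper's actual setup.

```python
# Illustrative RLAIF labelling step: a language model, rather than a human,
# picks the response that better satisfies a constitutional principle.

PRINCIPLE = "Choose the response that is more helpful, honest, and harmless."

def ai_preference_label(llm_generate, prompt: str, response_a: str, response_b: str) -> float:
    """Return 1.0 if the feedback model prefers response A, else 0.0."""
    judge_prompt = (
        f"{PRINCIPLE}\n\n"
        f"Human request:\n{prompt}\n\n"
        f"Response (A):\n{response_a}\n\n"
        f"Response (B):\n{response_b}\n\n"
        "Answer with a single letter, A or B:"
    )
    verdict = llm_generate(judge_prompt).strip().upper()
    return 1.0 if verdict.startswith("A") else 0.0
```

The resulting (prompt, response pair, label) tuples can feed the same preference-based reward modelling used for human labels, which is what makes the approach cheaper to scale.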