Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback, Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, Ben Mann, Jared Kaplan, 2022, arXiv preprint arXiv:2204.05862, DOI: 10.48550/arXiv.2204.05862 - This paper details the Reinforcement Learning from Human Feedback (RLHF) framework, including the architecture, loss function, and training methodology for reward models (preference models) trained on human preference comparisons.
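The preference-model objective referenced in this entry is a Bradley-Terry-style pairwise loss. Below is a minimal, illustrative sketch (not code from the paper) of that loss, assuming a reward model that has already scored the preferred and dispreferred response of each comparison; all names are ours.

```python
import torch
import torch.nn.functional as F

def preference_loss(chosen_rewards: torch.Tensor, rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Pairwise preference loss: -log sigmoid(r_chosen - r_rejected), averaged over the batch."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Stand-in scalar rewards for a batch of three human-labeled comparisons.
chosen = torch.tensor([1.2, 0.3, -0.5])     # scores of the preferred responses
rejected = torch.tensor([0.4, -0.1, -0.9])  # scores of the dispreferred responses
print(preference_loss(chosen, rejected))
```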
Constitutional AI: Harmlessness from AI Feedback, Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al., 2022, arXiv preprint arXiv:2212.08073, DOI: 10.48550/arXiv.2212.08073 - This work describes how an AI feedback mechanism, guided by a 'constitution' of principles, can generate the preference data that is then used to train the preference model, forming the basis of RLAIF (Reinforcement Learning from AI Feedback).
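As a rough illustration of the RLAIF labeling step this entry describes, the sketch below turns an AI-feedback judgment into a preference pair. It is purely schematic: the feedback-model call is stubbed out with a placeholder heuristic, and every name and principle string here is hypothetical rather than taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str

def ai_feedback_choice(prompt: str, response_a: str, response_b: str, principle: str) -> str:
    """Stand-in for the feedback model: returns 'A' or 'B'.

    In a Constitutional AI setup this would query a language model with the
    principle and both candidate responses; here the call is only stubbed so
    the sketch runs end to end.
    """
    return "A" if len(response_a) <= len(response_b) else "B"

def label_pair(prompt: str, response_a: str, response_b: str, principle: str) -> PreferencePair:
    """Convert one AI-feedback judgment into a preference pair for preference-model training."""
    choice = ai_feedback_choice(prompt, response_a, response_b, principle)
    chosen, rejected = (response_a, response_b) if choice == "A" else (response_b, response_a)
    return PreferencePair(prompt, chosen, rejected)

principle = "Choose the response that is more harmless."  # hypothetical constitutional principle
pair = label_pair("How do I pick a lock?", "I can't help with that.", "Here is a detailed guide...", principle)
print(pair.chosen)
```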
LoRA: Low-Rank Adaptation of Large Language Models, Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, 2021, arXiv preprint arXiv:2106.09685, DOI: 10.48550/arXiv.2106.09685 - Introduces Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning technique that significantly reduces the number of trainable parameters, which is highly relevant for adapting large language models as preference models.
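To make the mechanism concrete, here is a minimal sketch of the LoRA idea: the pretrained weight is frozen and only a low-rank update B·A is trained. The class name and hyperparameter values are illustrative; in practice one would typically use a library such as peft rather than hand-rolling this.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Linear layer with a frozen base weight plus a trainable low-rank update (W + B @ A)."""

    def __init__(self, in_features: int, out_features: int, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)               # pretrained weight stays frozen
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))  # zero init: no change at start
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

layer = LoRALinear(1024, 1024, rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # only the low-rank A and B matrices are trainable
```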
Learning to Rank using Gradient Descent, Christopher J. C. Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, Greg Hullender, 2005, Proceedings of the 22nd International Conference on Machine Learning (ICML), ACM, DOI: 10.1145/1102351.1102432 - This foundational paper introduces RankNet, which employs a pairwise ranking loss function that is mathematically similar to the Bradley-Terry-inspired objective used for training preference models.
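The connection noted above is easiest to see in code: RankNet models P(i ranked above j) as sigmoid(s_i - s_j) and trains with a cross-entropy loss, which collapses to the same -log sigmoid(difference) form as the preference-model loss sketched earlier when the target probability is 1. This is an illustrative sketch with our own names, not code from the paper.

```python
import torch
import torch.nn.functional as F

def ranknet_loss(score_i: torch.Tensor, score_j: torch.Tensor, target_prob: torch.Tensor) -> torch.Tensor:
    """RankNet-style pairwise cross-entropy.

    P(i ranked above j) is modeled as sigmoid(s_i - s_j); target_prob is 1.0 when
    item i should rank above j, 0.0 when it should rank below, and 0.5 for ties.
    """
    diff = score_i - score_j
    return F.binary_cross_entropy_with_logits(diff, target_prob)

# With target_prob = 1.0 this reduces to -log sigmoid(s_i - s_j),
# the same form as the pairwise preference-model objective.
s_i = torch.tensor([2.0, 0.1])
s_j = torch.tensor([1.0, 0.5])
print(ranknet_loss(s_i, s_j, torch.tensor([1.0, 0.5])))
```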