Training language models to follow instructions with human feedback, Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, and Alex Ray, 2022. Advances in Neural Information Processing Systems, Vol. 35. DOI: 10.48550/arXiv.2203.02155 - A foundational paper describing the Reward Model architecture and training process as part of the RLHF pipeline for aligning large language models. It covers the use of a pre-trained LLM backbone and a scalar regression head to learn human preferences.
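The reward-model recipe this paper describes is compact enough to sketch: a pre-trained LLM backbone topped with a scalar head, trained on pairwise comparisons with a loss of the form -log sigmoid(r(chosen) - r(rejected)). The PyTorch sketch below is a minimal illustration, not the paper's code; the backbone name, class names, and toy inputs are assumptions for demonstration only.

```python
# Minimal sketch of a preference-trained reward model (illustrative assumptions:
# "gpt2" backbone, class/function names are hypothetical, not from the paper).
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class RewardModel(nn.Module):
    def __init__(self, backbone_name: str = "gpt2"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_name)
        # Scalar regression head: maps a hidden state to a single reward value.
        self.value_head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        # Score each sequence using the hidden state of its last non-padding token.
        last_idx = attention_mask.sum(dim=1) - 1
        last_hidden = hidden[torch.arange(hidden.size(0)), last_idx]
        return self.value_head(last_hidden).squeeze(-1)  # (batch,) scalar rewards

def preference_loss(reward_chosen, reward_rejected):
    # Pairwise comparison loss: push the preferred completion's reward
    # above the rejected completion's reward.
    return -torch.nn.functional.logsigmoid(reward_chosen - reward_rejected).mean()

if __name__ == "__main__":
    tok = AutoTokenizer.from_pretrained("gpt2")
    tok.pad_token = tok.eos_token  # GPT-2 has no pad token by default
    model = RewardModel("gpt2")
    chosen = tok(["Q: 2+2? A: 4"], return_tensors="pt", padding=True)
    rejected = tok(["Q: 2+2? A: 5"], return_tensors="pt", padding=True)
    loss = preference_loss(model(**chosen), model(**rejected))
    loss.backward()  # an optimizer step would follow in real training
```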
Constitutional AI: Harmlessness from AI Feedback, Yuntao Bai, Saurav Kadavath, Sandipan Kundu, et al., 2022. arXiv preprint arXiv:2212.08073. DOI: 10.48550/arXiv.2212.08073 - Describes an approach to aligning LLMs using AI feedback, which relies on training a Reward Model (referred to as a "preference model") from human and AI-generated comparisons to evaluate responses for helpfulness and harmlessness.
Learning to summarize from human feedback, Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul Christiano, 2020. Advances in Neural Information Processing Systems, Vol. 33 (Curran Associates, Inc.). DOI: 10.48550/arXiv.2009.01325 - An early work applying Reward Modeling and RLHF to text summarization. It effectively illustrates the core principles of using transformer models to learn human preferences for text generation quality.
A Survey of Reinforcement Learning from Human Feedback, Yuanzhi Zhao, Zili Wang, Yuxin Li, Runji Lin, Xuanfan Ni, Shangqian Leng, and Jiangjie Chen, 2023. arXiv preprint arXiv:2312.09114. DOI: 10.48550/arXiv.2312.09114 - A comprehensive survey reviewing various aspects of RLHF, including detailed discussions on different Reward Model architectures, training techniques, and their role in the overall alignment pipeline.