Training Language Models to Follow Instructions with Human Feedback, Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe, 2022 (arXiv). DOI: 10.48550/arXiv.2203.02155 - This foundational paper introduces the Reinforcement Learning from Human Feedback (RLHF) pipeline for large language models, detailing the reward model's role and the inherent challenges of aligning model behavior with human preferences.
Deep Reinforcement Learning from Human Preferences, Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, Dario Amodei, 2017 (arXiv). DOI: 10.48550/arXiv.1706.03741 - This early and influential paper trains agents using human preference comparisons as the reward signal, discussing how preference data are collected and how a reliable reward model can be learned from subjective human judgments (a minimal sketch of this pairwise-comparison objective follows these references).
Concrete Problems in AI Safety, Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, Dan Mané, 2016 (arXiv). DOI: 10.48550/arXiv.1606.06565 - This widely cited paper outlines several key challenges in AI safety, including 'reward hacking' (closely related to 'specification gaming'), a critical failure mode that arises when an agent optimizes an imperfect proxy for the true objective.
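Both Christiano et al. (2017) and Ouyang et al. (2022) fit their reward models on pairwise human comparisons with a Bradley-Terry-style objective: the response the labeler preferred should receive a higher scalar reward than the rejected one. The sketch below is a minimal illustration of that loss only, not the papers' implementations; the `RewardModel` architecture, embedding dimension, and synthetic data are hypothetical placeholders.

```python
# Illustrative sketch of the pairwise (Bradley-Terry) preference loss used to
# train reward models in RLHF pipelines. All names, sizes, and data are
# hypothetical placeholders, not the papers' actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Maps a fixed-size response embedding to a scalar reward."""
    def __init__(self, embed_dim: int = 128):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(embed_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.scorer(x).squeeze(-1)  # one scalar reward per response

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Negative log-likelihood of the human preference under a Bradley-Terry
    # model: maximize sigmoid(r_chosen - r_rejected) for each labeled pair.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    model = RewardModel()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    # Synthetic preference pairs: embeddings of a chosen and a rejected response.
    chosen = torch.randn(32, 128)
    rejected = torch.randn(32, 128)

    for step in range(100):
        loss = preference_loss(model(chosen), model(rejected))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(f"final preference loss: {loss.item():.4f}")
```

The learned scalar reward is then used as the optimization target for the policy (e.g., via PPO in the papers above), which is exactly where the reward-hacking concern raised by Amodei et al. (2016) becomes relevant: the policy optimizes the learned proxy, not the true human objective.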