Deep Reinforcement Learning from Human Preferences, Paul Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei, 2017. Advances in Neural Information Processing Systems (NeurIPS), Vol. 30 (Curran Associates, Inc.). DOI: 10.55917/cb47-601e - This paper introduces a method for training a reward function from human feedback on trajectory segments, laying the groundwork for reinforcement learning from human feedback (RLHF).
Training language models to follow instructions with human feedback, Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe, 2022. Advances in Neural Information Processing Systems, Vol. 35 (Curran Associates, Inc.). DOI: 10.48550/arXiv.2203.02155 - This paper details the InstructGPT model, showing how reinforcement learning from human feedback (RLHF) can effectively align large language models with user instructions and preferences, and highlighting the process involved and its demand for human-labeled data.
Constitutional AI: Harmlessness from AI Feedback, Yuntao Bai, Saurav Kadavath, Sandhini Agarwal, Andy Jones, Anna Chen, Cameron McKinnon, Carole-Anne Razavi, Edouard Charette, Jackson Kernion, Jeremiah Kaplan, Kristen Hilton, Lee Sharkey, Maciej Korbak, Martin Wattenberg, Micah Rosenkranz, Morningstar Anguige, Nikhil Chelluri, Nicholas Schiefer, Nicole Sanchez, Sam Bowman, Scott McGhew, Shauna Gordon-Nunez, Stephen Casper, Stephen Marcus, Tom Brown, Tamera Lanham, Zac Hatfield-Dodds, Ben Mann, Amanda Askell, Jack Clark, Sam McCandlish, Dario Amodei, and Jared Kaplan, 2023. International Conference on Learning Representations (ICLR). DOI: 10.48550/arXiv.2212.08073 - This foundational paper introduces Constitutional AI, an approach that uses AI feedback (RLAIF) guided by a set of principles to train models toward harmlessness, directly addressing the scalability and cost challenges of human feedback.