Training language models to follow instructions with human feedback, Long Ouyang, Jeff Wu, Xu Jiang, et al., 2022. arXiv preprint arXiv:2203.02155. DOI: 10.48550/arXiv.2203.02155 - Introduces reinforcement learning from human feedback (RLHF); this section notes that RLHF faces scaling challenges because of its continued reliance on human preference annotation.
Constitutional AI: Harmlessness from AI Feedback, Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, et al., 2022. arXiv preprint arXiv:2212.08073. DOI: 10.48550/arXiv.2212.08073 - Directly proposes Constitutional AI as a scalable alignment method that uses AI feedback specifically to reduce the need for extensive human supervision.