Constitutional AI: Harmlessness from AI Feedback, Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, et al., 2022. arXiv preprint arXiv:2212.08073. DOI: 10.48550/arXiv.2212.08073 - Describes a method for aligning large language models with AI feedback; the method relies heavily on reward models trained from human preferences (called "preference models") to evaluate the helpfulness and harmlessness of responses.
Learning to summarize from human feedback, Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul Christiano, 2020. Advances in Neural Information Processing Systems, Vol. 33 (Curran Associates, Inc.). arXiv:2009.01325 - An early work applying reward models and RLHF to text summarization. It effectively demonstrates the core principle of learning human preferences over text-generation quality with Transformer models.
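Both entries above center on a preference (reward) model fit to pairwise comparisons between responses. The following is a minimal sketch of that core idea, not the authors' implementations: it assumes a PyTorch setup, a placeholder linear scorer standing in for a pretrained Transformer, and random embeddings standing in for encoded responses. The loss simply pushes the reward of the preferred response above that of the dispreferred one.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Toy preference model: maps a text embedding to a scalar reward.
    (A real preference model would wrap a pretrained Transformer; the
    embedding dimension and linear head here are placeholder assumptions.)"""
    def __init__(self, embed_dim: int = 768):
        super().__init__()
        self.score_head = nn.Linear(embed_dim, 1)

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # embeddings: (batch, embed_dim) -> rewards: (batch,)
        return self.score_head(embeddings).squeeze(-1)

def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise (Bradley-Terry style) loss: the chosen response should
    receive a higher reward than the rejected one."""
    return -torch.nn.functional.logsigmoid(reward_chosen - reward_rejected).mean()

# Usage with random embeddings standing in for encoded response pairs.
model = RewardModel()
chosen_emb = torch.randn(8, 768)    # embeddings of preferred responses
rejected_emb = torch.randn(8, 768)  # embeddings of dispreferred responses
loss = preference_loss(model(chosen_emb), model(rejected_emb))
loss.backward()
print(float(loss))
```

In both papers the comparisons come from labelers (or, in Constitutional AI's harmlessness stage, from AI feedback), and the trained reward model is then used as the optimization target for reinforcement learning.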