Training language models to follow instructions with human feedback, Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe, 2022. Advances in Neural Information Processing Systems, Vol. 35 (NeurIPS). DOI: 10.48550/arXiv.2203.02155 - This paper introduces the InstructGPT models and the RLHF pipeline, detailing the reward model architecture, the training objective, and their role in aligning large language models with human instructions.
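For reference, the reward-model training objective this entry mentions is a pairwise ranking loss over human preference comparisons; below is a LaTeX rendering following the paper's notation, where $r_\theta$ is the reward model, $y_w$ and $y_l$ are the preferred and dispreferred completions for prompt $x$, $K$ is the number of completions ranked per prompt, $\sigma$ is the logistic sigmoid, and $D$ is the comparison dataset:

$$
\mathcal{L}(\theta) = -\frac{1}{\binom{K}{2}}\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim D}\Big[\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\Big]
$$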
Learning to summarize with human feedback, Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, Paul F. Christiano, 2020. Advances in Neural Information Processing Systems, Vol. 33 (Curran Associates) - One of the earliest landmark applications of RLHF to text generation, showing how a reward model trained on human preferences can guide a summarization model.
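As a minimal illustration of how a trained reward model can guide a summarization model, the sketch below implements best-of-N reranking, a reward-guided selection baseline the paper evaluates alongside its main PPO training. This is a sketch, not the paper's implementation: `sample` and `reward` are hypothetical stand-ins for a summarization policy and a trained reward model.

```python
from typing import Callable, List

def best_of_n(prompt: str,
              sample: Callable[[str], str],         # hypothetical: draws one candidate summary
              reward: Callable[[str, str], float],  # hypothetical: trained reward model score
              n: int = 16) -> str:
    """Draw n candidate summaries from the policy and return the one
    the reward model scores highest (reward-guided reranking)."""
    candidates: List[str] = [sample(prompt) for _ in range(n)]
    return max(candidates, key=lambda y: reward(prompt, y))
```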
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback, Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, Ben Mann, Jared Kaplan, 2022. arXiv preprint arXiv:2204.05862. DOI: 10.48550/arXiv.2204.05862 - This work further explores the RLHF pipeline for aligning large language models, focusing on the principles of helpfulness and harmlessness, and offers insights into the challenges of reward model training.