Training language models to follow instructions with human feedback, Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe, 2022. Advances in Neural Information Processing Systems (NeurIPS). DOI: 10.48550/arXiv.2203.02155 - This foundational paper lays out the complete pipeline for aligning large language models with human instructions and preferences via reinforcement learning from human feedback (RLHF), detailing supervised fine-tuning, reward modeling, and PPO-based reinforcement learning (a sketch of the reward-modeling loss follows this list).
Learning to summarize from human feedback, Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, Paul Christiano, 2020. Advances in Neural Information Processing Systems (NeurIPS). DOI: 10.48550/arXiv.2009.01325 - An influential early work demonstrating the RLHF pipeline on a concrete language task: training models to generate summaries aligned with human preferences.
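
The reward-modeling stage that both papers describe trains a scalar reward model on pairwise human preferences with a Bradley-Terry style objective, -log σ(r(x, y_chosen) − r(x, y_rejected)). Below is a minimal PyTorch sketch of that loss; the function and variable names are illustrative, not taken from either paper, and the full training details live in the papers themselves.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_rewards: torch.Tensor,
                      rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Pairwise preference loss used for RLHF reward modeling:
    -log sigmoid(r(x, y_chosen) - r(x, y_rejected)), averaged over the batch.
    Each element pairs the reward for a human-preferred response with the
    reward for the rejected response to the same prompt."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage: scalar rewards a reward model assigned to preferred vs. rejected responses.
chosen = torch.tensor([1.2, 0.7, 2.1])
rejected = torch.tensor([0.3, 0.9, 1.0])
print(reward_model_loss(chosen, rejected))  # scalar loss; lower when chosen > rejected
```

The trained reward model then serves as the optimization target for the PPO stage, which fine-tunes the policy to maximize this learned reward (typically with a KL penalty against the supervised fine-tuned model, as both papers note).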