Training language models to follow instructions with human feedback, Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F Christiano, Jan Leike, Ryan Lowe, 2022, Advances in Neural Information Processing Systems, Vol. 35 - This paper introduces the InstructGPT models and lays out the full three-stage RLHF pipeline for aligning large language models: supervised fine-tuning (SFT), training a reward model (RM) on human preference comparisons, and reinforcement learning fine-tuning with PPO under a KL penalty against the SFT policy.
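As a pointer to the RL stage summarized above, the paper's KL-penalized objective (with the pretraining-mix term of the "PPO-ptx" variant) is roughly of the following form; here π_φ^RL is the policy being trained, π^SFT the supervised baseline, r_θ the learned reward model, and β, γ the penalty and pretraining-mix coefficients. This is a sketch in the paper's notation, not a verbatim quotation:

\[
\text{objective}(\phi) \;=\; \mathbb{E}_{(x,y)\sim D_{\pi_{\phi}^{\mathrm{RL}}}}\!\Big[\, r_{\theta}(x,y) \;-\; \beta \,\log \frac{\pi_{\phi}^{\mathrm{RL}}(y\mid x)}{\pi^{\mathrm{SFT}}(y\mid x)} \,\Big] \;+\; \gamma\, \mathbb{E}_{x\sim D_{\mathrm{pretrain}}}\!\big[ \log \pi_{\phi}^{\mathrm{RL}}(x) \big]
\]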