Training Language Models to Follow Instructions with Human Feedback, Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe, 2022arXiv preprint arXiv:2203.02155DOI: 10.48550/arXiv.2203.02155 - 描述了完整的InstructGPT管道,详细介绍了监督微调(SFT)、奖励模型(RM)训练和使用近端策略优化(PPO)的强化学习阶段,为RLHF系统提供了概念框架。