Training language models to follow instructions with human feedback, Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe, 2022arXivDOI: 10.48550/arXiv.2203.02155 - 描述了从人类反馈中进行强化学习(RLHF)的流程,解释了策略模型、参考模型、奖励模型和价值模型的作用及其阶段性训练。