Training language models to follow instructions with human feedback, Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe, 2022. arXiv preprint arXiv:2203.02155. DOI: 10.48550/arXiv.2203.02155 - Demonstrates how reinforcement learning from human feedback (RLHF) with PPO can be applied to large language models to align them with human preferences, producing models such as InstructGPT.