Training language models to follow instructions with human feedback, Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe, 2022 (arXiv). DOI: 10.48550/arXiv.2203.02155 - This foundational paper introduces the reinforcement learning from human feedback (RLHF) pipeline for large language models, detailing the role of the reward model and the inherent challenges of aligning model behavior with human preferences.
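
For readers unfamiliar with the reward-modeling step the annotation refers to, the following is a minimal sketch (not code from the paper) of the pairwise preference loss commonly used in RLHF reward modeling: the reward model is trained so that the response annotators preferred scores higher than the rejected one. The `TinyRewardModel` class and `pairwise_preference_loss` function are hypothetical illustrations, with a toy linear head standing in for a full language-model backbone.

```python
# Minimal sketch of pairwise reward-model training for RLHF (illustrative only).
import torch
import torch.nn as nn


class TinyRewardModel(nn.Module):
    """Toy scalar reward head over pre-computed response embeddings
    (a hypothetical stand-in for a full language-model backbone)."""

    def __init__(self, dim: int = 16):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        # One scalar reward per (prompt, response) embedding.
        return self.score(emb).squeeze(-1)


def pairwise_preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry style objective: maximize the log-probability that the
    # human-preferred response outranks the rejected one,
    # i.e. -log(sigmoid(r_chosen - r_rejected)).
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()


if __name__ == "__main__":
    model = TinyRewardModel()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    # Fake embeddings standing in for annotator-labeled comparison pairs.
    chosen, rejected = torch.randn(8, 16), torch.randn(8, 16)
    loss = pairwise_preference_loss(model(chosen), model(rejected))
    loss.backward()
    opt.step()
    print(f"pairwise loss: {loss.item():.4f}")
```

The trained reward model then serves as the scalar objective that the policy is optimized against (e.g. with PPO) in the later stage of the RLHF pipeline described in the paper.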