Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback, Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe, 2022arXiv preprint arXiv:2203.02155DOI: 10.48550/arXiv.2203.02155 - 这篇有影响力的论文详细介绍了将PPO与KL散度正则化应用于通过人类反馈训练大型语言模型,展示了其在使模型与人类偏好对齐方面的有效性。