Training language models to follow instructions with human feedback, Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe, 2022. Advances in Neural Information Processing Systems (NeurIPS 2022), Vol. 35. DOI: 10.48550/arXiv.2203.02155 - A landmark paper showing how to fine-tune large language models on human preferences using PPO, laying the foundation for RLHF in LLMs.