Training Language Models to Follow Instructions with Human Feedback, Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe, 2022, arXiv preprint, DOI: 10.48550/arXiv.2203.02155 - The paper demonstrating how PPO can be used to align large language models with human preferences via reinforcement learning from human feedback (RLHF).
Fine-tune a LLaMA model with 🤗PEFT & 🤗TRL, Edward Beeching, Younes Belkada, Leandro von Werra, Sourab Mangrulkar, Lewis Tunstall, Kashif Rasul, 2023 (Hugging Face Blog) - A practical guide to PPO fine-tuning of large language models with TRL, including hyperparameter details.