Training language models to follow instructions with human feedback, Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe, 2022, arXiv. DOI: 10.48550/arXiv.2203.02155 - Details the Reinforcement Learning from Human Feedback (RLHF) pipeline, explaining the roles of the policy, reference, reward, and value models and their sequential training.
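For orientation, the RL stage described in this paper optimizes a KL-regularized reward. The objective below follows the form given in the paper, where $\pi_\phi^{\mathrm{RL}}$ is the policy being trained, $\pi^{\mathrm{SFT}}$ the frozen reference model, $r_\theta$ the learned reward model, and $\beta$, $\gamma$ the KL-penalty and pretraining-mix coefficients.

```latex
\operatorname{objective}(\phi) =
  \mathbb{E}_{(x,y)\sim D_{\pi_\phi^{\mathrm{RL}}}}
    \Big[ r_\theta(x, y) \;-\; \beta \,
      \log\!\frac{\pi_\phi^{\mathrm{RL}}(y \mid x)}{\pi^{\mathrm{SFT}}(y \mid x)} \Big]
  \;+\; \gamma \, \mathbb{E}_{x\sim D_{\mathrm{pretrain}}}
    \big[ \log \pi_\phi^{\mathrm{RL}}(x) \big]
```

The KL term keeps the policy close to the reference model, while the pretraining term mitigates regressions on standard language-modeling benchmarks.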
Proximal Policy Optimization Algorithms, John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov, 2017, arXiv. DOI: 10.48550/arXiv.1707.06347 - Introduces the Proximal Policy Optimization (PPO) algorithm, central to the RL fine-tuning phase of RLHF, including concepts like clipped policy updates and KL divergence.
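As a reminder of the core update rule, the clipped surrogate objective from the paper is reproduced below, with $r_t(\theta)$ the probability ratio between the new and old policies, $\hat{A}_t$ the estimated advantage, and $\epsilon$ the clipping range.

```latex
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)},
\qquad
L^{\mathrm{CLIP}}(\theta) =
  \hat{\mathbb{E}}_t \Big[ \min\big( r_t(\theta)\,\hat{A}_t,\;
    \operatorname{clip}\!\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t \big) \Big]
```

Clipping removes the incentive to move $r_t(\theta)$ outside $[1-\epsilon,\, 1+\epsilon]$, which keeps each policy update small without a hard trust-region constraint.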
TRL Documentation, Hugging Face, 2024 - Official documentation for the TRL library, providing practical guides for setting up and utilizing RLHF components like AutoModelForCausalLMWithValueHead and PPOTrainer.
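A minimal sketch of how these components fit together, assuming the classic TRL PPO interface (TRL versions before the 0.12 rewrite; signatures differ in newer releases). The model name, prompt, and constant reward are placeholders; in practice the reward would come from a trained reward model.

```python
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

# Policy (with value head) plus a frozen reference copy for the KL penalty.
model = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")      # placeholder checkpoint
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

config = PPOConfig(batch_size=1, mini_batch_size=1)
ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer)

# One PPO step: generate a response, score it, update the policy.
query = tokenizer("Explain RLHF in one sentence:", return_tensors="pt").input_ids[0]
response = ppo_trainer.generate(query, return_prompt=False, max_new_tokens=32)[0]
rewards = [torch.tensor(1.0)]  # dummy scalar reward standing in for a reward-model score
stats = ppo_trainer.step([query], [response], rewards)
```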
Transformers Documentation, Hugging Face, 2024 - Official documentation for the Hugging Face Transformers library, covering general model loading, model architectures, and checkpoint management relevant for RLHF.
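For the checkpoint-management side, a brief sketch of the standard Transformers save/load cycle; the model name and output directory are illustrative.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load a pretrained causal LM and its tokenizer from the Hub.
model = AutoModelForCausalLM.from_pretrained("gpt2")          # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Save an intermediate checkpoint during fine-tuning, then reload it later.
model.save_pretrained("./checkpoints/rlhf-step-100")          # illustrative path
tokenizer.save_pretrained("./checkpoints/rlhf-step-100")
reloaded = AutoModelForCausalLM.from_pretrained("./checkpoints/rlhf-step-100")
```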