Training Language Models to Follow Instructions with Human Feedback, Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe, 2022. arXiv preprint arXiv:2203.02155. DOI: 10.48550/arXiv.2203.02155 - Describes the complete RLHF pipeline, including the role of the reward model in generating a scalar reward signal for policy optimization and the KL penalty.
Proximal Policy Optimization Algorithms, John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov, 2017. arXiv preprint arXiv:1707.06347. DOI: 10.48550/arXiv.1707.06347 - Introduces the PPO algorithm, which underlies the policy update step used to maximize the reward.
TRL Documentation, Hugging Face, 2024 - Offers practical examples and documentation for implementing RLHF, including how the reward model is used in the trl framework for PPO training.
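Taken together, these references describe the standard RLHF recipe: a reward model scores sampled responses, and PPO updates the policy to maximize that score under a KL penalty toward a frozen reference model. The following is a minimal, illustrative sketch of a single PPO step using trl's classic PPOTrainer API (pre-0.12; class and argument names differ in newer releases). The small GPT-2 policy and the "lvwerra/distilbert-imdb" classifier standing in for a reward model are assumptions for illustration, not details taken from the cited sources.

```python
# Illustrative single PPO step with trl's classic PPOTrainer API (trl <= 0.11).
# Exact class and argument names vary across trl versions; treat as a sketch.
import torch
from transformers import AutoTokenizer, pipeline
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

model_name = "gpt2"  # small model chosen only for illustration
policy = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)
ref_policy = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)  # frozen reference for the KL penalty
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

config = PPOConfig(batch_size=1, mini_batch_size=1)
ppo_trainer = PPOTrainer(config, policy, ref_policy, tokenizer)

# Stand-in reward model: any model mapping (prompt + response) to a scalar score.
# Here a sentiment classifier used in the trl examples serves as a placeholder.
reward_pipe = pipeline("text-classification", model="lvwerra/distilbert-imdb")

# Sample a response from the current policy.
query = tokenizer("The movie was", return_tensors="pt").input_ids[0]
response = ppo_trainer.generate(query, return_prompt=False, max_new_tokens=16)[0]

# Score the full text with the reward model to get the scalar reward signal.
text = tokenizer.decode(torch.cat([query, response]))
score = reward_pipe(text)[0]["score"]
rewards = [torch.tensor(score)]

# PPOTrainer maximizes the reward while internally penalizing KL divergence
# between the updated policy and the frozen reference model.
stats = ppo_trainer.step([query], [response], rewards)
```

In this sketch the KL penalty is not computed by hand: PPOTrainer compares the policy's log-probabilities against the frozen reference model on each step, which is the mechanism the first reference describes for keeping the optimized policy close to its starting point.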