Deep Reinforcement Learning from Human Preferences. Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, Dario Amodei. 2017. arXiv preprint arXiv:1706.03741. DOI: 10.48550/arXiv.1706.03741 - Foundational paper introducing Reinforcement Learning from Human Feedback (RLHF), detailing the process of learning a reward function from pairwise human preferences, which is directly analogous to RLAIF's reward function design. A minimal illustrative sketch of this pairwise preference loss follows.
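
The sketch below illustrates the pairwise preference objective described in this paper: a reward model scores two trajectory segments, and the Bradley-Terry probability that one is preferred is fit to the human label with cross-entropy. The network architecture, segment dimensions, and optimizer settings are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of reward learning from pairwise preferences (Christiano et al., 2017).
# Architecture and dimensions below are assumptions for illustration only.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps per-timestep observation features to a scalar reward estimate."""
    def __init__(self, obs_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, segment: torch.Tensor) -> torch.Tensor:
        # segment: (batch, timesteps, obs_dim) -> summed predicted reward per segment
        return self.net(segment).squeeze(-1).sum(dim=-1)

def preference_loss(rm: RewardModel,
                    seg_a: torch.Tensor,
                    seg_b: torch.Tensor,
                    prefs: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between the Bradley-Terry preference probability
    P[a > b] = exp(R_a) / (exp(R_a) + exp(R_b)) and the preference label."""
    r_a, r_b = rm(seg_a), rm(seg_b)
    logits = r_a - r_b  # log-odds that segment a is preferred
    return nn.functional.binary_cross_entropy_with_logits(logits, prefs)

# Illustrative usage with random data standing in for collected comparisons.
if __name__ == "__main__":
    rm = RewardModel(obs_dim=8)
    opt = torch.optim.Adam(rm.parameters(), lr=1e-3)
    seg_a = torch.randn(32, 25, 8)   # 32 preference pairs, 25-step segments
    seg_b = torch.randn(32, 25, 8)
    prefs = torch.randint(0, 2, (32,)).float()  # 1.0 if segment a was preferred
    loss = preference_loss(rm, seg_a, seg_b, prefs)
    opt.zero_grad(); loss.backward(); opt.step()
    print(f"preference loss: {loss.item():.4f}")
```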
RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback. Harrison Lee, Samrat Phatale, Hassan Mansoor, Thomas Mesnard, Johan Ferret, Kellie Lu, Colton Bishop, Ethan Hall, Victor Carbune, Abhinav Rastogi, Sushant Prakash. 2024. Proceedings of the 41st International Conference on Machine Learning, Vol. 235 (PMLR). DOI: 10.48550/arXiv.2309.00267 - Focuses on Reinforcement Learning from AI Feedback (RLAIF) as a method to scale preference-based learning, detailing the generation of AI preferences and their use in training reward models for improved LLM alignment. A sketch of AI preference labeling follows.
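
The sketch below illustrates the AI-preference-generation step this entry describes: an off-the-shelf LLM judge compares two candidate responses, and its verdict becomes the preference label fed into the same pairwise reward-model loss shown above. The `query_llm` callable and the prompt template are hypothetical stand-ins for whatever inference API and rubric are actually used; the paper itself derives soft preference distributions from the judge's token likelihoods, whereas this simplified sketch emits a hard label.

```python
# Sketch of AI preference labeling in the RLAIF style. The judge prompt and
# `query_llm` interface are assumptions, not the paper's exact setup.
from typing import Callable

JUDGE_TEMPLATE = """You are evaluating two summaries of the same text.

Text: {context}

Summary 1: {response_a}
Summary 2: {response_b}

Which summary is better? Answer with "1" or "2"."""

def ai_preference(context: str,
                  response_a: str,
                  response_b: str,
                  query_llm: Callable[[str], str]) -> float:
    """Return 1.0 if the AI judge prefers response_a, 0.0 if it prefers
    response_b, and 0.5 (a tie) if the verdict cannot be parsed."""
    prompt = JUDGE_TEMPLATE.format(
        context=context, response_a=response_a, response_b=response_b)
    answer = query_llm(prompt).strip()
    if answer.startswith("1"):
        return 1.0
    if answer.startswith("2"):
        return 0.0
    return 0.5

# Illustrative usage with a dummy judge that always prefers the first response.
if __name__ == "__main__":
    label = ai_preference("A long article...", "Short summary A.",
                          "Short summary B.", query_llm=lambda prompt: "1")
    print(label)  # 1.0 -> usable as the target in the pairwise reward-model loss
```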