Deep Reinforcement Learning from Human Preferences, Paul F. Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, Dario Amodei, 2017. Advances in Neural Information Processing Systems 30 (Curran Associates, Inc.) - This seminal paper introduced the method of learning reward functions from human feedback, specifically pairwise comparisons, a core technique later adapted for language model alignment.
Learning to summarize with human feedback, Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, Paul Christiano, 2020. NeurIPS 2020. DOI: 10.48550/arXiv.2009.01325 - This foundational paper demonstrates the application of reinforcement learning from human feedback, using pairwise preferences to train a reward model for text summarization.
Machine Learning: A Probabilistic Perspective, Kevin P. Murphy, 2012 (The MIT Press) - Chapter 28 provides a comprehensive discussion on probabilistic models for ranking and ordinal regression, including the Bradley-Terry model for pairwise preferences.
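The entries above all rest on the Bradley-Terry model of pairwise preferences: the probability that a human prefers item A over item B is a logistic function of the difference of their scalar scores, and a reward model is fit by minimizing the negative log-likelihood of observed preferences. A minimal sketch of that objective, using only the standard library (function names are illustrative, not from any of the cited works):

```python
import math

def bradley_terry_prob(r_a: float, r_b: float) -> float:
    """Bradley-Terry probability that A is preferred over B,
    given scalar scores (rewards) r_a and r_b."""
    return 1.0 / (1.0 + math.exp(-(r_a - r_b)))

def pairwise_nll(r_preferred: float, r_rejected: float) -> float:
    """Negative log-likelihood of one observed preference.
    Minimizing this over the reward model's parameters fits the
    scores to the human comparison data."""
    return -math.log(bradley_terry_prob(r_preferred, r_rejected))

# Equal scores give a 50/50 preference; a higher preferred score
# lowers the loss, which is what drives reward-model training.
p_equal = bradley_terry_prob(1.0, 1.0)        # 0.5
loss_good = pairwise_nll(2.0, 0.0)            # small loss
loss_bad = pairwise_nll(0.0, 2.0)             # larger loss
```

In the RLHF papers above, the scores come from a learned network evaluated on two candidate outputs, and the same loss is summed over a dataset of human comparisons.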