Learning to summarize with human feedback, Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, Paul Christiano, 2020. Advances in Neural Information Processing Systems (NeurIPS). DOI: 10.48550/arXiv.2009.01325 - An earlier work from OpenAI demonstrating the use of human feedback for training reward models, specifically for summarization tasks, employing a similar pairwise preference learning objective.
Deep Reinforcement Learning from Human Preferences, Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, Dario Amodei, 2017. Advances in Neural Information Processing Systems (NeurIPS). DOI: 10.48550/arXiv.1706.03741 - A seminal work that proposed using human feedback, specifically pairwise comparisons, to train reward models for general deep reinforcement learning agents, predating the technique's widespread application to language models.
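Both references above train reward models from pairwise human comparisons. As a minimal sketch (not code from either paper), the objective they share is a Bradley-Terry style loss: the reward model should assign a higher score to the human-preferred output, and the loss is the negative log-probability of the observed preference under a sigmoid of the score difference.

```python
import math

def pairwise_preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood of a human preference under the
    Bradley-Terry model: P(chosen preferred) = sigmoid(r_chosen - r_rejected).
    r_chosen / r_rejected are reward-model scores for the two outputs."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# The loss is small when the model already ranks the preferred output
# higher, and large when it ranks it lower.
print(round(pairwise_preference_loss(2.0, 0.5), 4))  # → 0.2014
print(round(pairwise_preference_loss(0.5, 2.0), 4))  # → 1.7014
```

In practice the scalar scores come from a learned network evaluated on full (prompt, response) pairs, and the loss is averaged over a dataset of human comparisons; the toy scalars here are purely illustrative.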