Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback, Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, Ben Mann, Jared Kaplan, 2022. arXiv preprint arXiv:2204.05862. DOI: 10.48550/arXiv.2204.05862 - This paper details the reinforcement learning from human feedback (RLHF) framework, including the architecture, loss function, and training procedure for the reward model (preference model) trained on human preference data.
Constitutional AI: Harmlessness from AI Feedback, Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, Andy Jones, Sam Bowman, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Nelson Elhage, Sheer El-Showk, Stanislav Fort, Zac Hatfield-Dodds, Tom Henighan, Danny Hernandez, Tristan Hume, Josh Jacobson, Scott Johnston, Shauna Kravec, Catherine Olsson, Sam Ringer, Eli Tran-Johnson, Dario Amodei, Tom Brown, Nicholas Joseph, Sam McCandlish, Chris Olah, Jared Kaplan, Jack Clark, 2022. arXiv preprint arXiv:2209.07858. DOI: 10.48550/arXiv.2209.07858 - This work describes how preference data is generated by an AI feedback mechanism guided by a "constitution"; that data is then used to train a preference model, forming the basis of RLAIF.
Learning to Rank using Gradient Descent, Christopher J. C. Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, Greg Hullender, 2005. Proceedings of the 22nd International Conference on Machine Learning (ICML), ACM. DOI: 10.1145/1102351.1102432 - This foundational paper introduces RankNet, whose pairwise ranking loss is mathematically similar to the Bradley-Terry-based objective used to train preference models.
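To make the mathematical similarity noted in the last entry concrete, here is a minimal numpy sketch (illustrative only, not code from any of the cited papers): with a hard preference target, RankNet's pairwise cross-entropy reduces to the Bradley-Terry preference loss -log σ(r_chosen − r_rejected). The function names and scalar-reward setup are assumptions made for this illustration.

```python
import numpy as np

def bradley_terry_loss(r_chosen: np.ndarray, r_rejected: np.ndarray) -> np.ndarray:
    """Preference-model loss: -log sigmoid(r_chosen - r_rejected)."""
    # log(1 + exp(-x)) computed stably via logaddexp
    return np.logaddexp(0.0, -(r_chosen - r_rejected))

def ranknet_loss(s_i: np.ndarray, s_j: np.ndarray, p_target: float = 1.0) -> np.ndarray:
    """RankNet cross-entropy on P(i beats j) = sigmoid(s_i - s_j)."""
    diff = s_i - s_j
    # -[p*log P + (1-p)*log(1-P)] rewritten in a numerically stable form
    return (1.0 - p_target) * diff + np.logaddexp(0.0, -diff)

# With a hard target p_target = 1 (the "chosen" response is always preferred),
# RankNet's pairwise loss coincides with the Bradley-Terry preference loss.
r_c, r_r = np.array([1.3, 0.2]), np.array([0.5, 0.9])
assert np.allclose(bradley_terry_loss(r_c, r_r), ranknet_loss(r_c, r_r))
```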