Reward Model Training: Architectures and Loss Functions
Constitutional AI: Harmlessness from AI Feedback, Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosuite, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemi Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan, Tristan Hume, Samuel R. Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, Jared Kaplan, 2022. arXiv preprint arXiv:2212.08073. DOI: 10.48550/arXiv.2212.08073 - Presents an approach to LLM alignment that leverages AI feedback instead of extensive human labeling, offering insights into reward modeling for safety and advanced alignment objectives.
TRL (Transformers Reinforcement Learning) Library Documentation, Hugging Face, 2024 - Provides official documentation and practical guides for implementing reward models and the RLHF pipeline using the trl library, which supports the architectures and loss functions discussed.
Deep Reinforcement Learning from Human Preferences, Christiano, Paul F., Leike, Jan, Brown, Tom B., Martic, Miljan, Legg, Shane, and Amodei, Dario, 2017. Advances in Neural Information Processing Systems, Vol. 30 (NeurIPS). DOI: 10.48550/arXiv.1706.03741 - A foundational paper introducing the idea of training deep reinforcement learning agents on a reward function learned from human preference comparisons, a direct precursor to modern RLHF for language models.
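To make the loss function referenced above concrete, the following is a minimal, illustrative sketch of the standard pairwise (Bradley-Terry style) reward-model objective that grew out of the preference-learning line of work cited here. It assumes a PyTorch and Hugging Face transformers setup; the class name ScalarRewardModel, the helper pairwise_preference_loss, and the placeholder backbone "distilgpt2" are hypothetical choices for illustration, not part of any library API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer


class ScalarRewardModel(nn.Module):
    """Pretrained transformer body with a scalar value head (illustrative sketch)."""

    def __init__(self, model_name: str):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(model_name)
        self.value_head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        # Score each sequence by the hidden state of its last non-padding token.
        last_token = attention_mask.sum(dim=1) - 1
        pooled = hidden[torch.arange(hidden.size(0)), last_token]
        return self.value_head(pooled).squeeze(-1)  # shape: (batch_size,)


def pairwise_preference_loss(chosen_rewards: torch.Tensor,
                             rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style objective: -log sigmoid(r_chosen - r_rejected), averaged."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()


if __name__ == "__main__":
    # "distilgpt2" is only a small placeholder backbone for this example.
    name = "distilgpt2"
    tokenizer = AutoTokenizer.from_pretrained(name)
    tokenizer.pad_token = tokenizer.eos_token
    model = ScalarRewardModel(name)

    chosen = tokenizer(["Q: 2+2? A: 4"], return_tensors="pt", padding=True)
    rejected = tokenizer(["Q: 2+2? A: 5"], return_tensors="pt", padding=True)

    loss = pairwise_preference_loss(
        model(chosen["input_ids"], chosen["attention_mask"]),
        model(rejected["input_ids"], rejected["attention_mask"]),
    )
    loss.backward()  # gradients flow into both the backbone and the value head
    print(f"pairwise loss: {loss.item():.4f}")
```

Because the loss depends only on the difference between the chosen and rejected scores, the learned reward is defined only up to an additive constant; libraries such as trl implement a closely related pairwise objective in their reward-training utilities, so this hand-rolled version is best read as a conceptual sketch rather than production code.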