Constitutional AI: Harmlessness from AI Feedback, Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosuite, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemi Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan, Tristan Hume, Samuel R. Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, Jared Kaplan, 2022. arXiv preprint arXiv:2212.08073. DOI: 10.48550/arXiv.2212.08073 - This paper introduces Constitutional AI (CAI) and shows how it uses constitution-guided AI feedback (RLAIF) to align language models, directly covering the core topic of this section; a small illustrative sketch of the preference-labeling step follows.
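The sketch below is a minimal, hypothetical illustration of the RLAIF preference-labeling step the annotation refers to: a feedback model, prompted with one principle from a constitution, picks the preferable of two candidate responses. The `query_model` helper and the two sample principles are placeholders for this sketch, not the paper's actual prompts or API.

```python
# Illustrative RLAIF preference labeling: an AI feedback model, guided by a
# constitutional principle, chooses between two candidate responses. The
# resulting (prompt, chosen, rejected) pairs would then train a preference
# model, as in Constitutional AI. All names here are assumptions.

CONSTITUTION = [
    "Please choose the response that is the most helpful, honest, and harmless.",
    "Please choose the response that is least likely to assist a harmful act.",
]

def query_model(prompt: str) -> str:
    """Hypothetical LLM call; replace with a real chat-completion client."""
    raise NotImplementedError

def ai_preference(prompt: str, response_a: str, response_b: str, principle: str) -> str:
    """Ask the feedback model which response better satisfies the principle."""
    judge_prompt = (
        f"Consider the following conversation:\n{prompt}\n\n"
        f"{principle}\n"
        f"(A) {response_a}\n"
        f"(B) {response_b}\n"
        "Answer with a single letter, A or B:"
    )
    answer = query_model(judge_prompt).strip().upper()
    return "A" if answer.startswith("A") else "B"
```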
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback, Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, Ben Mann, Jared Kaplan, 2022. arXiv preprint arXiv:2204.05862. DOI: 10.48550/arXiv.2204.05862 - This paper details how reinforcement learning from human feedback (RLHF) is used to train language models to be more helpful and harmless, providing the foundational principles on which RLAIF builds; see the loss sketch after this entry.
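The RLHF pipeline in this paper trains a preference (reward) model on pairwise human comparisons. A minimal sketch of that pairwise loss, assuming PyTorch is available and that the reward model has already produced scalar scores per response, is:

```python
# Pairwise preference loss for an RLHF reward model: the model should score
# the human-preferred ("chosen") response above the rejected one. This is
# the Bradley-Terry negative log-likelihood, -log sigmoid(r_chosen - r_rejected).

import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood that the chosen response wins each comparison."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Example with dummy scalar rewards for a batch of four comparison pairs:
chosen = torch.tensor([1.2, 0.3, 0.8, 2.0])
rejected = torch.tensor([0.5, 0.1, 1.1, 0.0])
loss = preference_loss(chosen, rejected)
```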
Constrained Policy Optimization, Joshua Achiam, David Held, Aviv Tamar, Pieter Abbeel, 2017. Proceedings of the 34th International Conference on Machine Learning, Vol. 70 (PMLR) - This paper introduces a method for optimizing policies under safety constraints, which is directly relevant to shaping RLAIF reward functions with constitutional penalties or hard constraints, as sketched below.
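Achiam et al. enforce the constraint with a trust-region update; for the use case the annotation mentions (folding a constitutional penalty into an RLAIF reward), a simpler Lagrangian relaxation is a common stand-in. The sketch below is illustrative only: the cost signal, cost limit, and update rule are assumptions, not the paper's algorithm.

```python
# Constraint-as-penalty sketch: a Lagrange multiplier scales a per-step
# harmlessness cost, and the multiplier is updated by dual ascent whenever
# the measured average cost exceeds the allowed limit.

def penalized_reward(task_reward: float, harm_cost: float, lagrange_lambda: float) -> float:
    """Reward the policy actually optimizes: task reward minus scaled cost."""
    return task_reward - lagrange_lambda * harm_cost

def update_lambda(lagrange_lambda: float, avg_cost: float,
                  cost_limit: float, lr: float = 0.01) -> float:
    """Dual ascent: raise lambda when the constraint is violated, else decay it."""
    return max(0.0, lagrange_lambda + lr * (avg_cost - cost_limit))
```

Under this relaxation, the multiplier grows until the policy's average constitutional cost falls back below the limit, trading off task reward against harmlessness automatically rather than via a hand-tuned fixed penalty.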