Constitutional AI: Harmlessness from AI Feedback, Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosuite, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemi Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan, Tristan Hume, Samuel R. Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, Jared Kaplan, 2022. arXiv preprint arXiv:2212.08073. DOI: 10.48550/arXiv.2212.08073 - This paper introduces Constitutional AI, the core topic of this course, and provides an early example of reinforcement learning from AI feedback (RLAIF), discussing the challenges of aligning AI with human values through self-correction.
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback, Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, Ben Mann, Jared Kaplan, 2022. arXiv preprint arXiv:2204.05862. DOI: 10.48550/arXiv.2204.05862 - This paper details the process of training a helpful and harmless assistant with RLHF, covering preference modeling and alignment challenges that are highly relevant to RLAIF failure modes such as reward hacking.
Concrete Problems in AI Safety, Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, Dan Mané, 2016. arXiv preprint arXiv:1606.06565. DOI: 10.48550/arXiv.1606.06565 - This paper outlines several major challenges in AI safety, including reward hacking and specification gaming, providing foundational background for understanding common failure modes in alignment work such as RLAIF.