Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback, Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, Ben Mann, Jared Kaplan, 2022. arXiv preprint arXiv:2204.05862. DOI: 10.48550/arXiv.2204.05862 - Details Reinforcement Learning from Human Feedback (RLHF) for training language-model assistants to be helpful and harmless, building on the Helpful, Honest, and Harmless (HHH) framework for LLM alignment.
Concrete Problems in AI Safety, Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, Dan Mané, 2016. arXiv preprint arXiv:1606.06565. DOI: 10.48550/arXiv.1606.06565 - A foundational paper outlining concrete AI safety challenges, including avoiding negative side effects, reward hacking, scalable oversight, safe exploration, and robustness to distributional shift, all highly relevant to alignment.