Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned, Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, Andy Jones, Sam Bowman, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Nelson Elhage, Sheer El-Showk, Stanislav Fort, Zac Hatfield-Dodds, Tom Henighan, Danny Hernandez, Tristan Hume, Josh Jacobson, Scott Johnston, Shauna Kravec, Catherine Olsson, Sam Ringer, Eli Tran-Johnson, Dario Amodei, Tom Brown, Nicholas Joseph, Sam McCandlish, Chris Olah, Jared Kaplan, Jack Clark, 2022, arXiv preprint arXiv:2209.07858, DOI: 10.48550/arXiv.2209.07858 - Details approaches for red teaming large language models to identify safety weaknesses, a component of proactive incident detection and safety mechanism design.
GPT-4 System Card, OpenAI, 2023 - Outlines the safety efforts, challenges, and mitigation strategies implemented for GPT-4, including risk assessment, red teaming, and continuous monitoring, offering insight into LLM safety operations.