Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned, Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, Andy Jones, Sam Bowman, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Nelson Elhage, Sheer El-Showk, Stanislav Fort, Zac Hatfield-Dodds, Tom Henighan, Danny Hernandez, Tristan Hume, Josh Jacobson, Scott Johnston, Shauna Kravec, Catherine Olsson, Sam Ringer, Eli Tran-Johnson, Dario Amodei, Tom Brown, Nicholas Joseph, Sam McCandlish, Chris Olah, Jared Kaplan, Jack Clark, 2022. arXiv preprint arXiv:2209.07858. DOI: 10.48550/arXiv.2209.07858 - Details methods for red teaming large language models to identify safety weaknesses, as part of proactive incident detection and safety mechanism design.
GPT-4 System Card, OpenAI, 2023 - Outlines the safety work, challenges, and mitigation strategies implemented for GPT-4, including risk assessment, red teaming, and continuous monitoring, offering insights into LLM safety operations.