Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned. Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, Andy Jones, Sam Bowman, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Nelson Elhage, Sheer El-Showk, Stanislav Fort, Zac Hatfield-Dodds, Tom Henighan, Danny Hernandez, Tristan Hume, Josh Jacobson, Scott Johnston, Shauna Kravec, Catherine Olsson, Sam Ringer, Eli Tran-Johnson, Dario Amodei, Tom Brown, Nicholas Joseph, Sam McCandlish, Chris Olah, Jared Kaplan, Jack Clark. 2022. arXiv preprint arXiv:2209.07858. DOI: 10.48550/arXiv.2209.07858 - Discusses methods for systematically identifying and mitigating safety risks in LLMs through red teaming, an important pre-deployment step for ensuring safe deployment.
Holistic Evaluation of Language Models. Percy Liang, Rishi Bommasani, Tony Lee, et al. 2023. Transactions on Machine Learning Research. DOI: 10.48550/arXiv.2211.09110 - Proposes a comprehensive framework for evaluating LLMs across diverse scenarios and metrics, offering key principles for designing pre-deployment safety evaluations and continuous monitoring.