Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned. Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, et al., 2022. arXiv preprint arXiv:2209.07858. DOI: 10.48550/arXiv.2209.07858 - Written by a leading AI safety research lab, this paper introduces red teaming of LLMs, covering methods, limitations, and ethical considerations.