Red Teaming and Safety Testing

Was this section helpful?

References

GPT-4 System Card, OpenAI, 2023 (OpenAI) - Presents practices and findings from red teaming GPT-4, detailing methods to identify and mitigate model harms.
Universal and Transferable Adversarial Attacks on Aligned Language Models, Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson, 2023 arXiv preprint arXiv:2307.15043 DOI: 10.48550/arXiv.2307.15043 - Introduces methods for automatically generating adversarial prompts to bypass safety controls in language models.