After you've identified vulnerabilities, reported them, and collaborated on fixes, the red teaming cycle doesn't quite end. A significant part of maturing a red team operation, especially in the rapidly evolving area of Large Language Models, is to systematically document your methods. This involves recording both the general procedures you follow and the specific "plays" or attack patterns that proved effective (or even those that didn't, along with why). This knowledge base becomes an invaluable asset for future engagements, training, and improving your team's overall effectiveness.
The Value of Systematic Documentation
Before we get into the specifics, let's briefly touch upon why meticulous documentation is more than just administrative overhead. In the context of LLM red teaming:
- Consistency and Repeatability: Documented procedures ensure that assessments are conducted consistently, regardless of which team member is leading the effort. This is important for comparing results over time and across different models.
- Knowledge Transfer: LLM attack techniques are constantly emerging. A well-maintained internal knowledge base helps disseminate new findings and methods across your team efficiently. It also significantly speeds up the onboarding process for new members.
- Efficiency Gains: Why reinvent the wheel for every assessment? Documented attack plays and common vulnerability patterns allow your team to quickly apply known techniques to new targets, saving time and effort.
- Building Institutional Memory: Team members may change, but the institution's knowledge should persist and grow. Documentation acts as this memory, capturing lessons learned and successful strategies.
- Baseline for Improvement: By documenting what you did and how, you create a baseline. This allows you to track the effectiveness of your techniques, identify areas for improvement in your methodology, and adapt to new LLM architectures or defense mechanisms.
- Supporting Remediation and Retesting: Clear documentation of an attack (a "play") is immensely helpful for development teams trying to understand and fix a vulnerability. It also provides the exact steps needed for retesting to verify that a fix is effective.
Documenting Red Teaming Procedures
Procedures describe the "how-to" of your red teaming operations. They are broader than individual attack techniques and cover the various phases and supporting activities of an engagement. Your procedural documentation should be a living guide, updated as your team refines its approach.
Consider including the following elements in your procedural documentation:
- Engagement Lifecycle: Outline the standard phases of your LLM red teaming engagements, from initial planning and scope definition (as discussed in Chapter 1) through to reporting and remediation support. For each phase, describe the objectives, typical activities, and expected outputs.
- Attack Surface Identification Methods: Detail the systematic approaches your team uses to map out potential weak points in an LLM system. This could include:
- Checklists for inspecting LLM APIs.
- Processes for analyzing data ingestion pipelines.
- Methods for understanding how the LLM interacts with external systems or data sources.
- Testing Protocol for Common Vulnerabilities: For prevalent LLM vulnerabilities like prompt injection, data poisoning, or sensitive information extraction (covered in Chapters 2, 3, and 4), document your standard testing approaches. For instance, for prompt injection, your protocol might list:
- Initial reconnaissance prompts.
- A sequence of increasingly sophisticated injection techniques.
- Methods for testing against known defense patterns.
- Tooling and Environment Setup: Document the tools your team uses, both open-source and internally developed. Include:
- Setup instructions and configurations.
- Common usage patterns and examples.
- Scripts or code snippets for recurring tasks.
- Evidence Collection and Management: Specify how findings should be documented during testing. This includes:
- Formats for saving prompts and LLM responses.
- Screenshot or video capture guidelines.
- Methods for noting the conditions under which a vulnerability was observed.
- Ethical and Legal Guidelines: Reiterate the ethical considerations and legal frameworks (from Chapter 1) that govern your team's activities, especially concerning data privacy, responsible disclosure, and avoiding unintended harm.
Creating an LLM Red Team Playbook
While procedures provide the general framework, "plays" are specific, repeatable attack patterns that your team has developed or observed. Think of it as building a library of exploits tailored to LLMs. Documenting these plays allows for quick replication and adaptation in future assessments.
A well-documented play should include:
- Play Identifier: A unique name or code (e.g.,
LLM-PI-001-IndirectContext
, LLM-MEM-003-RoleConfusion
).
- Target Vulnerability Category: The general class of weakness this play targets (e.g., Indirect Prompt Injection, Jailbreaking, Excessive Agency, Information Disclosure).
- LLM/System Characteristics: Describe the type of LLM or system configuration where this play is likely to be effective (e.g., "LLMs with long context windows," "Systems using RAG from untrusted sources," "Models fine-tuned with insufficient safety data").
- Objective of the Play: What the attacker aims to achieve (e.g., "Bypass safety filter to generate restricted content," "Extract system prompts," "Induce the LLM to perform an unauthorized action via an integrated tool").
- Prerequisites/Setup: Any specific conditions or setup required (e.g., "Access to an API endpoint," "Ability to influence a document that will be retrieved by the LLM," "Multi-turn conversation history").
- Step-by-Step Execution:
- Detailed instructions on how to execute the play.
- Specific example prompts or input sequences.
- The expected interaction flow if the play is successful.
- Include variations that were attempted, both successful and unsuccessful, as these can provide valuable insights.
- Indicators of Success (IoS): Clear criteria for determining if the play was successful. This could be a specific output from the LLM, a change in system state, or observed behavior.
- Observed Impact: The actual or potential consequences if this play is successfully executed against a production system.
- Known Defenses Bypassed (if any): If the play successfully circumvented specific safety measures, document them.
- Mitigation Ideas: Initial thoughts on how this specific attack pattern could be mitigated (this feeds back into Chapter 5 and the current chapter's focus on remediation).
- Notes and Lessons Learned: Any observations, challenges encountered, or particular insights gained during the development or execution of this play. For example, "This play is highly sensitive to phrasing X but fails with phrasing Y."
Below is a diagram illustrating the key components of an attack play document:
Structure of an attack play document, highlighting essential information to capture for each recorded technique.
Organizing and Maintaining Your Knowledge Base
How you store and manage this documentation is also important. Some common approaches include:
- Internal Wiki: Platforms like Confluence, SharePoint, or dedicated wiki software allow for collaborative editing, easy linking between pages, and good searchability.
- Version Control Systems (e.g., Git): Storing documentation (especially in formats like Markdown) in a Git repository allows for version history, branching for new technique development, and review processes (e.g., pull requests for new plays).
- Dedicated Red Team Platforms: Some commercial or open-source red team operations platforms offer features for managing campaign data, including attack narratives and procedures.
Regardless of the tool, establish a process for:
- Regular Review and Updates: LLM technology and defenses change rapidly. Schedule periodic reviews of your procedures and plays to ensure they remain relevant and effective. Retire outdated information.
- Contribution Guidelines: Encourage all team members to contribute. Clear guidelines on how to document a new play or suggest a procedural update can foster a culture of shared ownership.
- Tagging and Categorization: Use tags or categories to make it easier to find relevant procedures or plays (e.g., by LLM type, vulnerability class, attack technique).
By diligently documenting your red teaming procedures and plays, you transform individual experiences and ad-hoc discoveries into a structured, evolving body of knowledge. This not only enhances the capabilities of your current team but also lays a solid foundation for future AI safety and security efforts. As you retest systems after remediation, these documented plays become your regression test suite, ensuring that previously closed vulnerabilities stay closed.