Standard single-shot prompts, where you send one instruction and get one response, are only one way to interact with Large Language Models. Many LLM applications are designed for extended dialogues, maintaining context over several turns. This conversational capability, while beneficial for user experience, opens up a distinct set of attack vectors that we, as red teamers, must understand and test. Multi-turn conversation attacks exploit the LLM's memory of the preceding dialogue to manipulate its behavior, elicit sensitive information, or bypass safety protocols in ways that might not be possible with isolated prompts.
These attacks rely on the LLM's context window. The context window is the amount of prior conversation the model can "remember" and consider when generating its next response. An attacker can strategically fill this context window over several interactions to guide the LLM towards a compromised state. This section builds upon our understanding of vulnerabilities like prompt injection and jailbreaking by examining how they can be incrementally executed or amplified within a conversational flow.
Multi-turn attacks are often more subtle than their single-prompt counterparts. They possess several distinct characteristics:
Think about it like a negotiation. A single, outrageous demand is likely to be rejected. However, a series of smaller, more reasonable-sounding requests can gradually lead the other party to a position they would have initially refused.
Several specific strategies fall under the umbrella of multi-turn conversation attacks. Let's examine some of the more common ones.
A long or complex set of instructions that might be flagged as suspicious if delivered in a single prompt can sometimes be broken down and delivered piecemeal across multiple turns. Each individual part might seem harmless, but their combined effect can be to inject a malicious prompt or steer the LLM towards an undesirable action.
For instance, an attacker might try to build up a scenario:
Here, the attacker gradually shifts the context from "fictional story" to "bypassing security systems." Each step seems plausible within the stated goal, but the cumulative effect is a request for potentially sensitive information. The red teamer's goal is to see if the LLM's safeguards are sufficient to recognize the pattern or if they only evaluate prompts in isolation.
This technique involves "priming" the LLM with certain information or a specific persona early in the conversation. This initial context then subtly influences the LLM's responses in later turns, potentially leading it to reveal information or adopt behaviors it otherwise wouldn't.
Imagine an attacker wants the LLM to generate code for a dubious purpose.
The "poison" here is the attacker's claimed identity and benign intent, which aims to lower the LLM's guard for subsequent, more problematic requests.
LLMs strive to be helpful and maintain conversational coherence. Over extended dialogues, especially if the user is persistent, the model might drift from its initial safety programming or become more deeply entrenched in an adopted persona.
This is an extension of the role-playing attacks discussed in Chapter 2 ("Jailbreaking and Role-Playing Attacks"). In a multi-turn scenario, an attacker can:
The diagram below illustrates a simplified flow of how an attacker might try to achieve persona amplification over multiple turns to bypass safety guidelines.
This diagram shows an attacker attempting to establish and then escalate a "CritiqueBot" persona over three turns to elicit potentially policy-violating content. The initial turns are less harmful, aiming to commit the LLM to the persona.
The important aspect is the gradual escalation. If the attacker asked for the most extreme output on turn one, the LLM's safety filters would likely engage. By incrementally pushing the boundaries within the established persona, the attacker hopes to wear down these defenses.
Sensitive information is rarely divulged by an LLM in a single query. However, an attacker might be able to piece together a more complete picture by asking a series of related, less direct questions over multiple turns. Each answer might be innocuous on its own, but their combination could reveal confidential data or system insights.
For example, instead of asking "What is the database password?", which would almost certainly be denied, an attacker might try:
While a well-secured LLM should not answer these in a way that leaks specific secrets, a red teamer tests these sequences to see how much peripheral information can be gathered and if it cumulatively creates a risk.
When you are red teaming an LLM for multi-turn vulnerabilities, your approach requires more subtlety than single-prompt attacks.
These attacks are particularly challenging to defend against because:
Briefly, some mitigation approaches (which we will detail in Chapter 5, "Defenses and Mitigation Strategies for LLMs") involve more sophisticated context-aware monitoring, techniques to prevent conversational drift away from safety guidelines, and methods to detect and reset suspicious conversational states.
Multi-turn conversation attacks require a different mindset from the red teamer. It's less about a single "gotcha" prompt and more about a sustained campaign of influence. By understanding how to manipulate conversational context, you can identify significant weaknesses in how LLMs handle state and memory over extended interactions. This understanding is essential for building more resilient and trustworthy AI systems.
Was this section helpful?
© 2025 ApX Machine Learning