While gradient-based attacks offer potent ways to generate adversarial examples, as discussed previously, they typically require white-box access to the target model, meaning you need knowledge of its architecture and parameters, $M(x; \theta)$. But what happens when the Large Language Model (LLM) you're testing is a black box, offering only an API endpoint with no visibility into its internal workings? This is where transfer attacks using substitute models become a powerful technique in your red teaming toolkit.Transfer attacks exploit a fascinating property of adversarial examples: their transferability. This means an adversarial input crafted to deceive one model often has a high chance of deceiving another model, even if the second model has a different architecture or was trained on different (though likely related) data. It’s as if these adversarial examples tap into more fundamental, shared vulnerabilities in how machine learning models interpret high-dimensional input spaces.The core strategy of a transfer attack is to first create or train a substitute model (sometimes called a surrogate model) that mimics the behavior of the target black-box LLM. Once you have this substitute, which you do have white-box access to, you can then apply gradient-based or other white-box attack techniques to generate adversarial examples against it. Finally, you "transfer" these adversarial examples by feeding them to the original target LLM, hoping they'll cause the intended misbehavior.Let's break down the typical workflow:1. Building the Substitute ModelThe first step is to create a dataset to train your substitute. This involves interacting with the target black-box LLM:Querying the Target LLM: You'll send a series of diverse input prompts, $X_q$, to the target LLM. These could be legitimate-seeming queries, carefully crafted probes, or even random strings, depending on your strategy.Collecting Outputs: For each input $X_{q_i}$, you record the corresponding output $Y_{q_i} = M_{target}(X_{q_i})$ from the target LLM. This input-output pair $(X_{q_i}, Y_{q_i})$ becomes a data point for training your substitute. The more data points you collect, and the more representative they are of the target's behavior, the better your substitute model is likely to be.Training the Substitute: With this collected dataset, you then train your own model, $M_{substitute}$. This substitute could be another LLM (perhaps a smaller, open-source model like a variant of GPT-2 or LLaMA that you can fine-tune) or even a simpler classification/regression model if the target LLM's task is specific enough. The goal is for $M_{substitute}$ to approximate the input-output behavior of $M_{target}$.This querying process can be resource-intensive, potentially requiring many API calls, which might incur costs or trigger rate limits on the target system.2. Crafting Adversarial Examples on the SubstituteOnce your substitute model $M_{substitute}$ is trained, you effectively have a local, white-box copy that (hopefully) behaves similarly to the target. Now, you can leverage standard white-box attack techniques to generate adversarial examples. For instance, you could use methods like the Fast Gradient Sign Method (FGSM) or Projected Gradient Descent (PGD) to find an input $X_{adv}$ that, when fed to $M_{substitute}$, produces an undesirable or malicious output $Y_{adv_s} = M_{substitute}(X_{adv})$. Because you have full access to $M_{substitute}$, including its gradients, crafting $X_{adv}$ is much more direct.3. Transferring the Attack to the Target LLMThe main final step is to take the adversarial example $X_{adv}$ generated using your substitute model and feed it to the original black-box target LLM, $M_{target}$. If the adversarial properties transfer, then $M_{target}(X_{adv})$ will also yield an undesirable output $Y_{adv_t}$, similar to $Y_{adv_s}$, or at least achieve the attacker's goal (e.g., bypassing a safety filter, extracting information).The diagram below illustrates this process:digraph G { rankdir=TB; node [shape=box, style="filled,rounded", fontname="sans-serif"]; edge [fontname="sans-serif"]; attacker [label="Attacker", fillcolor="#e9ecef"]; target_llm [label="Target LLM\n(Black Box)", fillcolor="#a5d8ff", shape=cylinder, style="filled"]; substitute_llm [label="Substitute LLM\n(White Box)", fillcolor="#b2f2bb", shape=cylinder, style="filled"]; adv_gen [label="Adversarial Example\nGeneration", fillcolor="#ffc9c9"]; success [label="Successful Attack\non Target", fillcolor="#ff8787", shape=octagon]; attacker -> target_llm [label="1. Queries (Xq)"]; target_llm -> attacker [label="2. Outputs (Yq)"]; attacker -> substitute_llm [label="3. Train with (Xq, Yq)"]; substitute_llm -> adv_gen [label="4. Use for crafting"]; adv_gen -> attacker [label="5. Obtain X_adv"]; attacker -> target_llm [label="6. Submit X_adv"]; target_llm -> success [label="7. Malicious Output Y_adv_t", style=dashed, color="#f03e3e"]; }The transfer attack process: querying the target, training a substitute, crafting adversarial examples on the substitute, and applying them to the target.Factors Influencing TransferabilityThe success of transfer attacks isn't guaranteed, but several factors can increase the odds:Substitute Model Similarity: The more closely $M_{substitute}$'s architecture, training data distribution, or decision boundaries resemble those of $M_{target}$, the higher the likelihood of transfer. Using a pre-trained LLM of a similar family as the target, if known or suspected, can be beneficial.Attack Method: Some adversarial attack generation methods inherently produce more transferable examples. Iterative methods that find more effective adversarial examples often transfer better than single-step methods.Data Quality for Substitute: The diversity and volume of query-response pairs used to train the substitute model play a significant role. A substitute trained on a dataset that better captures the details of the target's behavior will likely lead to more transferable attacks.Model Capacity: Larger models sometimes exhibit better transferability for attacks crafted on them. Conversely, attacks crafted on very small substitute models might not transfer well to large, complex target models.Why This Matters for Red TeamingTransfer attacks are a significant concern because they effectively lower the barrier to entry for attacking LLMs. An attacker doesn't need privileged access or insider knowledge of the target model. They "only" need query access and the resources to train a substitute.However, creating a high-fidelity substitute can be challenging. It requires:A substantial number of queries to the target model, which can be slow, expensive, or trigger detection mechanisms if the queries are unusual or voluminous.Computational resources to train the substitute model itself.Careful selection of the substitute model architecture and training process.Despite these challenges, the potential for transfer attacks means that even LLMs considered secure due to their black-box nature are not immune. As a red teamer, understanding how to simulate or perform such attacks helps in assessing the true resilience of an LLM system. If you can successfully craft transferable adversarial examples, it highlights a vulnerability that needs to be addressed, perhaps by improving input filtering, output sanitization, or developing methods to detect the training of substitute models via query patterns. This technique underscores the importance of a defense-in-depth strategy for LLM security.