While gradient-based attacks offer potent ways to generate adversarial examples, as discussed previously, they typically require white-box access to the target model, meaning you need knowledge of its architecture and parameters, $M(x;\theta)$. But what happens when the Large Language Model (LLM) you're testing is a black box, offering only an API endpoint with no visibility into its internal workings? This is where transfer attacks using substitute models become a powerful technique in your red teaming toolkit.
Transfer attacks exploit a fascinating property of adversarial examples: their transferability. This means an adversarial input crafted to deceive one model often has a high chance of deceiving another model, even if the second model has a different architecture or was trained on different (though likely related) data. It’s as if these adversarial examples tap into more fundamental, shared vulnerabilities in how machine learning models interpret high-dimensional input spaces.
The core strategy of a transfer attack is to first create or train a substitute model (sometimes called a surrogate model) that mimics the behavior of the target black-box LLM. Once you have this substitute, which you do have white-box access to, you can then apply gradient-based or other white-box attack techniques to generate adversarial examples against it. Finally, you "transfer" these adversarial examples by feeding them to the original target LLM, hoping they'll cause the intended misbehavior.
Let's break down the typical workflow:
The first step is to create a dataset to train your substitute. This involves interacting with the target black-box LLM: you send it a broad, diverse set of prompts, record the responses it returns, and treat these input-output pairs as labeled training data for the substitute.
This querying process can be resource-intensive, potentially requiring many API calls, which might incur costs or trigger rate limits on the target system.
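As a rough illustration, the sketch below collects such a dataset from a generic HTTP chat endpoint. The endpoint URL, the request and response schema, and the helper names (`query_target`, `build_substitute_dataset`) are placeholders for whatever API the target actually exposes, not a specific provider's interface.

```python
import json
import time
import requests

API_URL = "https://api.example.com/v1/chat/completions"  # hypothetical black-box endpoint
API_KEY = "YOUR_API_KEY"

def query_target(prompt: str) -> str:
    """Send one prompt to the black-box target and return its text response."""
    resp = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"model": "target-llm", "messages": [{"role": "user", "content": prompt}]},
        timeout=30,
    )
    resp.raise_for_status()
    # Assumes an OpenAI-style response layout; adapt to the real API's schema.
    return resp.json()["choices"][0]["message"]["content"]

def build_substitute_dataset(prompts, out_path="substitute_data.jsonl", delay_s=1.0):
    """Collect (prompt, response) pairs to use as training data for the substitute."""
    with open(out_path, "w") as f:
        for prompt in prompts:
            response = query_target(prompt)
            f.write(json.dumps({"prompt": prompt, "response": response}) + "\n")
            time.sleep(delay_s)  # simple pacing to stay under rate limits
```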
Once your substitute model $M_{\text{substitute}}$ is trained, you effectively have a local, white-box copy that (hopefully) behaves similarly to the target. Now, you can leverage standard white-box attack techniques to generate adversarial examples. For instance, you could use methods like the Fast Gradient Sign Method (FGSM) or Projected Gradient Descent (PGD) to find an input $X_{\text{adv}}$ that, when fed to $M_{\text{substitute}}$, produces an undesirable or malicious output $Y^{s}_{\text{adv}} = M_{\text{substitute}}(X_{\text{adv}})$. Because you have full access to $M_{\text{substitute}}$, including its gradients, crafting $X_{\text{adv}}$ is much more direct.
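FGSM and PGD operate on continuous inputs, so for text you typically either work in the substitute's embedding space or use a gradient-guided token search that keeps the input discrete. The sketch below takes the latter route with a simple HotFlip-style greedy swap. Here `substitute_loss` is a hypothetical callable you would define to score how close the substitute's output is to the behavior you want (lower is better), and `embedding_matrix` is the substitute's token embedding table; both are assumptions for illustration.

```python
import torch

def greedy_token_flip(substitute_loss, embedding_matrix, input_ids, n_steps=20):
    """
    Greedy gradient-guided token substitution against a white-box substitute.
    At each step, pick the single (position, token) swap whose first-order
    effect most reduces the attacker's loss on the substitute.
    """
    input_ids = input_ids.clone()
    vocab_size = embedding_matrix.size(0)

    for _ in range(n_steps):
        one_hot = torch.nn.functional.one_hot(input_ids, vocab_size).float()
        one_hot.requires_grad_(True)
        loss = substitute_loss(one_hot @ embedding_matrix)  # (seq_len, hidden) -> scalar
        loss.backward()

        grad = one_hot.grad                               # (seq_len, vocab)
        current = grad.gather(1, input_ids.unsqueeze(1))  # gradient at the current tokens
        delta = grad - current                            # estimated loss change per swap
        pos = int(delta.min(dim=1).values.argmin())       # best position to edit
        new_tok = int(delta[pos].argmin())                # best replacement token there
        if delta[pos, new_tok] >= 0:                      # no improving swap left
            break
        input_ids[pos] = new_tok

    return input_ids
```

Decoding the returned IDs with the substitute's tokenizer yields the adversarial text candidate to try against the target.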
The crucial final step is to take the adversarial example $X_{\text{adv}}$ generated using your substitute model and feed it to the original black-box target LLM, $M_{\text{target}}$. If the adversarial properties transfer, then $M_{\text{target}}(X_{\text{adv}})$ will also yield an undesirable output $Y^{t}_{\text{adv}}$, similar to $Y^{s}_{\text{adv}}$, or at least achieve the attacker's goal (e.g., bypassing a safety filter, extracting information).
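Continuing the earlier sketch, the loop below replays adversarial prompts against the black-box API (reusing the `query_target` helper from before) and measures how many of them transfer. `success_check` is a hypothetical predicate encoding the attacker's goal, for example detecting that no refusal occurred.

```python
def transfer_and_check(adv_prompts, success_check):
    """Feed adversarial prompts crafted on the substitute to the target and
    record which ones transfer, along with the overall transfer rate."""
    results = []
    for prompt in adv_prompts:
        response = query_target(prompt)
        results.append({
            "prompt": prompt,
            "response": response,
            "transferred": success_check(response),
        })
    transfer_rate = sum(r["transferred"] for r in results) / max(len(results), 1)
    return results, transfer_rate
```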
The diagram below illustrates this process:
The transfer attack process: querying the target, training a substitute, crafting adversarial examples on the substitute, and applying them to the target.
The success of transfer attacks isn't guaranteed, but several factors can increase the odds:

- Similarity between the substitute and the target: the closer their architectures, training data, and tokenization, the more likely adversarial examples carry over.
- Adversarial examples that aren't overfit to the substitute: perturbations tuned too tightly to one model's quirks often fail to transfer, so moderate perturbation strength and some randomness during crafting help.
- Crafting against an ensemble of substitutes rather than a single one, as sketched after this list: examples that fool several different models at once tend to exploit shared weaknesses.
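A minimal sketch of the ensemble idea, assuming each `loss_fn` maps input embeddings to a scalar attacker loss as in the earlier token-flip example:

```python
import torch

def ensemble_loss(substitute_loss_fns, input_embeddings):
    """Average the attacker loss across several substitute models.

    Adversarial inputs that fool every ensemble member tend to exploit
    weaknesses shared across models and therefore transfer to the unseen
    target more often than inputs overfit to a single substitute.
    """
    losses = [loss_fn(input_embeddings) for loss_fn in substitute_loss_fns]
    return torch.stack(losses).mean()
```

This can be passed directly as the `substitute_loss` argument of the token-flip sketch above.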
Transfer attacks are a significant concern because they effectively lower the barrier to entry for attacking LLMs. An attacker doesn't need privileged access or insider knowledge of the target model. They "only" need query access and the resources to train a substitute.
However, creating a high-fidelity substitute can be challenging. It requires:

- A sufficient query budget, since collecting enough input-output pairs can be expensive and may run into rate limits or usage monitoring.
- A sensible choice of substitute architecture and training setup, made largely without knowledge of the target's internals.
- Enough compute to train the substitute to a level of fidelity where attacks crafted against it are meaningful.
Despite these challenges, the potential for transfer attacks means that even LLMs considered secure due to their black-box nature are not immune. As a red teamer, understanding how to simulate or perform such attacks helps in assessing the true resilience of an LLM system. If you can successfully craft transferable adversarial examples, it highlights a vulnerability that needs to be addressed, perhaps by improving input filtering, output sanitization, or developing methods to detect the training of substitute models via query patterns. This technique underscores the importance of a defense-in-depth strategy for LLM security.
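On the defensive side, detecting substitute training from query patterns can start with something as simple as per-client volume and topical-spread statistics. The heuristic below is only a rough sketch under stated assumptions: the thresholds, the baseline spread, and the choice of query embedding would all need calibration against real traffic logs.

```python
import numpy as np

def flag_possible_extraction(query_embeddings, volume_threshold=5000,
                             spread_ratio_threshold=1.5, baseline_spread=1.0):
    """
    Crude heuristic: a client that sends far more queries than usual, spread
    unusually widely across topic (embedding) space, looks more like someone
    building a substitute-training dataset than an ordinary user.
    """
    X = np.asarray(query_embeddings)
    if len(X) <= volume_threshold:
        return False
    # Mean distance from the client's own query centroid as a spread proxy.
    spread = np.linalg.norm(X - X.mean(axis=0), axis=1).mean()
    return (spread / baseline_spread) > spread_ratio_threshold
```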