One of the more intriguing and practically significant phenomena in adversarial machine learning is the transferability of adversarial examples. This means that an adversarial example x_adv, crafted to fool a specific source model f_S, often successfully fools a different target model f_T, even if f_T has a distinct architecture or was trained independently (though typically on a similar data distribution).
Imagine you've painstakingly crafted an adversarial image using the Projected Gradient Descent (PGD) attack against a ResNet-50 model you trained locally. Transferability suggests that this very same image might also trick a VGG-16 or an Inception-v3 model deployed by someone else, without you needing any specific knowledge about those target models.
This property has profound implications, particularly for black-box attacks. In a black-box setting, the attacker has limited information about the target model f_T. They might only be able to query the model and observe its outputs (labels or scores), without access to its architecture, parameters, or gradients. Transferability provides a viable attack vector:
- Train a Substitute Model: The attacker trains their own local model, f_S (the substitute), trying to mimic the behavior of the target f_T. This can be done using a relevant public dataset or by querying f_T to create a synthetic training set.
- Craft Adversarial Examples: The attacker uses white-box methods (like PGD or C&W, discussed earlier in this chapter) to generate adversarial examples x_adv against their substitute model f_S.
- Attack the Target Model: The attacker submits these adversarial examples x_adv to the black-box target model f_T. Due to transferability, there's a reasonable chance that f_T will also misclassify them.
Figure: A typical workflow for a black-box attack leveraging transferability. The attacker crafts examples against a local substitute model and uses them against the unknown target model.
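To make the workflow concrete, here is a minimal PyTorch sketch of steps 2 and 3 (step 1, training the substitute, is ordinary supervised training and is omitted). The names substitute, target, x, and y are assumptions standing in for a locally trained substitute model, a query-only handle to the target, and a batch of clean inputs in [0, 1] with their labels.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """L-infinity PGD: repeatedly step along the gradient sign and project
    back into the eps-ball around the clean input."""
    x = x.detach()
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1)
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()
            x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()

# Step 2: craft adversarial examples against the local substitute f_S.
x_adv = pgd_attack(substitute, x, y)

# Step 3: submit them to the black-box target f_T and see how many transfer.
with torch.no_grad():
    fooled = target(x_adv).argmax(dim=1) != y
print(f"Transfer success rate: {fooled.float().mean().item():.1%}")
```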
Why Do Adversarial Examples Transfer?
The exact reasons for transferability are still an active area of research, but several hypotheses offer compelling explanations:
- Shared Decision Boundaries: Different models trained for the same task (e.g., classifying cats vs. dogs) tend to learn similar decision boundaries in the input space, especially for high-probability regions corresponding to typical inputs. Adversarial examples often lie near these boundaries. A perturbation pushing an input across the boundary for model f_S might also push it across the similar boundary learned by f_T.
- Common Feature Representations: Deep neural networks, particularly for tasks like image recognition, often learn hierarchical features. Lower-level features (edges, textures) and even some higher-level features might be quite similar across different architectures. Perturbations that exploit vulnerabilities in these shared features are more likely to transfer.
- Input Space Geometry: Adversarial examples might exist in large, contiguous subspaces. If f_S and f_T approximate the same underlying function, an adversarial region for f_S might significantly overlap with an adversarial region for f_T.
- Linearity Hypothesis (related to FGSM): As initially suggested by Goodfellow et al., even highly non-linear models can behave quite linearly locally. Gradient-based attacks exploit this local linearity. If different models exhibit similar local linear behavior around data points, the adversarial examples generated using gradient information might transfer.
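To see the linearity hypothesis in code, consider a minimal FGSM sketch (PyTorch, assuming a classifier model and inputs scaled to [0, 1]): the attack is a single step along the sign of the gradient, which is only a sensible move if the loss behaves roughly linearly around x. When different models share that local linear behavior, the same step tends to hurt them all.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, eps=8/255):
    """Single-step FGSM: if the loss is roughly linear near x, the sign of
    the gradient gives the most damaging per-pixel direction within an
    L-infinity budget of eps."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()
```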
Factors Influencing Transferability Strength
Not all adversarial examples transfer equally well. The degree of transferability depends on several factors:
- Model Architecture Similarity: Attacks tend to transfer better between models with similar architectures (e.g., VGG to VGG) than between vastly different ones (e.g., a deep CNN to a shallow decision tree).
- Source/Target Model Performance: Attacks generated on a highly accurate source model tend to transfer better, while a more robust target model is naturally less susceptible to transferred examples.
- Attack Method: Iterative gradient-based attacks like PGD often exhibit good transferability. Optimization-based attacks like C&W can sometimes overfit to the source model's specific parameters, potentially reducing their transferability compared to PGD, although they might find highly effective examples against the source. Single-step attacks like FGSM generally transfer less effectively than their iterative counterparts.
- Perturbation Size (ϵ): Larger perturbations often lead to higher transferability rates, but they are also more likely to be perceptible to humans or detected by simple defenses.
- Dataset and Task: Transferability is commonly observed in image classification tasks. Its prevalence and strength might differ for other domains like natural language processing or tabular data.
Table: Example transfer success rates between different ImageNet classification models using a strong evasion attack like PGD. Actual rates depend heavily on the specific attack parameters, dataset, and model checkpoints.
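A transfer matrix like this could be estimated with a helper along the following lines (a sketch assuming the pgd_attack function from earlier and hypothetical model handles). This simple version reports the target's error rate on the adversarial batch; a stricter variant would count only examples the target classifies correctly in their clean form.

```python
import torch

@torch.no_grad()
def transfer_rate(x_adv, y, target_model):
    """Fraction of adversarial examples, crafted on some source model,
    that the target model also misclassifies."""
    return (target_model(x_adv).argmax(dim=1) != y).float().mean().item()

# Hypothetical usage: x_adv was crafted against a local ResNet-50 source.
# for name, tgt in [("VGG-16", vgg16), ("Inception-v3", inception_v3)]:
#     print(name, f"{transfer_rate(x_adv, y, tgt):.1%}")
```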
Implications for Evaluating Defenses
The existence of transferability complicates the evaluation of defense mechanisms. A defense that appears effective against white-box attacks specifically crafted for the defended model might still be vulnerable to attacks transferred from other undefended models.
Therefore, a comprehensive robustness evaluation should include:
- Direct White-Box Attacks: Generate attacks assuming full knowledge of the defended model.
- Direct Black-Box Attacks: Simulate realistic black-box scenarios (score-based, decision-based).
- Transfer Attacks: Generate attacks against one or more standard, undefended models (substitutes) and test their effectiveness against the defended model.
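As a rough sketch of the third point, the defended model's accuracy can be compared across clean inputs, direct white-box attacks, and examples transferred from an undefended substitute (reusing the hypothetical pgd_attack helper from earlier; defended, substitute, x, and y are placeholders).

```python
import torch

@torch.no_grad()
def accuracy(model, inputs, y):
    """Top-1 accuracy of `model` on a batch of (possibly adversarial) inputs."""
    return (model(inputs).argmax(dim=1) == y).float().mean().item()

# Illustrative comparison (placeholder names):
# clean     = accuracy(defended, x, y)
# white_box = accuracy(defended, pgd_attack(defended, x, y), y)    # attack the defense directly
# transfer  = accuracy(defended, pgd_attack(substitute, x, y), y)  # attack an undefended substitute
# If transferred attacks succeed where direct white-box attacks fail,
# the defense may be masking gradients rather than being genuinely robust.
```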
If a defense significantly degrades the transferability of attacks from standard models, it suggests the defense mechanism is genuinely altering the model's decision-making process in a meaningful way, rather than just masking gradients (which we'll discuss later).
Understanding transferability is essential not only for designing potent black-box attacks but also for building and verifying truly robust machine learning systems. It highlights that securing a model requires considering its behavior not just in isolation but also in relation to other models operating in the same problem space.