Understanding the landscape of adversarial attacks requires a structured way to classify them. Having discussed threat models and the attack surfaces available during training versus inference, we can now establish a taxonomy. This classification helps in analyzing vulnerabilities, comparing attack effectiveness, and designing appropriate defense strategies. Adversarial attacks are typically categorized along several key dimensions:
Categorization Axes
The primary axes used to classify adversarial attacks include:
- Attacker's Goal: What does the attacker intend to achieve?
- Attacker's Knowledge: How much information does the attacker possess about the target model?
- Attack Specificity: Is the attack designed to cause any error or a specific, predetermined error?
- Attack Frequency/Timing: When does the attack occur relative to the model's lifecycle (training or inference)?
Let's examine each of these dimensions in more detail.
Classification by Attacker's Goal
The objective of an attack significantly influences its design:
- Misclassification (Integrity Attack): The most common goal is to cause the model to produce an incorrect output for a given input. For a classification task, this means causing f(x_adv) ≠ y_true, where x_adv is the adversarial input, f is the model, and y_true is the correct label. The attacker doesn't care what the wrong output is, as long as it's incorrect.
- Confidence Reduction: Instead of causing an outright misclassification, the attacker might aim to reduce the model's confidence in its prediction for a specific input. This can make the model seem unreliable or potentially easier to fool with a subsequent attack.
- Source/Target Misclassification (Targeted Attack): A more specific goal where the attacker wants the model to classify a particular input x (source) as a specific, incorrect target class t. The goal is f(x_adv) = t, where t ≠ y_true. These are generally harder to achieve than untargeted misclassifications. A minimal sketch after this list shows how these first three goals can be checked programmatically.
- Availability Attack: The attacker aims to degrade the model's overall performance, often by increasing its computation time (e.g., denial-of-service) or causing it to fail on a large fraction of inputs. This is often associated with poisoning attacks during training.
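To make the first three goals concrete, the following minimal sketch checks which of them a perturbed input satisfies, given the model's output probabilities for the clean and perturbed inputs. The function name and the conf_drop margin are illustrative choices for this sketch, not part of any standard API.

```python
import numpy as np

def goals_achieved(probs_clean, probs_adv, y_true, target=None, conf_drop=0.2):
    """Report which attacker goals a perturbed input satisfies (illustrative only).

    probs_clean, probs_adv: output probabilities for the clean and perturbed
    input; y_true: index of the correct class; target: optional target class
    for a targeted attack; conf_drop: assumed margin for counting a
    confidence-reduction attack as successful.
    """
    pred_adv = int(np.argmax(probs_adv))
    results = {
        # Untargeted misclassification: any wrong prediction counts.
        "misclassification": pred_adv != y_true,
        # Confidence reduction: the true-class probability dropped noticeably.
        "confidence_reduction": probs_adv[y_true] < probs_clean[y_true] - conf_drop,
    }
    if target is not None:
        # Targeted misclassification: the prediction is the chosen wrong class.
        results["targeted"] = pred_adv == target
    return results
```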
Classification by Attacker's Knowledge
The attacker's knowledge about the target model dictates the feasible attack strategies:
- White-box Attacks: The attacker has complete information about the model, including its architecture, parameters (weights and biases), activation functions, and potentially the training data or its distribution. This allows for powerful attacks that directly leverage the model's structure and gradients. Examples include the Fast Gradient Sign Method (FGSM), Projected Gradient Descent (PGD), and Carlini & Wagner (C&W) attacks, which often use gradient information to optimize the perturbation. A minimal FGSM sketch follows this list.
- Black-box Attacks: The attacker has limited or no knowledge of the model's internal workings. Interaction is typically restricted to querying the model with inputs and observing outputs.
- Score-based Attacks: The attacker can obtain confidence scores, probabilities, or logits associated with the model's predictions. This partial information can be used to estimate gradients (e.g., using finite differences, as sketched after this list) or to employ optimization techniques that rely on score feedback.
- Decision-based Attacks: The attacker only receives the final prediction label (the hard classification output) for each query. This is the most restrictive setting, often requiring a large number of queries to find an adversarial example, for instance, by exploring the decision boundary (e.g., Boundary Attack).
- Transfer Attacks: A common black-box strategy involves crafting adversarial examples against a local substitute model (trained to mimic the target or a standard pre-trained model) and then transferring these examples to the target black-box model. This leverages the phenomenon that adversarial examples often exhibit transferability across different architectures.
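As a concrete white-box example, FGSM perturbs the input in the direction of the sign of the loss gradient. The sketch below assumes a PyTorch classifier that returns logits and inputs scaled to [0, 1]; the function name and the default eps value are illustrative.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, eps=0.03):
    """One-step FGSM: x_adv = clip(x + eps * sign(grad_x loss), 0, 1).

    model: PyTorch classifier returning logits; x: input batch in [0, 1];
    y: true labels; eps: assumed L-infinity perturbation budget.
    """
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)  # loss the attacker wants to increase
    loss.backward()
    with torch.no_grad():
        x_adv = x_adv + eps * x_adv.grad.sign()  # step along the gradient sign
        x_adv = x_adv.clamp(0.0, 1.0)            # stay in the valid input range
    return x_adv.detach()
```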
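In the score-based black-box setting, the attacker can approximate the same gradient information purely from query feedback. The sketch below estimates the gradient of a scalar score (for example, the probability assigned to the true class) by symmetric finite differences; score_fn and delta are assumed names and values. Note that it costs two model queries per input coordinate, which is why query efficiency is a central concern for these attacks.

```python
import numpy as np

def estimate_gradient(score_fn, x, delta=1e-3):
    """Finite-difference estimate of d(score)/dx using only model queries.

    score_fn: callable mapping an input array to a scalar score
    (e.g., the model's probability for the true class); x: input array;
    delta: assumed step size for the symmetric difference.
    """
    grad = np.zeros_like(x, dtype=float)
    flat_x = x.reshape(-1).astype(float)
    flat_g = grad.reshape(-1)  # view into grad, so writes update grad in place
    for i in range(flat_x.size):
        step = np.zeros_like(flat_x)
        step[i] = delta
        plus = score_fn((flat_x + step).reshape(x.shape))
        minus = score_fn((flat_x - step).reshape(x.shape))
        flat_g[i] = (plus - minus) / (2.0 * delta)  # symmetric difference quotient
    return grad
```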
Classification by Attack Specificity
This dimension overlaps with the attacker's goal but focuses specifically on whether the misclassification targets a particular class:
- Untargeted Attacks: The goal is simply to cause the model to misclassify the input x_adv, irrespective of the resulting incorrect class. Mathematically, the attacker tries to find x_adv such that f(x_adv) ≠ y_true while minimizing the perturbation (e.g., ||x_adv − x||_p).
- Targeted Attacks: The goal is to cause the model to classify x_adv as a specific target class t, where t ≠ y_true. The objective is to find x_adv such that f(x_adv) = t while minimizing the perturbation. Targeted attacks are generally more challenging but also more malicious in certain scenarios (e.g., forcing a self-driving car's perception system to classify a stop sign as a speed limit sign). Both settings can be stated as constrained optimization problems, as shown below.
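The formulations below restate the two objectives in LaTeX; the ℓp norm and the constraints follow the definitions used above.

```latex
% Untargeted: any wrong prediction, with the smallest possible perturbation
\min_{x_{\mathrm{adv}}} \ \lVert x_{\mathrm{adv}} - x \rVert_p
\quad \text{subject to} \quad f(x_{\mathrm{adv}}) \neq y_{\mathrm{true}}

% Targeted: force a chosen class t \neq y_{\mathrm{true}}
\min_{x_{\mathrm{adv}}} \ \lVert x_{\mathrm{adv}} - x \rVert_p
\quad \text{subject to} \quad f(x_{\mathrm{adv}}) = t
```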
Classification by Attack Frequency/Timing
This categorization relates directly to the attack surfaces discussed earlier:
- Evasion Attacks (Test-Time / Inference-Time): These attacks occur after the model has been trained and deployed. The attacker manipulates individual inputs at inference time to cause misclassification. This is the most widely studied type of attack and assumes the model parameters are fixed. We will explore these extensively in Chapter 2.
- Poisoning Attacks (Training-Time): These attacks occur during the training phase. The attacker injects carefully crafted malicious data points (poisons) into the training dataset. The goal is to corrupt the learned model parameters, either to degrade its overall performance (availability attack) or to install backdoors that cause specific misbehavior on certain inputs later during inference (integrity attack). These will be covered in Chapter 3.
- Exploratory Attacks: These attacks don't necessarily aim to cause immediate misclassification but rather to extract information about the model or its training data. Examples include Membership Inference (determining if a specific data point was used in training), Attribute Inference (inferring sensitive attributes from training data), and Model Stealing (replicating the functionality of a proprietary model). These privacy and security implications are discussed in Chapter 4; a toy membership-inference sketch follows this list.
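As a toy illustration of the exploratory category, a simple membership-inference heuristic exploits the fact that models are often more confident on their training points than on unseen points. The function name and threshold below are assumed, tunable choices; practical attacks (e.g., shadow-model approaches) are considerably more sophisticated.

```python
import numpy as np

def likely_training_member(probs, threshold=0.9):
    """Toy membership-inference heuristic.

    probs: model output probabilities for one input; threshold: assumed
    confidence cutoff. Returns True if the model's peak confidence is high
    enough to be taken as weak evidence that the input was in the training set.
    """
    return float(np.max(probs)) >= threshold
```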
A visualization of the main axes used to categorize adversarial machine learning attacks. Real-world attacks often represent a combination of these categories (e.g., a white-box, targeted evasion attack).
Understanding this taxonomy is fundamental. It provides a framework for analyzing the threats posed to machine learning systems and helps in selecting or designing appropriate countermeasures. When evaluating a model's security or a proposed defense, it's important to consider which types of attacks (defined by these categories) are being addressed. Subsequent chapters will delve into specific attacks and defenses, often referencing this classification scheme.