Machine learning models are vulnerable to exploitation by attackers with various goals and capabilities. A primary concept for understanding and countering these vulnerabilities is the adversarial example: an input deliberately crafted to fool a model, one that looks almost identical to a legitimate input yet causes the model to produce an incorrect output.
Consider a machine learning model, represented by a function $f$, which maps an input $x$ from some domain $\mathcal{X}$ (like an image or text) to an output $f(x)$ (like a class label). Let $y_{\text{true}}$ be the correct output for a given input $x$.
An untargeted adversarial example, $x'$, is a modified input that satisfies two conditions:

1. Misclassification: $f(x') \neq y_{\text{true}}$
2. Proximity: $x'$ remains close to $x$ under some distance measure, $d(x, x') \leq \epsilon$
In a targeted attack, the goal is more specific: to make the model output a particular incorrect label $y_{\text{target}}$ (where $y_{\text{target}} \neq y_{\text{true}}$). The first condition becomes:

$$f(x') = y_{\text{target}}$$
The difference between the original input and the adversarial example is the perturbation, denoted by $\delta$:

$$\delta = x' - x$$
So, we can write the adversarial example as $x' = x + \delta$. The "proximity" condition means that the perturbation $\delta$ must be small according to some measure.
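As a small illustration of this definition, the sketch below checks both conditions for a candidate input. Note that `predict` is an assumed callable that returns a class label, and the function name and signature are illustrative, not a specific library API.

```python
import numpy as np

def is_untargeted_adversarial(predict, x, x_adv, y_true, eps, ord=np.inf):
    """Check the two conditions for an untargeted adversarial example.

    predict: assumed callable mapping an input array to a predicted class label.
    """
    delta = x_adv - x                                   # perturbation delta = x' - x
    misclassified = predict(x_adv) != y_true            # condition 1: f(x') != y_true
    close_enough = np.linalg.norm(delta.ravel(), ord=ord) <= eps  # condition 2: ||delta|| <= eps
    return misclassified and close_enough
```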
How do we mathematically quantify "small"? The standard approach in adversarial machine learning is to use $L_p$ norms to measure the magnitude of the perturbation vector $\delta$. The choice of norm reflects different assumptions about what constitutes an "imperceptible" or "allowable" change.
Let $\delta = (\delta_1, \delta_2, \ldots, \delta_n)$ be a vector of dimension $n$ (e.g., the $n$ pixels in an image). Common $L_p$ norms include:
$L_\infty$ Norm (Maximum Change): Measures the largest absolute change to any single element of the input. It's defined as:

$$\|\delta\|_\infty = \max_{i} |\delta_i|$$
An $L_\infty$ constraint $\|\delta\|_\infty \leq \epsilon$ means no single input feature (e.g., pixel value) is changed by more than $\epsilon$. This is widely used for image attacks as small uniform changes are often hard to spot. For images with pixel values normalized to $[0, 1]$, a common choice is $\epsilon = 8/255$.
$L_2$ Norm (Euclidean Distance): Measures the standard Euclidean distance between $x$ and $x'$.

$$\|\delta\|_2 = \sqrt{\sum_{i=1}^{n} \delta_i^2}$$
An $L_2$ constraint $\|\delta\|_2 \leq \epsilon$ limits the overall magnitude of the change vector. The changes might be concentrated in a few features or spread out thinly across many.
$L_1$ Norm (Sum of Absolute Changes): Measures the sum of the absolute changes across all elements.

$$\|\delta\|_1 = \sum_{i=1}^{n} |\delta_i|$$
An $L_1$ constraint $\|\delta\|_1 \leq \epsilon$ encourages sparsity, meaning the perturbation might involve larger changes but only to a very small number of features (relevant for high-dimensional sparse data like text features).
$L_0$ Norm (Number of Changed Elements): Counts the number of elements in $\delta$ that are non-zero.

$$\|\delta\|_0 = \sum_{i=1}^{n} \mathbb{1}[\delta_i \neq 0]$$
where $\mathbb{1}[\cdot]$ is the indicator function (1 if the condition is true, 0 otherwise). An $L_0$ constraint $\|\delta\|_0 \leq k$ means that at most $k$ features (e.g., pixels) can be altered. This is computationally harder to work with but directly models changes to a limited number of input components.
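To make these definitions concrete, here is a short NumPy sketch computing each norm for a toy perturbation; the values are arbitrary and chosen only for illustration.

```python
import numpy as np

# A toy perturbation for a 2x2 "image": only two of its four values change.
delta = np.array([[0.00,  0.03],
                  [-0.05, 0.00]]).ravel()

linf = np.max(np.abs(delta))        # L_inf: largest single change      -> 0.05
l2   = np.sqrt(np.sum(delta ** 2))  # L_2:   Euclidean length           -> ~0.058
l1   = np.sum(np.abs(delta))        # L_1:   total absolute change      -> 0.08
l0   = np.count_nonzero(delta)      # L_0:   number of changed elements -> 2

print(linf, l2, l1, l0)
```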
The choice of norm and the perturbation budget $\epsilon$ (or $k$ for $L_0$) are essential components of the threat model, defining the attacker's capability.
Finding an adversarial example can often be framed as an optimization problem. There are two common formulations:
Minimize Perturbation: Find the smallest perturbation (measured by some norm) that causes misclassification.

$$\min_{\delta} \|\delta\|_p \quad \text{subject to} \quad f(x + \delta) \neq y_{\text{true}}$$
We also usually need to ensure the resulting $x' = x + \delta$ remains a valid input (e.g., pixel values stay within the allowed range like $[0, 1]$).
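As a brief illustration of this validity constraint, the following NumPy sketch applies a candidate perturbation and projects the result back into an assumed $[0, 1]$ pixel range; the arrays here are placeholders.

```python
import numpy as np

x = np.random.rand(28, 28)                              # placeholder grayscale image in [0, 1]
delta = np.random.uniform(-0.03, 0.03, size=x.shape)    # a candidate perturbation

x_adv = np.clip(x + delta, 0.0, 1.0)                    # enforce the validity constraint: x' stays in [0, 1]^n
```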
Maximize Loss (within Budget): Find the perturbation that maximizes the model's prediction error (loss) while staying within a predefined perturbation budget $\epsilon$.

$$\max_{\|\delta\|_p \leq \epsilon} \mathcal{L}(f(x + \delta), y_{\text{true}})$$
Here, $\mathcal{L}$ is a loss function (like cross-entropy) that measures the discrepancy between the model's prediction on the perturbed input $x + \delta$ and the original correct label $y_{\text{true}}$. For targeted attacks, we would maximize the loss with respect to the original label $y_{\text{true}}$ or minimize it with respect to the target label $y_{\text{target}}$. This formulation directly leads to gradient-based attack methods, which we will explore in the next chapter.
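As a preview of those gradient-based methods, the sketch below takes a single ascent step on the loss under an $L_\infty$ budget using the sign of the input gradient. It assumes `model` is a trained PyTorch classifier and that `x` and `y_true` are a batched input tensor and its correct label tensor; the function name and the single-step approach are illustrative, not a prescribed algorithm.

```python
import torch
import torch.nn.functional as F

def max_loss_step(model, x, y_true, eps=8/255):
    """One gradient-ascent step on the loss under an L_inf budget (a single-step sketch)."""
    x = x.clone().detach().requires_grad_(True)
    logits = model(x)                           # model prediction f(x + delta) at delta = 0
    loss = F.cross_entropy(logits, y_true)      # L(f(x), y_true)
    loss.backward()
    delta = eps * x.grad.sign()                 # push each feature by +/- eps, so ||delta||_inf = eps
    x_adv = torch.clamp(x + delta, 0.0, 1.0)    # keep x' a valid input in [0, 1]
    return x_adv.detach()
```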
Imagine the high-dimensional space where inputs live. The model partitions this space into regions corresponding to different classes, separated by decision boundaries. An original input $x$ lies within the region for its correct class $y_{\text{true}}$. An adversarial example $x'$ is found by moving just slightly (within the $L_p$-ball of radius $\epsilon$ around $x$) across a decision boundary into a region corresponding to a different class.
Illustration of an adversarial example. The original input $x$ (blue dot) is correctly classified. A small perturbation $\delta$ is added, resulting in $x' = x + \delta$ (red dot), which lies just across the decision boundary and is misclassified. The perturbation magnitude is constrained, often by an $L_p$ norm ($\|\delta\|_p \leq \epsilon$).
This mathematical framework allows us to precisely define adversarial examples and the constraints under which they are generated. It forms the foundation for developing specific attack algorithms (like those based on gradients or optimization) and for designing and evaluating defenses, which are the subjects of the following chapters. Understanding this formulation is essential for analyzing the security properties of machine learning models.