Having established that machine learning models can be vulnerable and that attackers have various goals and capabilities, let's formalize the central concept of an adversarial example. At its core, an adversarial example is an input deliberately crafted to fool a model, looking almost identical to a legitimate input yet causing the model to produce an incorrect output.
Consider a machine learning model, represented by a function f, which maps an input x from some domain (like an image or text) to an output y (like a class label). Let y=f(x) be the correct output for a given input x.
An untargeted adversarial example, x′, is a modified input that satisfies two conditions:

1. Misclassification: the model's output on the modified input is wrong, i.e., f(x′) ≠ y.
2. Proximity: x′ stays close to the original input x, so the modification is small or imperceptible.
In a targeted attack, the goal is more specific: to make the model output a particular incorrect label y_target (where y_target ≠ y). The first condition becomes:
$$f(x') = y_{\text{target}}$$

The difference between the original input and the adversarial example is the perturbation, denoted by δ:
$$\delta = x' - x$$

So, we can write the adversarial example as x′ = x + δ. The "proximity" condition means that the perturbation δ must be small according to some measure.
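As a concrete illustration, here is a minimal sketch, assuming a hypothetical PyTorch classifier `model`, a single input tensor `x` with correct integer label `y`, and a candidate perturbation `delta` of the same shape as `x`. It forms x′ = x + δ and checks the misclassification condition for both the untargeted and targeted cases.

```python
import torch

def is_untargeted_adversarial(model, x, y, delta):
    """Check the untargeted condition f(x + delta) != y."""
    x_adv = x + delta                                     # x' = x + delta
    pred = model(x_adv.unsqueeze(0)).argmax(dim=1).item()
    return pred != y

def is_targeted_adversarial(model, x, y_target, delta):
    """Check the targeted condition f(x + delta) == y_target."""
    x_adv = x + delta
    pred = model(x_adv.unsqueeze(0)).argmax(dim=1).item()
    return pred == y_target
```

The proximity condition is checked separately by measuring the size of δ, which is where the norms discussed next come in.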
How do we mathematically quantify "small"? The standard approach in adversarial machine learning is to use Lp norms to measure the magnitude of the perturbation vector δ. The choice of norm reflects different assumptions about what constitutes an "imperceptible" or "allowable" change.
Let δ be a vector of dimension d (e.g., d pixels in an image). Common norms include the following; a short code sketch after the list shows how each is computed:
L∞ Norm (Maximum Change): Measures the largest absolute change to any single element of the input. It's defined as:
$$\|\delta\|_{\infty} = \max_{i=1,\ldots,d} |\delta_i|$$

An L∞ constraint ∥δ∥∞ ≤ ϵ means no single input feature (e.g., pixel value) is changed by more than ϵ. This is widely used for image attacks because small, uniformly bounded changes are often hard to spot. For images with pixel values normalized to [0,1], a common ϵ value is 8/255.
L2 Norm (Euclidean Distance): Measures the standard Euclidean distance between x and x′.
$$\|\delta\|_{2} = \sqrt{\sum_{i=1}^{d} \delta_i^2}$$

An L2 constraint ∥δ∥2 ≤ ϵ limits the overall magnitude of the change vector. The changes might be concentrated in a few features or spread out thinly.
L1 Norm (Sum of Absolute Changes): Measures the sum of the absolute changes across all elements.
$$\|\delta\|_{1} = \sum_{i=1}^{d} |\delta_i|$$

An L1 constraint ∥δ∥1 ≤ ϵ encourages sparsity, meaning the perturbation might involve larger changes but only to a very small number of features (relevant for high-dimensional sparse data like text features).
L0 Norm (Number of Changed Elements): Counts the number of elements in δ that are non-zero.
$$\|\delta\|_{0} = \sum_{i=1}^{d} I(\delta_i \neq 0)$$

where I(⋅) is the indicator function (1 if true, 0 otherwise). An L0 constraint ∥δ∥0 ≤ k means that at most k features (e.g., pixels) can be altered. This is computationally harder to work with but directly models changes to a limited number of input components.
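As referenced above, here is a short sketch that computes each of these norms for a perturbation tensor `delta`; the shapes and values below are purely illustrative.

```python
import torch

def perturbation_norms(delta):
    """Compute the L-infinity, L2, L1, and L0 norms of a perturbation."""
    flat = delta.flatten()
    return {
        "linf": flat.abs().max().item(),   # largest change to any single element
        "l2":   flat.norm(p=2).item(),     # Euclidean magnitude of the change
        "l1":   flat.abs().sum().item(),   # total absolute change
        "l0":   (flat != 0).sum().item(),  # number of elements that changed
    }

# Example: a perturbation that alters only two pixels of a 3x32x32 image
delta = torch.zeros(3, 32, 32)
delta[0, 0, 0] = 8 / 255
delta[1, 5, 5] = -8 / 255
print(perturbation_norms(delta))
# approximately {'linf': 0.0314, 'l2': 0.0444, 'l1': 0.0627, 'l0': 2}
```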
The choice of Lp norm and the perturbation budget ϵ (or k for L0) are essential components of the threat model, defining the attacker's capability.
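In code, this part of the threat model often reduces to just a pair of values; a hypothetical container might look like the following sketch.

```python
from dataclasses import dataclass

@dataclass
class ThreatModel:
    """Hypothetical description of the attacker's perturbation constraint."""
    norm: str       # which Lp norm: "linf", "l2", "l1", or "l0"
    budget: float   # epsilon (or k, the number of changed features, for L0)

# Example: the common L-infinity threat model for images in [0, 1]
threat = ThreatModel(norm="linf", budget=8 / 255)
```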
Finding an adversarial example can often be framed as an optimization problem. There are two common formulations:
Minimize Perturbation: Find the smallest perturbation δ (measured by some Lp norm) that causes misclassification.
$$\min_{\delta} \|\delta\|_p \quad \text{subject to} \quad f(x+\delta) \neq y$$

We also usually need to ensure the resulting x′ = x + δ remains a valid input (e.g., pixel values stay within the allowed range like [0,1]).
Maximize Loss (within Budget): Find the perturbation δ that maximizes the model's prediction error (loss) while staying within a predefined perturbation budget ϵ.
$$\max_{\delta \,:\, \|\delta\|_p \le \epsilon} L\big(f(x+\delta),\, y\big)$$

Here, L is a loss function (like cross-entropy) that measures the discrepancy between the model's prediction on the perturbed input f(x+δ) and the original correct label y. For targeted attacks, we would maximize the loss with respect to the original label or minimize it with respect to the target label y_target. This formulation directly leads to gradient-based attack methods, which we will explore in the next chapter.
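To make the maximize-loss formulation concrete, here is a minimal sketch that takes a single signed-gradient ascent step on the cross-entropy loss and then clips the result to the valid input range. This is only a one-step preview of the gradient-based methods covered in the next chapter; the `model`, `x`, `y`, and `epsilon` names are placeholders, and the step size is deliberately set to the full budget ϵ.

```python
import torch
import torch.nn.functional as F

def one_step_loss_maximization(model, x, y, epsilon=8 / 255):
    """One gradient ascent step on L(f(x + delta), y) under an L-infinity budget."""
    delta = torch.zeros_like(x, requires_grad=True)

    # Loss of the model on the (initially unperturbed) input against the true label y
    logits = model((x + delta).unsqueeze(0))
    loss = F.cross_entropy(logits, torch.tensor([y]))
    loss.backward()

    with torch.no_grad():
        # Step in the direction that increases the loss, staying inside the L-inf ball
        delta = epsilon * delta.grad.sign()
        # Keep x' = x + delta a valid input by clipping to [0, 1]
        x_adv = (x + delta).clamp(0.0, 1.0)
    return x_adv
```

For a targeted attack, one would instead step so as to decrease the loss with respect to y_target.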
Imagine the high-dimensional space where inputs live. The model f partitions this space into regions corresponding to different classes, separated by decision boundaries. An original input x lies within the region for its correct class y. An adversarial example x′ is found by moving x just slightly (within the Lp-ball of radius ϵ) across a decision boundary into a region corresponding to a different class y′.
Illustration of an adversarial example. The original input x (blue dot) is correctly classified. A small perturbation δ is added, resulting in x′ (red dot), which lies just across the decision boundary and is misclassified. The perturbation magnitude is constrained, often by an Lp norm (∥δ∥p≤ϵ).
This mathematical framework allows us to precisely define adversarial examples and the constraints under which they are generated. It forms the foundation for developing specific attack algorithms (like those based on gradients or optimization) and for designing and evaluating defenses, which are the subjects of the following chapters. Understanding this formulation is essential for analyzing the security properties of machine learning models.