Evaluating the effectiveness of a defense mechanism requires moving beyond standard performance measures like test accuracy. When adversarial threats are present, we need specific metrics that quantify how well a model withstands attempts to manipulate its predictions. These metrics help us compare different defenses, understand a model's vulnerabilities, and make informed decisions about deploying secure systems.
The most straightforward metric is Accuracy Under Attack. This measures the model's classification accuracy on a dataset after each input has been perturbed by a specific adversarial attack.
To calculate this, take a labeled evaluation set, apply the chosen attack to every input to produce a perturbed version within the allowed budget ϵ, and then measure the fraction of these perturbed inputs that the model still classifies correctly.
For example, you might report "Model A achieved 65% accuracy under a PGD L∞ attack with ϵ=8/255". This provides a concrete performance number under a defined threat. However, this metric is highly dependent on the strength and type of the attack used for evaluation. A model might seem robust against a weak attack (like single-step FGSM) but fail against a stronger one (like multi-step PGD).
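As a minimal sketch of how Accuracy Under Attack can be computed, the loop below assumes a PyTorch classifier `model`, a `test_loader` of (input, label) batches, and an attack function `attack_fn(model, x, y)` that returns perturbed inputs within the chosen budget (for instance, the PGD routine sketched later in this section). These names are illustrative placeholders, not a specific library's API.

```python
import torch

def accuracy_under_attack(model, test_loader, attack_fn, device="cpu"):
    """Fraction of test inputs classified correctly after being perturbed by attack_fn."""
    model.eval()
    correct, total = 0, 0
    for x, y in test_loader:
        x, y = x.to(device), y.to(device)
        # Generate adversarial versions of this batch (gradients are needed here,
        # so the attack call is not wrapped in torch.no_grad()).
        x_adv = attack_fn(model, x, y)
        with torch.no_grad():
            preds = model(x_adv).argmax(dim=1)
        correct += (preds == y).sum().item()
        total += y.numel()
    return correct / total
```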
A more comprehensive goal is to measure Adversarial Robustness, often called Robust Accuracy. This aims to quantify the model's accuracy against the worst-case perturbation within a given budget ϵ and Lp norm. Ideally, for a given input x with label y, we want to know whether the model still predicts y under the loss-maximizing perturbation
$$\delta^{*} = \arg\max_{\|\delta\|_p \le \epsilon} L\big(f(x+\delta),\, y\big),$$
where L is the loss function and f is the model; the sample counts as robust if f(x+δ*) still equals y. Since finding the exact worst-case perturbation δ* is computationally intractable for complex models like deep neural networks, Robust Accuracy is typically estimated by running strong iterative attacks (such as PGD with many steps) as a proxy for the worst-case adversary within the ϵ-ball.
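A common proxy of this kind is a multi-step L∞ PGD attack. The sketch below is one minimal formulation, assuming inputs scaled to [0, 1]; the default budget, step size, and step count are illustrative choices rather than fixed values.

```python
import torch
import torch.nn.functional as F

def pgd_linf(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """Multi-step L-infinity PGD: try to maximize the loss within the eps-ball around x."""
    # Random start inside the eps-ball.
    delta = torch.empty_like(x).uniform_(-eps, eps)
    delta.requires_grad_(True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y)
        grad = torch.autograd.grad(loss, delta)[0]
        # Ascend the loss with a signed-gradient step, then project back into the eps-ball.
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps)
        # Keep the perturbed input inside the valid range [0, 1].
        delta = (x + delta).clamp(0, 1) - x
        delta = delta.detach().requires_grad_(True)
    return (x + delta).detach()
```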
Robust Accuracy is always reported relative to a specific perturbation model, defined by the Lp norm and the budget ϵ. For instance, "The model exhibits 55% robust accuracy against L∞ perturbations with ϵ=4/255."
The choice of the Lp norm is fundamental because it defines how we measure the "size" or "magnitude" of the adversarial perturbation δ. Different norms capture different types of changes:
L∞ Norm (Maximum Norm): Measures the maximum absolute change across all input features (e.g., pixels in an image): $\|\delta\|_\infty = \max_i |\delta_i|$. Small L∞ perturbations result in subtle changes distributed across many features, often imperceptible to humans. This is commonly used for image perturbations. An ϵ of 8/255 for pixel values in [0, 255] means no single pixel value changes by more than 8.
L2 Norm (Euclidean Norm): Measures the standard Euclidean distance between the original input and the perturbed input: $\|\delta\|_2 = \sqrt{\sum_i \delta_i^2}$. L2 perturbations distribute the change while bounding its total magnitude, allowing slightly larger changes in some features if others change less.
L0 Norm (Sparsity Norm): Counts the number of features that have been modified: $\|\delta\|_0 = \sum_i \mathbb{I}(\delta_i \neq 0)$, where $\mathbb{I}$ is the indicator function. This norm is relevant for attacks that modify only a few features (e.g., changing a few pixels significantly).
L1 Norm (Manhattan Norm): Measures the sum of the absolute changes across all features: $\|\delta\|_1 = \sum_i |\delta_i|$. Like L0, L1 can encourage sparsity in the perturbation, concentrating changes in fewer features than L2 or L∞ for a similar overall magnitude.
The choice of norm should ideally reflect the anticipated threat model or the domain's characteristics. For images, L∞ and L2 are common, while L0 or L1 might be more relevant for tabular or text data where changing only a few features makes sense.
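To make the differences concrete, the snippet below computes all four norms of the perturbation between an original and a perturbed input using NumPy. It is a plain numerical illustration, not part of any attack library.

```python
import numpy as np

def perturbation_norms(x_orig, x_adv):
    """Return the L-infinity, L2, L1, and L0 sizes of delta = x_adv - x_orig."""
    delta = (x_adv - x_orig).ravel()
    return {
        "linf": np.max(np.abs(delta)),       # largest single-feature change
        "l2": np.sqrt(np.sum(delta ** 2)),   # Euclidean length of the change
        "l1": np.sum(np.abs(delta)),         # total absolute change
        "l0": int(np.count_nonzero(delta)),  # number of features modified
    }
```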
Instead of fixing ϵ and measuring accuracy, we can ask: for a given input x, what is the smallest perturbation δ (measured by an Lp norm) needed to cause a misclassification? This is the minimum perturbation distance or adversarial threshold:
$$\epsilon^{*}(x) = \min_{\delta}\ \big\{\, \|\delta\|_p \ \ \text{s.t.}\ \ f(x+\delta) \neq y \,\big\},$$ where y is the true label of x.
Calculating ϵ∗(x) precisely can be computationally demanding, often requiring optimization procedures similar to attack generation. However, reporting the average minimum perturbation distance across a dataset provides a valuable robustness metric. A higher average ϵ∗ indicates greater robustness, as a larger change is needed, on average, to fool the model. This metric offers a more granular view than accuracy under a fixed ϵ.
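One common way to approximate ϵ*(x) empirically is a search over the budget: run an attack at different values of ϵ and record the smallest one at which it succeeds. The sketch below binary-searches ϵ for a single input (batch size 1), assuming an `attack_fn(model, x, y, eps=...)` such as the PGD routine above; it yields an upper bound on the true minimum distance, since the attack may miss smaller adversarial perturbations, and it assumes attack success is roughly monotone in ϵ.

```python
import torch

def min_perturbation_estimate(model, x, y, attack_fn, eps_lo=0.0, eps_hi=0.5, iters=10):
    """Binary search for the smallest L-infinity budget at which attack_fn fools the model."""
    model.eval()
    for _ in range(iters):
        eps_mid = 0.5 * (eps_lo + eps_hi)
        x_adv = attack_fn(model, x, y, eps=eps_mid)
        with torch.no_grad():
            fooled = (model(x_adv).argmax(dim=1) != y).any().item()
        if fooled:
            eps_hi = eps_mid   # attack succeeded: try a smaller budget
        else:
            eps_lo = eps_mid   # attack failed: a larger budget is needed
    return eps_hi  # upper bound on the minimum perturbation distance
```

The `pgd_linf` sketch above can be passed directly as `attack_fn`, though in practice its step size should also be scaled with ϵ.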
Closely related to Accuracy Under Attack is the Attack Success Rate (ASR). It measures the percentage of originally correctly classified samples for which a specific attack successfully finds an adversarial example within the given constraints (ϵ, Lp).
$$\text{ASR} = \frac{\#\,\text{Successful Attacks}}{\#\,\text{Samples Originally Correctly Classified}}$$
ASR focuses specifically on the attacker's ability to compromise correct predictions. If a model has an original accuracy of 90% and an ASR of 50% for a given attack, its Accuracy Under Attack will be roughly 90%×(1−0.50)=45%. (The exact relationship depends slightly on how misclassified samples are handled by the attack).
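The relationship described above is easy to express directly. The helper below computes ASR and Accuracy Under Attack from boolean correctness masks on clean and adversarial inputs; it is a small illustrative function, not a standard API, and it counts an attack as successful only on samples that were correct to begin with.

```python
import numpy as np

def asr_and_accuracy(clean_correct, adv_correct):
    """clean_correct, adv_correct: boolean arrays with one entry per test sample."""
    clean_correct = np.asarray(clean_correct, dtype=bool)
    adv_correct = np.asarray(adv_correct, dtype=bool)
    originally_correct = clean_correct.sum()
    # An attack "succeeds" on a sample that was correct before but wrong after perturbation.
    successful = (clean_correct & ~adv_correct).sum()
    asr = successful / max(originally_correct, 1)
    accuracy_under_attack = adv_correct.mean()
    return asr, accuracy_under_attack
```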
While the metrics above are typically empirical (based on running specific attacks), Certified Robustness provides a provable guarantee. A certified defense can, for a given input x, guarantee that no perturbation δ within a certain Lp ball of radius ϵ can cause a misclassification.
Metrics associated with certified defenses often include:
Certified Accuracy: the fraction of test inputs that are both classified correctly and provably robust to every perturbation within the specified radius ϵ.
Average Certified Radius: the largest radius for which robustness can be proven for each input, averaged over the test set.
Certified robustness offers stronger assurances than empirical evaluations but often comes at the cost of lower standard accuracy or smaller provable robustness radii compared to what empirical attacks might suggest. Techniques like Randomized Smoothing are prominent methods for achieving certified robustness.
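As one illustration of how a certified metric can be produced, the sketch below follows the basic randomized smoothing recipe: classify many Gaussian-noised copies of a single input (batch size 1), and if one class clearly dominates, report a certified L2 radius of σ·Φ⁻¹(p) based on an estimate p of that class's probability. This simplified version skips the separate selection/estimation split and the confidence bounds of the full procedure, so treat it as a conceptual sketch rather than a sound certification routine.

```python
import torch
from scipy.stats import norm

def smoothed_prediction_and_radius(model, x, num_classes, sigma=0.25, n_samples=1000):
    """Majority vote over Gaussian-noised copies of x, plus a simplified certified L2 radius."""
    model.eval()
    counts = torch.zeros(num_classes)
    with torch.no_grad():
        for _ in range(n_samples):
            noisy = x + sigma * torch.randn_like(x)          # add isotropic Gaussian noise
            counts[model(noisy).argmax(dim=1).item()] += 1   # vote of the base classifier
    top_class = int(counts.argmax())
    p_hat = counts[top_class].item() / n_samples             # naive estimate of the top-class probability
    if p_hat <= 0.5:
        return top_class, 0.0                                # no certificate without a clear majority
    p_hat = min(p_hat, 1 - 1e-6)                             # the full method uses a confidence lower bound here
    radius = sigma * norm.ppf(p_hat)                         # certified L2 radius: sigma * Phi^{-1}(p)
    return top_class, radius
```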
Example relationship between the perturbation budget ϵ and estimated robust accuracy for different Lp norms. Note that the scale and units of ϵ differ significantly between norms (L∞ typically uses small values such as 8/255, L2 uses values around 0.5-3.0, L0 counts modified pixels). Robust accuracy generally decreases as the allowed perturbation magnitude increases.
No single metric provides a complete picture of model security. A thorough evaluation should involve:
Reporting clean (standard) accuracy alongside robust accuracy, so any trade-off is visible.
Evaluating against multiple strong, preferably adaptive, attacks rather than a single fixed one.
Covering more than one Lp norm and a range of perturbation budgets ϵ.
Complementing empirical results with certified guarantees where they are available.
Choosing and interpreting these metrics correctly is fundamental for understanding the true security posture of machine learning models in adversarial environments.