Evaluating the effectiveness of a defense mechanism requires moving beyond standard performance measures like test accuracy. When adversarial threats are present, we need specific metrics that quantify how well a model withstands attempts to manipulate its predictions. These metrics help us compare different defenses, understand a model's vulnerabilities, and make informed decisions about deploying secure systems.
The most straightforward metric is Accuracy Under Attack. This measures the model's classification accuracy on a dataset after each input has been perturbed by a specific adversarial attack.
To calculate this, take a labeled evaluation set, apply the chosen attack to every input to produce a perturbed version within the allowed budget ϵ, and then measure the fraction of these perturbed inputs that the model still classifies correctly.
For example, you might report "Model A achieved 65% accuracy under a PGD L∞ attack with ϵ=8/255". This provides a concrete performance number under a defined threat. However, this metric is highly dependent on the strength and type of the attack used for evaluation. A model might seem robust against a weak attack (like single-step FGSM) but fail against a stronger one (like multi-step PGD).
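As a minimal sketch of how Accuracy Under Attack can be computed, the loop below assumes a PyTorch classifier `model`, a `test_loader` of (input, label) batches, and an attack function `attack_fn(model, x, y)` that returns perturbed inputs within the chosen budget (for instance, the PGD routine sketched later in this section). These names are illustrative placeholders, not a specific library's API.

```python
import torch

def accuracy_under_attack(model, test_loader, attack_fn, device="cpu"):
    """Fraction of test inputs classified correctly after being perturbed by attack_fn."""
    model.eval()
    correct, total = 0, 0
    for x, y in test_loader:
        x, y = x.to(device), y.to(device)
        # Generate adversarial versions of this batch (gradients are needed here,
        # so the attack call is not wrapped in torch.no_grad()).
        x_adv = attack_fn(model, x, y)
        with torch.no_grad():
            preds = model(x_adv).argmax(dim=1)
        correct += (preds == y).sum().item()
        total += y.numel()
    return correct / total
```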
A more comprehensive goal is to measure Adversarial Robustness, often called Robust Accuracy. This aims to quantify the model's accuracy against the worst-case perturbation within a given budget ϵ and Lp norm. Ideally, for a given input x with label y, we want to know whether the model still predicts y under the loss-maximizing perturbation
$$\delta^{*} = \arg\max_{\|\delta\|_p \le \epsilon} L\big(f(x+\delta),\, y\big),$$
where L is the loss function and f is the model; the sample counts as robust if f(x+δ*) still equals y. Since finding the exact worst-case perturbation δ* is computationally intractable for complex models like deep neural networks, Robust Accuracy is typically estimated by running strong iterative attacks (such as PGD with many steps) as a proxy for the worst-case adversary within the ϵ-ball.
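A common proxy of this kind is a multi-step L∞ PGD attack. The sketch below is one minimal formulation, assuming inputs scaled to [0, 1]; the default budget, step size, and step count are illustrative choices rather than fixed values.

```python
import torch
import torch.nn.functional as F

def pgd_linf(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """Multi-step L-infinity PGD: try to maximize the loss within the eps-ball around x."""
    # Random start inside the eps-ball.
    delta = torch.empty_like(x).uniform_(-eps, eps)
    delta.requires_grad_(True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y)
        grad = torch.autograd.grad(loss, delta)[0]
        # Ascend the loss with a signed-gradient step, then project back into the eps-ball.
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps)
        # Keep the perturbed input inside the valid range [0, 1].
        delta = (x + delta).clamp(0, 1) - x
        delta = delta.detach().requires_grad_(True)
    return (x + delta).detach()
```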
Robust Accuracy is always reported relative to a specific perturbation model, defined by the Lp norm and the budget ϵ. For instance, "The model exhibits 55% robust accuracy against L∞ perturbations with ϵ=4/255."
The choice of the Lp norm is fundamental because it defines how we measure the "size" or "magnitude" of the adversarial perturbation δ. Different norms capture different types of changes:
L∞ Norm (Maximum Norm): Measures the maximum absolute change across all input features (e.g., pixels in an image): $\|\delta\|_\infty = \max_i |\delta_i|$. Small L∞ perturbations result in subtle changes distributed across many features, often imperceptible to humans. This is commonly used for image perturbations. An ϵ of 8/255 for pixel values in [0, 255] means no single pixel value changes by more than 8.
L2 Norm (Euclidean Norm): Measures the standard Euclidean distance between the original input and the perturbed input: $\|\delta\|_2 = \sqrt{\sum_i \delta_i^2}$. L2 perturbations distribute the change while bounding its total magnitude, allowing slightly larger changes in some features if others change less.
L0 Norm (Sparsity Norm): Counts the number of features that have been modified: $\|\delta\|_0 = \sum_i \mathbb{I}(\delta_i \neq 0)$, where $\mathbb{I}$ is the indicator function. This norm is relevant for attacks that modify only a few features (e.g., changing a few pixels significantly).
L1 Norm (Manhattan Norm): Measures the sum of the absolute changes across all features: $\|\delta\|_1 = \sum_i |\delta_i|$. Like L0, L1 can encourage sparsity in the perturbation, concentrating changes in fewer features than L2 or L∞ for a similar overall magnitude.
The choice of norm should ideally reflect the anticipated threat model or the domain's characteristics. For images, L∞ and L2 are common, while L0 or L1 might be more relevant for tabular or text data where changing only a few features makes sense.
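To make the differences concrete, the snippet below computes all four norms of the perturbation between an original and a perturbed input using NumPy. It is a plain numerical illustration, not part of any attack library.

```python
import numpy as np

def perturbation_norms(x_orig, x_adv):
    """Return the L-infinity, L2, L1, and L0 sizes of delta = x_adv - x_orig."""
    delta = (x_adv - x_orig).ravel()
    return {
        "linf": np.max(np.abs(delta)),       # largest single-feature change
        "l2": np.sqrt(np.sum(delta ** 2)),   # Euclidean length of the change
        "l1": np.sum(np.abs(delta)),         # total absolute change
        "l0": int(np.count_nonzero(delta)),  # number of features modified
    }
```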
Instead of fixing ϵ and measuring accuracy, we can ask: for a given input x, what is the smallest perturbation δ (measured by an Lp norm) needed to cause a misclassification? This is the minimum perturbation distance or adversarial threshold:
$$\epsilon^{*}(x) = \min_{\delta}\ \big\{\, \|\delta\|_p \ \ \text{s.t.}\ \ f(x+\delta) \neq y \,\big\},$$ where y is the true label of x.
Calculating ϵ∗(x) precisely can be computationally demanding, often requiring optimization procedures similar to attack generation. However, reporting the average minimum perturbation distance across a dataset provides a valuable robustness metric. A higher average ϵ∗ indicates greater robustness, as a larger change is needed, on average, to fool the model. This metric offers a more granular view than accuracy under a fixed ϵ.
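One common way to approximate ϵ*(x) empirically is a search over the budget: run an attack at different values of ϵ and record the smallest one at which it succeeds. The sketch below binary-searches ϵ for a single input (batch size 1), assuming an `attack_fn(model, x, y, eps=...)` such as the PGD routine above; it yields an upper bound on the true minimum distance, since the attack may miss smaller adversarial perturbations, and it assumes attack success is roughly monotone in ϵ.

```python
import torch

def min_perturbation_estimate(model, x, y, attack_fn, eps_lo=0.0, eps_hi=0.5, iters=10):
    """Binary search for the smallest L-infinity budget at which attack_fn fools the model."""
    model.eval()
    for _ in range(iters):
        eps_mid = 0.5 * (eps_lo + eps_hi)
        x_adv = attack_fn(model, x, y, eps=eps_mid)
        with torch.no_grad():
            fooled = (model(x_adv).argmax(dim=1) != y).any().item()
        if fooled:
            eps_hi = eps_mid   # attack succeeded: try a smaller budget
        else:
            eps_lo = eps_mid   # attack failed: a larger budget is needed
    return eps_hi  # upper bound on the minimum perturbation distance
```

The `pgd_linf` sketch above can be passed directly as `attack_fn`, though in practice its step size should also be scaled with ϵ.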
Closely related to Accuracy Under Attack is the Attack Success Rate (ASR). It measures the percentage of originally correctly classified samples for which a specific attack successfully finds an adversarial example within the given constraints (ϵ, Lp).
$$\text{ASR} = \frac{\#\,\text{Successful Attacks}}{\#\,\text{Samples Originally Correctly Classified}}$$
ASR focuses specifically on the attacker's ability to compromise correct predictions. If a model has an original accuracy of 90% and an ASR of 50% for a given attack, its Accuracy Under Attack will be roughly 90%×(1−0.50)=45%. (The exact relationship depends slightly on how misclassified samples are handled by the attack).
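The relationship described above is easy to express directly. The helper below computes ASR and Accuracy Under Attack from boolean correctness masks on clean and adversarial inputs; it is a small illustrative function, not a standard API, and it counts an attack as successful only on samples that were correct to begin with.

```python
import numpy as np

def asr_and_accuracy(clean_correct, adv_correct):
    """clean_correct, adv_correct: boolean arrays with one entry per test sample."""
    clean_correct = np.asarray(clean_correct, dtype=bool)
    adv_correct = np.asarray(adv_correct, dtype=bool)
    originally_correct = clean_correct.sum()
    # An attack "succeeds" on a sample that was correct before but wrong after perturbation.
    successful = (clean_correct & ~adv_correct).sum()
    asr = successful / max(originally_correct, 1)
    accuracy_under_attack = adv_correct.mean()
    return asr, accuracy_under_attack
```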
While the metrics above are typically empirical (based on running specific attacks), Certified Robustness provides a provable guarantee. A certified defense can, for a given input x, guarantee that no perturbation δ within a certain Lp ball of radius ϵ can cause a misclassification.
Metrics associated with certified defenses often include:
Certified Accuracy: the fraction of test inputs that are both classified correctly and provably robust to every perturbation within the specified radius ϵ.
Average Certified Radius: the largest radius for which robustness can be proven for each input, averaged over the test set.
Certified robustness offers stronger assurances than empirical evaluations but often comes at the cost of lower standard accuracy or smaller provable robustness radii compared to what empirical attacks might suggest. Techniques like Randomized Smoothing are prominent methods for achieving certified robustness.
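As one illustration of how a certified metric can be produced, the sketch below follows the basic randomized smoothing recipe: classify many Gaussian-noised copies of a single input (batch size 1), and if one class clearly dominates, report a certified L2 radius of σ·Φ⁻¹(p) based on an estimate p of that class's probability. This simplified version skips the separate selection/estimation split and the confidence bounds of the full procedure, so treat it as a conceptual sketch rather than a sound certification routine.

```python
import torch
from scipy.stats import norm

def smoothed_prediction_and_radius(model, x, num_classes, sigma=0.25, n_samples=1000):
    """Majority vote over Gaussian-noised copies of x, plus a simplified certified L2 radius."""
    model.eval()
    counts = torch.zeros(num_classes)
    with torch.no_grad():
        for _ in range(n_samples):
            noisy = x + sigma * torch.randn_like(x)          # add isotropic Gaussian noise
            counts[model(noisy).argmax(dim=1).item()] += 1   # vote of the base classifier
    top_class = int(counts.argmax())
    p_hat = counts[top_class].item() / n_samples             # naive estimate of the top-class probability
    if p_hat <= 0.5:
        return top_class, 0.0                                # no certificate without a clear majority
    p_hat = min(p_hat, 1 - 1e-6)                             # the full method uses a confidence lower bound here
    radius = sigma * norm.ppf(p_hat)                         # certified L2 radius: sigma * Phi^{-1}(p)
    return top_class, radius
```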
Example relationship between the perturbation budget ϵ and estimated robust accuracy for different Lp norms. Note that the scale and units of ϵ differ significantly between norms (L∞ typically uses small values such as 8/255, L2 uses values around 0.5-3.0, L0 counts modified pixels). Robust accuracy generally decreases as the allowed perturbation magnitude increases.
No single metric provides a complete picture of model security. A thorough evaluation should involve:
Reporting clean (standard) accuracy alongside robust accuracy, so any trade-off is visible.
Evaluating against multiple strong, preferably adaptive, attacks rather than a single fixed one.
Covering more than one Lp norm and a range of perturbation budgets ϵ.
Complementing empirical results with certified guarantees where they are available.
Choosing and interpreting these metrics correctly is fundamental for understanding the true security posture of machine learning models in adversarial environments.