After running numerous attacks and defenses through evaluation frameworks, you are left with a collection of numbers, charts, and logs. The critical task now is to make sense of these results. Simply reporting "Model X achieved 60% accuracy against PGD-L∞ with ϵ=8/255" is insufficient. Interpreting robustness requires careful consideration of the context, the metrics used, and potential evaluation pitfalls.
Contextualizing Robustness Numbers
A single robustness metric, like accuracy under attack, is only meaningful when placed within a proper context. Ask yourself these questions:
- What was the exact attack configuration? Specify the attack algorithm (e.g., PGD, C&W, AutoAttack), the perturbation constraint (Lp norm and magnitude ϵ), the number of attack iterations, step size, random initializations, and any other hyperparameters. Small changes in these parameters can significantly alter outcomes (a configuration sketch follows this list).
- What is the baseline? How does the robust accuracy compare to the model's accuracy on clean, unperturbed data? What is the accuracy of a standard, undefended model under the same attack? Comparing against these baselines reveals the effectiveness of the defense and its associated cost (often a drop in clean accuracy).
- What threat model does the evaluation represent? Does the evaluation assume a white-box attacker with full knowledge, or a more restricted black-box scenario? The interpretation changes drastically based on the assumed attacker capabilities. Robustness against a weak attack might offer little comfort against a stronger adversary.
- Was the evaluation adaptive? As discussed previously, defenses can sometimes achieve apparent robustness by masking gradients or other techniques that break specific attack algorithms. Did the evaluation employ adaptive attacks specifically designed to circumvent the defense mechanism? Robustness demonstrated only against standard, non-adaptive attacks should be viewed with skepticism.
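To make the first two questions concrete, the sketch below bundles the attack configuration and the relevant baselines into a single record that travels with the headline number. It is a minimal illustration in plain Python; every field name and value is a hypothetical placeholder rather than a recommended setting.

```python
# A minimal sketch of recording the full evaluation context next to the
# result it produced. All values below are illustrative placeholders.
from dataclasses import dataclass, asdict
import json

@dataclass
class RobustnessResult:
    # Attack configuration: report every parameter, since small changes matter.
    attack: str = "PGD"
    norm: str = "Linf"
    epsilon: float = 8 / 255
    step_size: float = 2 / 255
    iterations: int = 40
    random_restarts: int = 5
    adaptive: bool = False           # was the attack adapted to the defense?
    threat_model: str = "white-box"  # assumed attacker knowledge

    # Numbers that should always be reported together, never in isolation.
    clean_accuracy: float = 0.0              # defended model, unperturbed data
    robust_accuracy: float = 0.0             # defended model, under this attack
    undefended_robust_accuracy: float = 0.0  # standard model, same attack

result = RobustnessResult(clean_accuracy=0.84,
                          robust_accuracy=0.52,
                          undefended_robust_accuracy=0.03)
print(json.dumps(asdict(result), indent=2))
```

Reporting the clean and undefended numbers alongside the robust accuracy makes both the benefit of the defense and its cost visible at a glance.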
Beyond Simple Accuracy
While accuracy under a specific attack is a common metric, it doesn't tell the whole story. A comprehensive interpretation should consider multiple facets:
- Accuracy vs. Perturbation Budget (ϵ): Instead of reporting accuracy at a single ϵ, analyze the trend. How quickly does accuracy drop as the allowed perturbation magnitude increases? Plotting accuracy curves against ϵ provides a much richer picture of a model's resilience (see the sweep sketch after this list). Robust models typically exhibit a more gradual decline in accuracy compared to standard models.
Figure: Accuracy degradation under increasing L∞ PGD attack strength (ϵ) for a standard model vs. a robust model.
- Robustness-Accuracy Trade-off: Many effective defenses, particularly adversarial training, often lead to a decrease in accuracy on clean, unperturbed data. Quantify this trade-off. Is the gain in robustness worth the potential drop in standard performance for the target application?
- Different Lp Norms: Evaluate robustness under different constraints, such as L∞, L2, and potentially L0 or L1. Robustness in one norm does not automatically imply robustness in others. For instance, a model robust to small, widely distributed changes (L∞-bounded) might still be vulnerable to perturbations that concentrate larger changes on fewer pixels (L2- or L0-bounded). The relevant norm often depends on the application domain and the expected perturbation types.
- Minimum Perturbation: Instead of fixing ϵ and measuring accuracy, you can measure the average minimum perturbation required to cause misclassification. A higher average minimum ϵ indicates greater robustness; the sweep sketch after this list includes a coarse estimate of this quantity.
- Attack Transferability: If evaluating black-box robustness, consider the transferability of attacks generated against different surrogate models. High transferability suggests vulnerabilities that are not specific to one model's architecture or weights.
- Computational Cost: Defenses like adversarial training or certified defenses can significantly increase training time or inference latency. This practical cost must be weighed against the security benefits.
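As a concrete companion to the perturbation-budget and minimum-perturbation points above, here is a minimal PyTorch sketch that sweeps ϵ with a hand-rolled L∞ PGD, recording the accuracy curve and a coarse upper bound on each example's minimum perturbation. It assumes a classifier `model` already in eval mode, inputs `x` scaled to [0, 1], integer labels `y`, and illustrative attack hyperparameters; none of these names or settings come from a specific library.

```python
# A minimal epsilon-sweep sketch, assuming a PyTorch classifier `model`
# (in eval mode), inputs `x` in [0, 1], and integer labels `y`.
import torch
import torch.nn.functional as F

def pgd_linf(model, x, y, eps, step, iters):
    """Basic L-infinity PGD with a random start; returns perturbed inputs."""
    delta = torch.empty_like(x).uniform_(-eps, eps)
    delta = (x + delta).clamp(0, 1) - x            # stay in the valid image range
    for _ in range(iters):
        delta.requires_grad_(True)
        loss = F.cross_entropy(model(x + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        delta = (delta + step * grad.sign()).clamp(-eps, eps)
        delta = ((x + delta).clamp(0, 1) - x).detach()
    return x + delta

def epsilon_sweep(model, x, y, eps_grid, iters=40):
    """Robust accuracy at each epsilon, plus the first tested epsilon that
    flips each example (a coarse upper bound on its minimum perturbation)."""
    accs = []
    min_eps = torch.full((len(x),), float("inf"), device=x.device)
    for eps in eps_grid:
        x_adv = pgd_linf(model, x, y, eps, step=eps / 4, iters=iters)
        with torch.no_grad():
            wrong = model(x_adv).argmax(dim=1) != y
        accs.append(1.0 - wrong.float().mean().item())
        min_eps = torch.where(wrong & min_eps.isinf(),
                              torch.full_like(min_eps, eps), min_eps)
    return accs, min_eps

# Usage (hypothetical): plot accs against eps_grid, and average the finite
# entries of min_eps for a rough minimum-perturbation estimate.
# accs, min_eps = epsilon_sweep(model, x, y, [2/255, 4/255, 8/255, 16/255])
```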
Recognizing Evaluation Pitfalls
Be wary of results that seem too good to be true. A common issue is obfuscated gradients, where a defense mechanism makes it harder for specific gradient-based attacks to find adversarial examples, rather than truly making the model more robust. Signs of obfuscated gradients include:
- High robustness against strong gradient-based attacks (like PGD) but vulnerability to simpler gradient-free methods (like SPSA) or score/decision-based attacks.
- High robustness against attacks with few iterations, but accuracy plummeting when the number of attack steps is significantly increased.
- Strong performance against white-box attacks but poor performance against black-box transfer attacks. Because a white-box attacker has strictly more information, this inversion suggests the white-box attack is failing to exploit the model's gradients rather than the model being genuinely robust.
Always perform sanity checks using a diverse set of attack algorithms, including adaptive attacks if evaluating a novel defense. Relying solely on standard PGD or FGSM against a new defense is insufficient.
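The sanity checks below turn these warning signs into a small, self-contained helper. It assumes the relevant robust accuracies have already been measured on the defended model; the function name, argument names, and the 5% threshold are illustrative choices, not established values.

```python
# A sketch of common gradient-masking sanity checks, operating on robust
# accuracies that have already been measured (lower accuracy = stronger attack).
def gradient_masking_warnings(acc_fgsm, acc_pgd, acc_pgd_unbounded,
                              acc_random_noise, acc_whitebox, acc_transfer):
    """Return warnings for patterns that commonly indicate obfuscated gradients.

    acc_fgsm / acc_pgd / acc_random_noise are measured at the same epsilon;
    acc_whitebox / acc_transfer use the same perturbation budget.
    """
    warnings = []
    if acc_fgsm < acc_pgd:
        warnings.append("Single-step FGSM beats multi-step PGD: the iterative "
                        "attack is probably not converging.")
    if acc_pgd_unbounded > 0.05:   # illustrative threshold
        warnings.append("Accuracy stays high even with an unbounded budget: "
                        "a working attack should drive it toward 0%.")
    if acc_random_noise < acc_pgd:
        warnings.append("Random noise of the same magnitude beats PGD: "
                        "gradients are likely masked.")
    if acc_transfer < acc_whitebox:
        warnings.append("Black-box transfer attacks beat white-box attacks: "
                        "the white-box attack is being blocked, not the adversary.")
    return warnings
```

None of these checks proves robustness when they pass; they only catch common failure modes of the evaluation itself.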
Reporting and Comparison Challenges
Comparing robustness results across different studies can be challenging due to variations in:
- Datasets (MNIST, CIFAR-10, ImageNet have different intrinsic difficulties).
- Model architectures.
- Attack implementations and hyperparameters within evaluation frameworks (e.g., ART, CleverHans, Foolbox).
- Computational budgets for attacks.
When reporting your own results, be explicit about all these details to allow for fair comparisons. When interpreting others' results, look for this information. Resources like RobustBench aim to standardize evaluations on specific benchmarks, providing a more reliable basis for comparison, but even these have limitations regarding the scope of attacks considered.
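One lightweight habit that helps is recording the evaluation environment and attack budget next to the numbers themselves. The sketch below is a minimal, hypothetical example; the specific fields and values are placeholders to adapt, not a standard schema.

```python
# A minimal sketch of an evaluation report; all values are placeholders.
import json
import platform
import sys

report = {
    "dataset": "CIFAR-10",                    # datasets differ in difficulty
    "architecture": "WideResNet-28-10",       # hypothetical example
    "evaluation_library": "Foolbox (record the exact version used)",
    "attack_budget": {"iterations": 100, "random_restarts": 5,
                      "wall_clock_seconds_per_batch": None},  # fill in if known
    "environment": {"python": sys.version.split()[0],
                    "platform": platform.platform()},
}
print(json.dumps(report, indent=2))
```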
Ultimately, interpreting robustness evaluation results is not about finding a single number but about building a comprehensive understanding of a model's security posture under specific, well-defined assumptions. It involves analyzing trends, considering trade-offs, comparing against appropriate baselines, and being vigilant for evaluation errors like gradient obfuscation. This nuanced understanding is essential for making informed decisions about deploying machine learning models in security-sensitive applications.