Implementing a defense is not sufficient to ensure model security; we also need ways to measure how secure a model actually is. A primary part of this measurement process is tailoring security evaluations to specific, well-defined threat models. A defense mechanism might appear effective against one type of attacker yet fail completely against another with different knowledge or capabilities. Understanding how to structure evaluations under various security scenarios is therefore essential for gaining a realistic view of model security.
Recall from Chapter 1 that a threat model defines the assumed characteristics of a potential attacker, typically focusing on their goals, knowledge of the target system, and capabilities to interact with it or manipulate its inputs. When evaluating model robustness, we must simulate attackers consistent with these assumptions.
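One way to make these assumptions explicit in an evaluation harness is to encode them as a small configuration object that the rest of the evaluation code iterates over. The sketch below is illustrative only; the class and field names (`ThreatModel`, `goal`, `knowledge`, `epsilon`, `query_budget`) are assumptions made for this example, not part of any standard API.

```python
from dataclasses import dataclass
from typing import Literal, Optional

@dataclass(frozen=True)
class ThreatModel:
    """Illustrative threat-model description; field names are assumptions of this sketch."""
    goal: Literal["untargeted", "targeted"]                              # what the attacker wants
    knowledge: Literal["white-box", "score-based", "decision-based"]     # access to internals/outputs
    norm: str = "linf"                                                   # how perturbation size is measured
    epsilon: float = 8 / 255                                             # maximum allowed perturbation
    query_budget: Optional[int] = None                                   # None = effectively unlimited

# Example attacker profiles an evaluation might loop over
white_box_attacker = ThreatModel(goal="untargeted", knowledge="white-box")
api_attacker = ThreatModel(goal="untargeted", knowledge="decision-based", query_budget=10_000)
```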
The attacker's knowledge about the target model significantly influences the types of attacks they can mount and, consequently, how we should evaluate defenses against them.
"* Black-Box Evaluations: Here, the attacker has minimal knowledge of the model internals. They can typically only query the model (e.g., through an API) and observe its outputs. This often mirrors deployment scenarios. Black-box settings can be further refined:" * Score-Based: The attacker receives confidence scores or probabilities along with the predicted label. * Decision-Based: The attacker only receives the final predicted label. * Evaluation Strategy: Employ attacks that rely solely on model queries. * Transfer Attacks: Train a local substitute model and craft attacks against it, hoping they transfer to the target model. This requires a reasonable number of queries to build the substitute. * Score-Based Attacks: Use techniques that estimate gradients or search directions based on confidence score changes (e.g., NES, SPSA). * Decision-Based Attacks: Utilize algorithms that explore the decision boundary with minimal information, like the Boundary Attack. These often require more queries. * Purpose: Black-box evaluations assess security in deployment contexts. They test resilience against attackers who lack inside information. However, interpreting results requires care. A model appearing robust in a setting might still be vulnerable if the evaluation attack was not query-efficient enough or if transfer attacks were not adequately explored.
Information access levels for attackers in different evaluation settings. White-box attackers have full internal access, while black-box attackers rely on query outputs (scores or just decisions).
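To make the score-based setting more concrete, the following sketch estimates a gradient direction purely from confidence-score queries using NES-style two-sided (antithetic) finite differences. The `predict_proba` callable stands in for a query-only API and is an assumption of this example, not part of any library.

```python
import numpy as np

def nes_gradient_estimate(predict_proba, x, true_label, sigma=0.001, n_samples=50, rng=None):
    """Estimate the gradient of the true-class probability w.r.t. x using only score queries.

    predict_proba: callable mapping an input array to a vector of class probabilities
                   (a stand-in for a query-only API; assumed for this sketch).
    """
    if rng is None:
        rng = np.random.default_rng(0)
    grad = np.zeros_like(x, dtype=np.float64)
    for _ in range(n_samples):
        u = rng.standard_normal(size=x.shape)                 # random probe direction
        p_plus = predict_proba(x + sigma * u)[true_label]     # one query
        p_minus = predict_proba(x - sigma * u)[true_label]    # one query
        grad += (p_plus - p_minus) * u                        # antithetic finite-difference term
    return grad / (2 * sigma * n_samples)

# An attacker would step against this estimated gradient (to lower the true-class score)
# and project back into the allowed perturbation set after each step.
```

Each sample costs two queries, so the total query count is 2 × n_samples per gradient estimate, which is exactly the kind of budget a query limit constrains.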
Beyond knowledge, the attacker's capabilities constrain what they can actually do, and evaluations must respect these limits. Common constraints include the following (see the sketch after this list):

* Perturbation Budget: Attacks are typically restricted to a norm-bounded perturbation (e.g., an L∞ ball of radius ϵ), and robust accuracy should be reported across a range of budgets rather than at a single value.
* Query Limits: For black-box attacks, the number of queries an attacker can make might be limited by cost, time, or API rate limits.
* Semantic or Domain Constraints: Attacks might need to respect domain-specific rules. Adversarial text examples should remain grammatically plausible; adversarial patches must be printable and adaptable to environmental changes.

Accuracy of a standard model versus an adversarially trained model under PGD attacks with increasing L∞ perturbation budgets (ϵ). The adversarially trained model maintains higher accuracy as the allowed perturbation grows.
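As one way of wiring such constraints into an evaluation, the sketch below runs a deliberately naive random-search attack that respects both an L∞ budget and a query limit, then reports robust accuracy. It is a placeholder for a real attack, not a recommended method; `predict_label` is an assumed label-only query interface.

```python
import numpy as np

def random_search_attack(predict_label, x, true_label, eps=8 / 255, query_limit=1000, rng=None):
    """Try to flip the label inside the L-inf ball of radius eps using at most
    `query_limit` label-only queries. Returns True if the (naive) attack succeeds.
    `predict_label` is an assumed query-only API for this sketch."""
    if rng is None:
        rng = np.random.default_rng(0)
    if predict_label(x) != true_label:                 # 1 query: already misclassified
        return True
    queries = 1
    while queries < query_limit:
        delta = rng.uniform(-eps, eps, size=x.shape)   # stay inside the L-inf budget
        x_adv = np.clip(x + delta, 0.0, 1.0)           # respect the valid input range
        if predict_label(x_adv) != true_label:
            return True
        queries += 1
    return False                                       # budget exhausted, model held up

def robust_accuracy(predict_label, xs, ys, **attack_kwargs):
    """Fraction of examples the constrained attacker fails to flip."""
    successes = sum(random_search_attack(predict_label, x, y, **attack_kwargs) for x, y in zip(xs, ys))
    return 1.0 - successes / len(xs)
```

Sweeping `eps` and `query_limit` over the values allowed by each threat model produces the kind of robustness-versus-budget curve shown in the figure above.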
A comprehensive security evaluation doesn't rely on a single threat model. Instead, it builds a profile of the model's security across a range of relevant scenarios.
Evaluating under different threat models prevents a false sense of security. A model that holds up against weak black-box attacks might crumble under white-box scrutiny. Conversely, perfect white-box security might be overkill if attackers realistically only have limited black-box access. By systematically testing against plausible attacker profiles defined by threat models, we can gain a much more reliable and actionable understanding of our ML systems' security posture. Frameworks like ART and CleverHans provide tools to configure and execute attacks under many of these varied assumptions.
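As a rough illustration of how such a framework can be used, the sketch below configures one white-box and one decision-based black-box attack with ART's evasion attack classes. Treat it as a sketch under assumptions rather than a drop-in recipe: argument names can differ between ART versions, and `model`, `loss_fn`, and `x_test` are assumed to be defined elsewhere.

```python
import numpy as np
from art.estimators.classification import PyTorchClassifier
from art.attacks.evasion import ProjectedGradientDescent, HopSkipJump

# Wrap an existing PyTorch model (model, loss_fn, x_test assumed to be defined).
classifier = PyTorchClassifier(
    model=model,
    loss=loss_fn,
    input_shape=(3, 32, 32),
    nb_classes=10,
    clip_values=(0.0, 1.0),
)

# White-box threat model: gradient access, L-inf budget of 8/255.
pgd = ProjectedGradientDescent(classifier, norm=np.inf, eps=8 / 255, eps_step=2 / 255, max_iter=40)
x_adv_whitebox = pgd.generate(x=x_test)

# Decision-based black-box threat model: label-only queries, limited query budget.
hsj = HopSkipJump(classifier=classifier, norm=2, max_iter=10, max_eval=1000)
x_adv_blackbox = hsj.generate(x=x_test)
```

Comparing the model's accuracy on `x_adv_whitebox` and `x_adv_blackbox` against its clean accuracy gives one row of the security profile described above, with each attack configuration corresponding to a different threat model.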