As we discussed earlier in this chapter, simply implementing a defense isn't sufficient. We need robust methods to measure how secure our models truly are. A critical aspect of this measurement process involves tailoring our security evaluations to specific, well-defined threat models. A defense mechanism might appear effective against one type of attacker but fail completely against another with different knowledge or capabilities. Therefore, understanding how to structure evaluations under various threat scenarios is essential for gaining a realistic perspective on model security.
Recall from Chapter 1 that a threat model defines the assumed characteristics of a potential attacker, typically focusing on their goals, knowledge of the target system, and capabilities to interact with it or manipulate its inputs. When evaluating model robustness, we must simulate attackers consistent with these assumptions.
Evaluating Based on Attacker Knowledge
The attacker's knowledge about the target model significantly influences the types of attacks they can mount and, consequently, how we should evaluate defenses against them.
- White-Box Evaluations: This scenario assumes the attacker has complete knowledge of the model, including its architecture, parameters (weights and biases), and potentially the training data or process. This is often considered the worst-case scenario for the defender regarding information leakage.
- Evaluation Strategy: Use attacks that directly leverage model gradients and internal structure. Powerful gradient-based attacks like Projected Gradient Descent (PGD) or optimization-based attacks like Carlini & Wagner (C&W) are standard choices; a minimal PGD evaluation loop is sketched below.
- Purpose: White-box evaluations test the intrinsic robustness of the model and defense mechanism against the strongest possible attacks within given perturbation constraints. They are vital for research and development to understand fundamental vulnerabilities. Even if a real-world attacker might not have full white-box access, robustness in this setting provides a strong baseline guarantee. If a model isn't secure under white-box attacks, it's unlikely to be secure under weaker assumptions.
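To make the white-box setting concrete, here is a small sketch of an L∞ PGD evaluation loop in PyTorch. The helper names (`pgd_linf`, `robust_accuracy`), the ϵ of 8/255, the step size, and the assumption of inputs scaled to [0, 1] are illustrative choices, not fixed recommendations; frameworks such as ART provide more configurable implementations of the same idea.

```python
import torch
import torch.nn.functional as F

def pgd_linf(model, x, y, eps=8/255, step=2/255, iters=20):
    """Minimal untargeted PGD under an L-infinity budget (white-box).
    Assumes `model` returns logits and inputs are scaled to [0, 1]."""
    x_adv = x.clone().detach()
    # A random start inside the epsilon ball generally strengthens the attack.
    x_adv = (x_adv + torch.empty_like(x_adv).uniform_(-eps, eps)).clamp(0.0, 1.0)

    for _ in range(iters):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            # Ascend the loss, then project back onto the epsilon ball.
            x_adv = x_adv + step * grad.sign()
            x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0.0, 1.0)
    return x_adv.detach()

def robust_accuracy(model, loader, eps=8/255):
    """Accuracy on PGD adversarial examples: the core white-box metric."""
    model.eval()
    correct, total = 0, 0
    for x, y in loader:
        x_adv = pgd_linf(model, x, y, eps=eps)
        with torch.no_grad():
            correct += (model(x_adv).argmax(dim=1) == y).sum().item()
        total += y.size(0)
    return correct / total
```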
- Black-Box Evaluations: Here, the attacker has minimal knowledge of the model internals. They can typically only query the model (e.g., through an API) and observe its outputs. This often mirrors real-world deployment scenarios. Black-box settings can be further refined:
- Score-Based: The attacker receives confidence scores or probabilities along with the predicted label.
- Decision-Based: The attacker only receives the final predicted label.
- Evaluation Strategy: Employ attacks that rely solely on model queries; a score-based example is sketched after the figure below.
- Transfer Attacks: Train a local substitute model and craft attacks against it, hoping they transfer to the target model. Building a useful substitute typically requires a substantial number of queries.
- Score-Based Attacks: Use techniques that estimate gradients or search directions based on confidence score changes (e.g., NES, SPSA).
- Decision-Based Attacks: Utilize algorithms that explore the decision boundary with minimal information, like the Boundary Attack. These often require more queries.
- Purpose: Black-box evaluations assess security in more realistic deployment contexts. They test resilience against attackers who lack inside information. However, interpreting results requires care. A model appearing robust in a black-box setting might still be vulnerable if the evaluation attack was not query-efficient enough or if transfer attacks were not adequately explored.
Information access levels for attackers in different evaluation settings. White-box attackers have full internal access, while black-box attackers rely on query outputs (scores or just decisions).
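As a concrete illustration of the score-based setting, the sketch below estimates gradients from confidence scores alone, in the spirit of NES, and then takes PGD-style steps with the estimate. Here `score_fn` is a stand-in for an API call returning class probabilities for a single input (batch dimension of 1); the sampling and attack parameters are assumptions for illustration. In the decision-based setting, where only labels are returned, algorithms such as the Boundary Attack or HopSkipJump would be used instead.

```python
import torch

def nes_grad(score_fn, x, label, sigma=1e-3, n_samples=50):
    """Estimate the input gradient of the loss from confidence scores only
    (NES-style antithetic sampling). Each call costs 2 * n_samples queries."""
    grad = torch.zeros_like(x)
    for _ in range(n_samples):
        u = torch.randn_like(x)
        # Loss signal: negative log-probability of the true class.
        loss_plus = -torch.log(score_fn(x + sigma * u)[0, label] + 1e-12)
        loss_minus = -torch.log(score_fn(x - sigma * u)[0, label] + 1e-12)
        grad += (loss_plus - loss_minus) * u
    return grad / (2 * sigma * n_samples)

def score_based_attack(score_fn, x, label, eps=8/255, step=2/255, iters=30):
    """PGD-style L-infinity attack driven entirely by estimated gradients."""
    x_adv = x.clone()
    for _ in range(iters):
        g = nes_grad(score_fn, x_adv, label)
        x_adv = x_adv + step * g.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0.0, 1.0)
        # Stop as soon as the prediction flips to conserve queries.
        if score_fn(x_adv).argmax(dim=1).item() != label:
            break
    return x_adv
```

Note that the model is only ever queried for its output probabilities; no parameters or internal gradients are touched, which is exactly what the score-based threat model allows.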
- Gray-Box Evaluations: This represents intermediate scenarios where the attacker might know parts of the system, such as the model architecture or the types of defenses employed, but not the specific parameters. Evaluation strategies often combine elements from white-box and black-box approaches, potentially using known architectural information to guide black-box query strategies or substitute model training.
Evaluating Based on Attacker Capabilities
Beyond knowledge, the attacker's capabilities constrain their actions. Evaluations must respect these limits.
- Perturbation Constraints: Most evasion attacks involve perturbing inputs. How much perturbation is allowed? This is typically measured using Lp norms:
- L∞: Maximum change to any single feature (pixel, word embedding dimension). Common for image attacks.
- L2: Total Euclidean magnitude of the change. Also common for images and continuous data.
- L0: Number of features changed. Relevant for sparse attacks or text where changing few words is desired.
- Evaluation Strategy: Run attacks with specific perturbation budgets (denoted by ϵ). Report accuracy under attack for various (ϵ, Lp) combinations. For example, "The model achieves 75% accuracy against PGD L∞ attacks with ϵ=0.03" or "The minimum L2 distortion needed to fool the model on average is 1.5". Visualizing the accuracy drop as ϵ increases is standard practice; a small budget sweep is sketched after the figure below.
Accuracy of a standard model versus an adversarially trained model under PGD attacks with increasing L∞ perturbation budgets (ϵ). The robust model maintains higher accuracy as the allowed perturbation grows.
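A curve like this can be produced by re-running the same attack at several budgets. The snippet below reuses the `robust_accuracy` helper sketched in the white-box discussion; `model`, `test_loader`, and the grid of ϵ values are assumed for illustration.

```python
# Sweep L-infinity budgets and record accuracy under attack; plotting
# these (eps, accuracy) pairs yields a curve like the one described above.
eps_grid = [0.0, 1/255, 2/255, 4/255, 8/255, 16/255]
results = {eps: robust_accuracy(model, test_loader, eps=eps) for eps in eps_grid}

for eps, acc in results.items():
    print(f"L-inf eps = {eps:.4f} -> accuracy under PGD: {acc:.1%}")
```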
- Query Limits: For black-box attacks, the number of queries an attacker can make might be limited by cost, time, or API rate limits.
- Evaluation Strategy: Measure attack success rate or model accuracy as a function of the number of queries allowed. A defense might be considered effective if it forces the attacker to use an impractically large number of queries; a simple way to instrument query counting is sketched below.
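One lightweight way to account for query budgets is to wrap the model's scoring function in a counter and record how many queries the attack spends before it succeeds. The `QueryCounter` wrapper and `queries_to_success` helper below are illustrative and reuse the hypothetical `score_based_attack` sketched earlier.

```python
class QueryCounter:
    """Wrap a score function so every query to the target model is counted."""
    def __init__(self, score_fn):
        self.score_fn = score_fn
        self.count = 0

    def __call__(self, x):
        self.count += x.shape[0]  # one query per input in the batch
        return self.score_fn(x)

def queries_to_success(score_fn, x, label):
    """Run the score-based attack and report whether it succeeded and
    roughly how many queries it spent doing so."""
    counted = QueryCounter(score_fn)
    x_adv = score_based_attack(counted, x, label)
    success = score_fn(x_adv).argmax(dim=1).item() != label
    return success, counted.count
```

Aggregating these (success, query count) pairs over a test set gives the success-rate-versus-budget curve the evaluation calls for.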
- Semantic or Domain Constraints: Attacks might need to respect domain-specific rules. Adversarial text examples should remain grammatically plausible; adversarial patches in the physical world must be printable and robust to environmental changes.
- Evaluation Strategy: Incorporate these constraints into the attack generation process during evaluation. This often requires specialized attack algorithms (covered partly in Chapter 7).
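For text, one simple way to bake such constraints into an evaluation is to filter candidate perturbations with an explicit check. The function below is a deliberately minimal illustration (same word count, limited fraction of words changed); realistic evaluations would add grammar and semantic-similarity checks on top.

```python
def satisfies_text_constraints(original: str, candidate: str,
                               max_change_ratio: float = 0.15) -> bool:
    """Accept an adversarial text candidate only if it keeps the same number
    of words and changes at most a small fraction of them."""
    orig_words = original.split()
    cand_words = candidate.split()
    if not orig_words or len(orig_words) != len(cand_words):
        return False
    changed = sum(o != c for o, c in zip(orig_words, cand_words))
    return changed / len(orig_words) <= max_change_ratio
```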
Synthesizing Evaluations Across Threat Models
A comprehensive security evaluation doesn't rely on a single threat model. Instead, it builds a profile of the model's security across a range of relevant scenarios.
- Start with Strong White-Box Attacks: Use standard, powerful attacks like PGD (L∞ and L2) with commonly accepted ϵ values (e.g., 8/255 for L∞ on CIFAR-10 images) as a baseline. This establishes the model's fundamental robustness.
- Consider Realistic Black-Box Scenarios: If the model is deployed behind an API, evaluate against query-based attacks (score or decision-based, depending on the API). Consider transfer attacks if attackers might train substitute models. Pay attention to query budgets.
- Use Adaptive Attacks: Remember from the previous section that attacks should be adapted to the specific defenses employed. This principle applies within each threat model considered. Don't just run off-the-shelf PGD; ensure the attack parameters (step size, number of iterations) are appropriate and that the attack isn't circumvented by mechanisms like gradient masking.
- Document Assumptions Clearly: When reporting results, always state the threat model(s) under which the evaluation was performed: attacker knowledge (white/black/gray), capabilities (Lp norms, ϵ values, query limits), and the specific attack algorithm used.
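A lightweight way to enforce this is to record the threat model alongside every reported number. The dictionary below is one possible structure with placeholder values, not a standard format.

```python
# Illustrative record of the threat model behind a reported robustness number.
evaluation_report = {
    "attacker_knowledge": "white-box",   # white-box / gray-box / black-box
    "attack": "PGD",
    "norm": "Linf",
    "epsilon": 8 / 255,
    "step_size": 2 / 255,
    "iterations": 20,
    "random_restarts": 1,
    "query_budget": None,                # not applicable for white-box
    "clean_accuracy": 0.94,              # placeholder values
    "robust_accuracy": 0.51,
}
```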
Evaluating under different threat models prevents a false sense of security. A model robust against weak black-box attacks might crumble under white-box scrutiny. Conversely, perfect white-box security might be overkill if attackers realistically only have limited black-box access. By systematically testing against plausible attacker profiles defined by threat models, we can gain a much more reliable and actionable understanding of our ML systems' security posture. Frameworks like ART and CleverHans provide tools to configure and execute attacks under many of these varied assumptions.