To effectively secure machine learning systems, we must first understand the potential adversaries we face. Simply knowing that vulnerabilities exist isn't enough; we need a structured way to think about attackers, their motivations, and their means. This is where threat modeling comes in. A threat model in the context of machine learning provides a formal framework for characterizing potential attacks by defining the attacker's goals, knowledge, and capabilities. Systematically defining these components helps us anticipate attack vectors and design appropriate defenses.
Let's break down the essential components of an ML threat model:
Attacker Goals
What does the attacker want to achieve? The objective dictates the type of attack they might employ. Common goals include:
- Compromising Integrity: Causing the model to produce incorrect outputs. This is the most commonly studied goal, often manifesting as:
  - Evasion: Forcing a misclassification at inference time for specific inputs (e.g., classifying a malicious file as benign). This can be targeted (forcing a specific incorrect output class) or untargeted (forcing any incorrect output class).
  - Poisoning/Backdoor: Manipulating the training data or training process so the final model exhibits attacker-chosen behavior (e.g., misclassifying specific trigger inputs or performing poorly overall).
- Compromising Availability: Degrading the model's performance or rendering it unusable, either for all inputs or a subset. Availability attacks often overlap with integrity attacks, especially indiscriminate poisoning attacks that aim to reduce overall accuracy.
- Compromising Confidentiality/Privacy: Extracting sensitive information about the model or its training data. This includes:
  - Membership Inference: Determining whether a specific data point was part of the model's training set (a minimal sketch appears at the end of this subsection).
  - Attribute Inference: Inferring sensitive attributes of training data records.
  - Model Stealing/Extraction: Replicating the functionality or extracting the parameters of a proprietary model.
  - Data Reconstruction: Recovering samples similar to those used in training.
Understanding the attacker's goal is paramount because defenses effective against one type of goal (e.g., evasion) might be ineffective against another (e.g., model stealing).
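To make the confidentiality goals above more concrete, here is a minimal sketch of the simplest form of membership inference: a loss-threshold test that flags a record as a likely training-set member if the model's loss on it is unusually low. The `predict_proba` callable and the threshold are placeholders for illustration, not a prescribed attack.

```python
import numpy as np

def cross_entropy(probs: np.ndarray, label: int) -> float:
    """Per-example cross-entropy loss computed from predicted class probabilities."""
    return -float(np.log(probs[label] + 1e-12))

def loss_threshold_membership(predict_proba, x, y, threshold: float) -> bool:
    """Guess 'member' if the model's loss on (x, y) falls below a threshold.

    predict_proba: callable returning a probability vector for input x
                   (stands in for query access to the target model).
    threshold:     would normally be calibrated on data known to be outside
                   the training set; treated as given here.
    """
    loss = cross_entropy(predict_proba(x), y)
    return loss < threshold
```

The point of the sketch is only that query access plus a ground-truth label can already leak membership information; practical attacks refine how the threshold (or a learned attack model) is calibrated.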
Attacker Knowledge
How much does the attacker know about the target system? This significantly impacts the types of attacks they can mount. We typically categorize knowledge levels as:
- White-Box Access: The attacker has complete knowledge of the model, including its architecture, parameters (weights and biases), training data characteristics, and potentially the training algorithm itself. This represents the worst-case scenario from a defender's perspective and enables powerful gradient-based and optimization-based attacks; computing gradients of the loss with respect to the input, for example, requires access to the model's internals (see the sketch at the end of this subsection).
- Black-Box Access: The attacker has minimal knowledge of the model. They can typically only interact with it through an API, providing inputs and observing outputs (e.g., prediction labels or confidence scores), without knowing the architecture or parameters. Attacks in this setting often rely on querying the model repeatedly (score-based or decision-based attacks) or on exploiting the transferability of adversarial examples crafted against substitute models.
- Gray-Box Access: This represents a middle ground where the attacker has partial knowledge. They might know the model architecture but not the specific weights, or perhaps they have access to a similar model or a subset of the training data.
The assumed level of attacker knowledge is a critical factor when evaluating the robustness of a defense. A defense that appears strong against black-box attacks might easily be circumvented by a white-box adversary.
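As a concrete illustration of why white-box access matters, the sketch below computes a one-step, gradient-sign perturbation (in the style of the fast gradient sign method) against a differentiable PyTorch classifier. The model, input range, and ε value are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(model: torch.nn.Module, x: torch.Tensor, y: torch.Tensor,
                 epsilon: float = 0.03) -> torch.Tensor:
    """One-step L-infinity evasion sketch; requires white-box gradient access.

    model:   a differentiable classifier (the white-box assumption).
    x, y:    input batch and true labels.
    epsilon: L-infinity perturbation budget.
    """
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)   # untargeted: increase true-class loss
    loss.backward()
    # Step in the direction that increases the loss, bounded coordinate-wise by epsilon.
    perturbed = x_adv + epsilon * x_adv.grad.sign()
    return perturbed.clamp(0.0, 1.0).detach()  # assume inputs live in [0, 1]
```

A targeted variant would instead decrease the loss with respect to a chosen target label; a black-box attacker, lacking access to the gradient, would have to estimate this direction through repeated queries or a substitute model.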
Attacker Capabilities
What actions can the attacker perform on the system? This defines the attack surface. Key capabilities include:
- Control over Training Data: Can the attacker inject or modify data used to train the model? This enables poisoning and backdoor attacks. The extent of control might vary (e.g., injecting a limited number of points, modifying existing points). This capability is relevant during the training phase.
- Control over Input Data: Can the attacker modify the inputs fed to the model during inference? This enables evasion attacks. Constraints on these modifications are significant, often modeled using Lp norms (e.g., L0, L2, L∞) to limit the perceptual difference or magnitude of the perturbation (see the projection sketch after this list). This capability is relevant during the inference phase.
- Query Access: Can the attacker query the deployed model? If so, are the queries limited? Can they observe prediction labels, confidence scores, or other outputs? This capability is essential for black-box attacks, including evasion, membership inference, and model stealing.
- Computational Resources: Does the attacker have significant computational power to run complex optimization algorithms or train substitute models?
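The perturbation constraints mentioned under "Control over Input Data" are easy to state in code: given a candidate perturbation, the attacker (or an evaluation harness) projects it back onto an ε-ball in the chosen norm. The norms and the way ε is supplied below are illustrative, not tied to any particular attack implementation.

```python
import numpy as np

def project_linf(delta: np.ndarray, eps: float) -> np.ndarray:
    """Clip each coordinate so that max |delta_i| <= eps (L-infinity ball)."""
    return np.clip(delta, -eps, eps)

def project_l2(delta: np.ndarray, eps: float) -> np.ndarray:
    """Rescale delta so that its Euclidean norm is at most eps (L2 ball)."""
    norm = np.linalg.norm(delta)
    return delta if norm <= eps else delta * (eps / norm)
```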
Synthesizing the Threat Model
A complete threat model combines these elements. For example, a common threat model for evasion attacks assumes:
- Goal: Untargeted or targeted misclassification (Integrity).
- Knowledge: White-box (access to gradients) or Black-box (query access only).
- Capability: Modify input data at inference time, subject to an L∞ perturbation budget ϵ.
Figure: Core components defining an attacker in machine learning threat models.
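One lightweight way to keep these ground rules explicit in experiments is to record them in a small configuration object. The field names and example values below are an illustrative convention under assumed defaults, not a standard API.

```python
from dataclasses import dataclass
from typing import Literal, Optional

@dataclass(frozen=True)
class ThreatModel:
    """Explicit record of the assumptions behind an attack or defense evaluation."""
    goal: Literal["integrity", "availability", "confidentiality"]
    targeted: bool                        # only meaningful for integrity goals
    knowledge: Literal["white-box", "gray-box", "black-box"]
    capability: Literal["training-data", "input-data", "query-access"]
    norm: Optional[str] = None            # e.g. "linf" or "l2" for input perturbations
    epsilon: Optional[float] = None       # perturbation budget, if applicable
    query_budget: Optional[int] = None    # for black-box settings

# The evasion threat model listed above, written out explicitly
# (the epsilon value is an arbitrary illustrative choice):
evasion_whitebox = ThreatModel(
    goal="integrity", targeted=False, knowledge="white-box",
    capability="input-data", norm="linf", epsilon=8 / 255,
)
```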
By explicitly defining the threat model, we establish the ground rules for evaluating both attacks and defenses. An attack is only meaningful relative to a specific threat model, and a defense is only effective if it withstands attacks under a realistic and challenging threat model. As we proceed through this course, we will constantly refer back to these components – Goals, Knowledge, and Capabilities – to understand the context and assumptions behind different adversarial techniques and security measures. This structured approach is fundamental to building more secure and trustworthy machine learning systems.