Understanding the security posture of a machine learning system requires analyzing when and how an attacker can interact with it. The lifecycle of an ML model is typically divided into two main phases: training and inference. Each phase presents distinct opportunities and constraints for an adversary, defining different attack surfaces.
Training Phase Attack Surface
The training phase is where the model learns patterns and relationships from a dataset. The primary goal is to optimize model parameters to minimize a loss function over the training data. Attacks during this phase aim to compromise the learning process itself, ultimately affecting the integrity or availability of the final trained model.
Attacker Goals During Training:
- Model Degradation (Availability Attack): Reduce the model's overall performance on legitimate inputs.
- Targeted Misbehavior (Integrity Attack): Cause the model to misclassify specific inputs or exhibit biased behavior.
- Backdoor Insertion: Embed hidden malicious functionality within the model, activated by specific triggers presented during inference.
Attacker Capabilities:
The attacker's ability to influence the training process dictates the feasibility of these attacks. Common scenarios include:
- Data Manipulation: The most studied vector involves modifying the training dataset. This could mean injecting malicious samples (data poisoning) or subtly altering existing ones.
- Algorithm Manipulation: If the attacker has control over the learning algorithm (e.g., influencing hyperparameters, loss function, or optimization procedure), they can directly steer the model towards a compromised state. This is less common in typical deployment scenarios but possible in federated learning or MLaaS platforms if improperly secured.
- Infrastructure Control: Compromising the underlying training infrastructure (hardware, software environment) grants extensive control but falls more into traditional cybersecurity domains.
Examples of Training-Phase Attacks:
- Data Poisoning: Involves injecting carefully crafted malicious data points into the training set. These points can be designed to shift decision boundaries, reduce overall accuracy, or create targeted vulnerabilities. For instance, an attacker might add mislabeled images to degrade a classifier or introduce subtle artifacts designed to cause misclassification of specific inputs later on (a minimal label-flipping sketch follows this list).
- Backdoor (Trojan) Attacks: A more insidious form of poisoning in which the injected data trains the model to respond normally to typical inputs but behave maliciously (e.g., misclassify) whenever a specific, attacker-defined trigger pattern is present in the input. The trigger could be a small patch in an image, a specific phrase in text, and so on. The model appears functional during standard validation but contains a hidden vulnerability (a trigger-injection sketch also follows this list).
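To make the data-poisoning idea concrete, the following minimal sketch flips the labels of a fraction of training points for a scikit-learn classifier on synthetic data and compares clean-test accuracy with and without the poisoned labels. The dataset, the logistic regression model, and the 20% poisoning rate are illustrative assumptions, not a prescription.

```python
# Minimal label-flipping poisoning sketch (illustrative assumptions:
# synthetic data, logistic regression, 20% of training labels flipped).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Clean baseline.
clean_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Attacker flips the labels of a random 20% subset of the training data.
y_poisoned = y_train.copy()
flip_idx = rng.choice(len(y_train), size=int(0.2 * len(y_train)), replace=False)
y_poisoned[flip_idx] = 1 - y_poisoned[flip_idx]
poisoned_model = LogisticRegression(max_iter=1000).fit(X_train, y_poisoned)

print("clean accuracy:   ", clean_model.score(X_test, y_test))
print("poisoned accuracy:", poisoned_model.score(X_test, y_test))
```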
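A backdoor attack can be sketched in the same spirit: a small trigger pattern is stamped onto a fraction of training samples, which are then relabeled to an attacker-chosen target class. The 3×3 corner patch, the 5% poisoning rate, and the target class below are illustrative assumptions.

```python
# Backdoor (trojan) poisoning sketch: stamp a small trigger patch onto a
# fraction of training images and relabel them to an attacker-chosen class.
import numpy as np

def add_trigger(images: np.ndarray) -> np.ndarray:
    """Set a 3x3 patch in the bottom-right corner to the maximum pixel value."""
    images = images.copy()
    images[:, -3:, -3:] = 1.0
    return images

def poison_dataset(X: np.ndarray, y: np.ndarray, rate: float = 0.05,
                   target_class: int = 7, seed: int = 0):
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=int(rate * len(X)), replace=False)
    X_poisoned, y_poisoned = X.copy(), y.copy()
    X_poisoned[idx] = add_trigger(X[idx])   # insert the trigger
    y_poisoned[idx] = target_class          # relabel triggered samples
    return X_poisoned, y_poisoned

# Example with random stand-in "images" (N x 28 x 28, values in [0, 1]).
X = np.random.default_rng(1).random((1000, 28, 28))
y = np.random.default_rng(2).integers(0, 10, size=1000)
X_p, y_p = poison_dataset(X, y)
print("relabeled samples:", int((y != y_p).sum()))
```

A model trained on (X_p, y_p) would behave normally on clean inputs but would be trained to predict the target class whenever the patch is present.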
Attacks during training fundamentally alter the learned parameters θ of the model fθ(x). The resulting model is intrinsically flawed, even before it encounters any potentially malicious inputs during inference.
Figure: The machine learning lifecycle involves distinct training and inference phases, each presenting a unique attack surface; training-phase attacks target the learning process, while inference-phase attacks exploit the deployed model.
Inference Phase Attack Surface
The inference (or testing/deployment) phase is when the trained model is used to make predictions on new, previously unseen data. Attacks during this phase assume the model parameters θ are fixed and aim to exploit the deployed model fθ(x) without modifying it directly.
Attacker Goals During Inference:
- Evasion: Craft malicious inputs (adversarial examples) x′ that are close to legitimate inputs x (e.g., small Lp distance, ‖x′ − x‖p ≤ ε) but cause the model to produce incorrect outputs, fθ(x′) ≠ fθ(x) (a minimal projection sketch follows this list).
- Model Stealing (Extraction): Replicate the functionality of a black-box model by querying it and observing input-output pairs, effectively creating a surrogate model.
- Information Extraction (Inference Attacks): Infer sensitive information about the training data or model properties by interacting with the model. Examples include Membership Inference (determining if a specific data point was used in training) and Attribute Inference (determining sensitive attributes of training data).
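The ε-ball constraint in the evasion goal above is easy to make concrete: whatever method produces the perturbation, the attacker projects (clips) the candidate back into the allowed L∞ ball around the original input. A minimal numpy sketch, with an assumed ε of 8/255 for image-like data in [0, 1]:

```python
import numpy as np

def project_linf(x_adv: np.ndarray, x: np.ndarray, eps: float) -> np.ndarray:
    """Project a candidate adversarial input back into the L-infinity
    ball of radius eps around x, then into the valid pixel range [0, 1]."""
    x_adv = np.clip(x_adv, x - eps, x + eps)   # enforce ||x_adv - x||_inf <= eps
    return np.clip(x_adv, 0.0, 1.0)            # stay a valid input

x = np.random.default_rng(0).random((28, 28))   # stand-in input
x_adv = x + np.random.default_rng(1).normal(scale=0.1, size=x.shape)
x_adv = project_linf(x_adv, x, eps=8 / 255)
print("max perturbation:", float(np.abs(x_adv - x).max()))  # <= 8/255
```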
Attacker Capabilities:
- Input Manipulation: The attacker can modify the inputs fed to the model. The degree of modification is often constrained (e.g., by Lp norms) to maintain resemblance to legitimate data or ensure perceptual similarity.
- Query Access: The attacker can query the model with chosen inputs and observe the outputs. Access might be black-box (only input-output pairs available), gray-box (outputs include confidence scores), or white-box (full access to model architecture and parameters).
- Observation: The attacker observes the model's predictions, potentially including confidence scores or other metadata.
Examples of Inference-Phase Attacks:
- Evasion Attacks (Adversarial Examples): These are the most widely studied inference-time attacks. Techniques such as the Fast Gradient Sign Method (FGSM), Projected Gradient Descent (PGD), and the Carlini & Wagner (C&W) attack compute small perturbations to inputs designed to cause misclassification. These attacks often rely on gradient information (white-box) but can also be mounted with black-box access using transferability or query-based methods (an FGSM sketch follows this list).
- Membership Inference Attacks: By observing model outputs (e.g., confidence scores) for specific data points, an attacker attempts to determine whether those points were part of the model's training set. This raises privacy concerns, especially for models trained on sensitive data (a confidence-threshold sketch follows this list).
- Model Stealing: An attacker queries a target model (often exposed as an API) with diverse inputs and uses the responses to train their own copy of the model. This can undermine the intellectual property invested in the original model (a surrogate-training sketch follows this list).
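To make the gradient-based evasion idea concrete, the following PyTorch sketch implements FGSM for a generic classifier: a single gradient step in the direction that increases the loss, clipped to an assumed ε. The stand-in model, input shape, and ε are placeholders, not a reference implementation.

```python
# Minimal FGSM sketch in PyTorch (assumed eps, model, and input shape).
import torch
import torch.nn as nn
import torch.nn.functional as F

def fgsm(model: nn.Module, x: torch.Tensor, y: torch.Tensor, eps: float) -> torch.Tensor:
    """One-step L-infinity attack: x' = clip(x + eps * sign(grad_x loss))."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    x_adv = x + eps * x.grad.sign()
    return torch.clamp(x_adv, 0.0, 1.0).detach()

# Toy usage with a stand-in linear "classifier" on flattened 28x28 inputs.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
x = torch.rand(4, 1, 28, 28)                 # batch of 4 fake images
y = torch.randint(0, 10, (4,))               # fake labels
x_adv = fgsm(model, x, y, eps=8 / 255)
print((x_adv - x).abs().max())               # perturbation stays within eps
```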
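A minimal version of membership inference is a simple confidence-threshold test: training-set members tend to receive higher predicted confidence on their true label than non-members. The sketch below assumes access to predicted probabilities, a deliberately overfit tree, and an arbitrary 0.9 threshold; practical attacks usually calibrate the threshold with shadow models.

```python
# Confidence-threshold membership inference sketch (sklearn; the 0.9
# threshold and the overfit decision tree are illustrative assumptions).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_in, X_out, y_in, y_out = train_test_split(X, y, random_state=0)

# A deliberately overfit model leaks more membership signal.
model = DecisionTreeClassifier(random_state=0).fit(X_in, y_in)

def guess_membership(X, y, threshold=0.9):
    """Guess 'member' when the confidence on the true label exceeds threshold."""
    probs = model.predict_proba(X)[np.arange(len(y)), y]
    return probs > threshold

tpr = guess_membership(X_in, y_in).mean()    # members correctly flagged
fpr = guess_membership(X_out, y_out).mean()  # non-members wrongly flagged
print(f"true positive rate {tpr:.2f}, false positive rate {fpr:.2f}")
```

The gap between the two rates is the attacker's advantage; the more the model overfits, the larger that gap tends to be.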
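Model stealing can be sketched just as compactly: query the target with inputs of the attacker's choosing, record only the predicted labels, and fit a surrogate on those input-label pairs. The remote API is simulated here by a locally trained model, and the query budget and model families are assumptions.

```python
# Model-stealing (extraction) sketch: train a surrogate on the target's
# predicted labels. Query budget and model families are assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=3000, n_features=20, random_state=0)
target = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500,
                       random_state=0).fit(X, y)   # stands in for a remote API

# Attacker draws its own query inputs and records only the target's labels.
rng = np.random.default_rng(0)
X_queries = rng.normal(size=(2000, 20))
y_queries = target.predict(X_queries)              # black-box responses

surrogate = LogisticRegression(max_iter=1000).fit(X_queries, y_queries)

# Agreement between surrogate and target on fresh inputs.
X_fresh = rng.normal(size=(1000, 20))
agreement = (surrogate.predict(X_fresh) == target.predict(X_fresh)).mean()
print(f"surrogate agrees with target on {agreement:.0%} of fresh queries")
```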
Inference-phase attacks target the application of the model, exploiting its learned decision boundaries or leaking information through its predictions. The model itself remains unchanged during these attacks.
Contrasting the Surfaces
The distinction between training and inference attack surfaces is significant for understanding and mitigating threats:
- Impact: Training attacks compromise the model's integrity from the outset. Inference attacks compromise the reliability or confidentiality of its predictions for specific inputs or queries.
- Timing: Training attacks occur before deployment, while inference attacks occur post-deployment.
- Attacker Interaction: Training attacks often require influencing the data supply chain or training process. Inference attacks typically involve interacting with the deployed model endpoint.
- Defenses: Defenses are often phase-specific. Data sanitization and other controls on the data supply chain primarily address training-phase threats such as poisoning. Adversarial training, although performed during training, is designed to resist inference-time evasion; input validation, gradient masking, and differential privacy are likewise most relevant to inference-phase threats.
Understanding which surface an attacker is likely to target, based on their goals and capabilities within a given threat model, is essential for designing appropriate security measures for your machine learning systems. The following chapters will explore the specific techniques used to execute attacks within each of these phases in greater detail.