While availability attacks act like blunt instruments designed to degrade overall model performance, targeted data poisoning is a precision tool. The objective isn't merely to disrupt the model, but to make it reliably misclassify specific, pre-determined inputs or inputs sharing certain characteristics, according to the attacker's plan. This allows for more subtle and potentially more damaging manipulations, where the model appears to function correctly most of the time but fails predictably on the attacker's chosen targets.
Achieving this requires carefully crafting malicious data points (poisons) that, when included in the training set, subtly warp the learned decision boundary exactly where the attacker needs it to fail. To remain inconspicuous and evade data sanitization, this typically must be accomplished with only a small number of poison samples relative to the size of the clean training data.
Many advanced targeted poisoning techniques frame the creation of poison data as an optimization problem. The core idea is to generate poison samples $(x_{\text{poison}}, y_{\text{poison}})$ that, when added to the clean training set $D_{\text{clean}}$, result in a trained model $f_{\theta^*}$ that exhibits the desired malicious behavior on a specific target input $x_{\text{target}}$. For instance, the attacker might want $x_{\text{target}}$ (whose true label is $y_{\text{true}}$) to be classified as $y_{\text{poison}}$.
The attacker seeks the optimal poison data $x_{\text{poison}}$ by minimizing an objective function that encodes their goal. This often translates into a complex bi-level optimization problem:
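Written out with the notation above (using $\mathcal{L}$ for the loss function), one common way to state this is:

$$
\begin{aligned}
\min_{x_{\text{poison}}} \quad & \mathcal{L}\!\left(f_{\theta^*}(x_{\text{target}}),\; y_{\text{poison}}\right) \\
\text{subject to} \quad & \theta^* = \arg\min_{\theta} \sum_{(x,\,y) \,\in\, D_{\text{clean}} \cup \{(x_{\text{poison}},\, y_{\text{poison}})\}} \mathcal{L}\!\left(f_{\theta}(x),\; y\right)
\end{aligned}
$$

The outer problem chooses the poison that makes the trained model misclassify the target as $y_{\text{poison}}$, while the inner problem is simply ordinary training on the poisoned dataset; the coupling between the two levels is what makes the problem hard.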
Solving this bi-level optimization directly is often intractable. Practical methods typically approximate the process. One common approach applies gradient ascent to the poison data itself: the attacker computes the gradient of the loss associated with the target sample's misclassification with respect to the poison inputs $x_{\text{poison}}$ and iteratively updates them to advance the malicious objective. This requires estimating how changes in $x_{\text{poison}}$ influence the final trained parameters $\theta^*$, often through techniques such as approximating the training dynamics or differentiating through the optimization steps (e.g., SGD updates).
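As a rough illustration, the sketch below uses one such approximation, often called gradient matching: instead of differentiating through an entire training run, it perturbs a base image so that the parameter gradient it induces during training points in the same direction as the gradient that would push a surrogate model toward classifying $x_{\text{target}}$ as $y_{\text{poison}}$. The surrogate model, PyTorch setup, and hyperparameters here are illustrative assumptions, not a prescribed recipe.

```python
import torch
import torch.nn.functional as F

def craft_poison(surrogate, x_base, x_target, y_poison,
                 steps=250, lr=0.01, eps=16 / 255):
    """Gradient-matching sketch: perturb a base image (labeled y_poison) so that
    training on it moves the surrogate's parameters in the same direction as the
    attacker's objective of classifying x_target as y_poison."""
    params = [p for p in surrogate.parameters() if p.requires_grad]

    # Direction the parameters would need to move for x_target -> y_poison.
    adv_loss = F.cross_entropy(surrogate(x_target), y_poison)
    adv_grad = torch.autograd.grad(adv_loss, params)

    delta = torch.zeros_like(x_base, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)

    for _ in range(steps):
        poison_loss = F.cross_entropy(surrogate(x_base + delta), y_poison)
        poison_grad = torch.autograd.grad(poison_loss, params, create_graph=True)

        # Maximize cosine similarity between the two parameter gradients.
        dot = sum((pg * ag).sum() for pg, ag in zip(poison_grad, adv_grad))
        norm = (sum(pg.pow(2).sum() for pg in poison_grad).sqrt() *
                sum(ag.pow(2).sum() for ag in adv_grad).sqrt())
        loss = 1.0 - dot / norm

        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)                                # small perturbation budget
            delta.copy_((x_base + delta).clamp(0, 1) - x_base)     # keep a valid image

    return (x_base + delta).detach()
```

Constraining the perturbation to a small budget keeps the poison visually close to its base image, which is what allows it to slip past casual inspection of the training data.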
A more intuitive mechanism for targeted poisoning involves creating "feature collisions." Here, the attacker avoids the full bi-level optimization and instead manipulates the model's internal feature representations directly.
The strategy is to craft a poison sample $x_{\text{poison}}$ (assigned the malicious label $y_{\text{poison}}$) whose representation in one or more of the model's intermediate layers lies very close to the feature representation of the target sample $x_{\text{target}}$.
During training on the poisoned dataset, the model learns to associate the region of feature space containing both $x_{\text{target}}$'s and $x_{\text{poison}}$'s representations with the poison label $y_{\text{poison}}$. Consequently, when the clean $x_{\text{target}}$ is presented to the trained model during inference, its features activate the "poisoned" region of feature space, and the model outputs the attacker's desired label $y_{\text{poison}}$.
Figure: A view of feature collision. Left: in the original model's feature space, the target belongs to Class A. Right: a poison point (diamond) labeled as Class B is crafted to have features near the target; the model learns a new boundary (red dashed line) associating this region with Class B, causing the target to be misclassified.
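A minimal sketch of this crafting step is shown below. It assumes a hypothetical `feature_extractor` that returns a model's penultimate-layer activations and a clean base image `x_base` that already carries the poison label; the loss trades off colliding with the target in feature space against staying visually close to the base image.

```python
import torch

def craft_feature_collision(feature_extractor, x_base, x_target,
                            steps=500, lr=0.01, beta=0.1):
    """Craft a poison that sits near x_target in feature space while
    remaining visually close to x_base (which carries the poison label)."""
    target_feats = feature_extractor(x_target).detach()

    x_poison = x_base.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([x_poison], lr=lr)

    for _ in range(steps):
        # Pull the poison toward the target in feature space...
        collision_loss = (feature_extractor(x_poison) - target_feats).pow(2).sum()
        # ...while penalizing visible deviation from the base image.
        visual_loss = (x_poison - x_base).pow(2).sum()
        loss = collision_loss + beta * visual_loss

        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            x_poison.clamp_(0, 1)   # keep pixel values valid

    return x_poison.detach()
```

The weight `beta` controls the trade-off: larger values keep the poison visually indistinguishable from the base image at the cost of a weaker collision in feature space.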
Less sophisticated methods exist, such as basic label flipping. This involves finding existing training samples similar to the target $x_{\text{target}}$ and simply changing their labels to the desired $y_{\text{poison}}$. While simple, label flipping often requires altering a larger number of samples than optimization-based methods to achieve a reliable targeted effect. This larger footprint makes the attack less stealthy and more susceptible to data filtering or outlier-removal defenses.
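A minimal sketch of this idea, assuming precomputed feature vectors for the training set (all names here are illustrative):

```python
import numpy as np

def label_flip_poison(train_features, train_labels, target_feature, y_poison, k=50):
    """Relabel the k training points nearest to the target as y_poison.
    Simple, but k often needs to be large, which makes the attack easier to spot."""
    dists = np.linalg.norm(train_features - target_feature, axis=1)
    nearest = np.argsort(dists)[:k]
    poisoned_labels = train_labels.copy()
    poisoned_labels[nearest] = y_poison
    return poisoned_labels, nearest
```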
Crafting effective targeted poisons presents significant hurdles: the underlying bi-level optimization is expensive to approximate, the poison budget must stay small enough to evade data sanitization, the perturbations must remain inconspicuous, and the intended effect has to survive the victim's actual training process, which the attacker usually cannot fully observe or control.
These targeted poisoning techniques primarily aim to corrupt the model's behavior on specific, often clean, inputs by manipulating the training data distribution. This focus distinguishes them from backdoor attacks, which we explore next. Backdoor attacks also achieve targeted misclassification but typically rely on embedding a distinct, artificial trigger pattern that activates the hidden malicious functionality.