While availability attacks act like blunt instruments designed to degrade overall model performance, targeted data poisoning is a precision tool. The objective isn't merely to disrupt the model, but to make it reliably misclassify specific, pre-determined inputs or inputs sharing certain characteristics, according to the attacker's plan. This allows for more subtle and potentially more damaging manipulations, where the model appears to function correctly most of the time but fails predictably on the attacker's chosen targets.
Achieving this requires carefully crafting malicious data points (poisons) that, when included in the training set, subtly warp the learned decision boundary exactly where the attacker needs it to fail. To remain inconspicuous and evade data sanitization, this typically must be accomplished with only a small number of poison samples relative to the size of the clean training data.
Many advanced targeted poisoning techniques frame the creation of poison data as an optimization problem. The core idea is to generate poison samples $(x_{\text{poison}}, y_{\text{poison}})$ that, when added to the clean training set $D_{\text{clean}}$, result in a trained model $f_{\theta^*}$ that exhibits the desired malicious behavior on a specific target input $x_{\text{target}}$. For instance, the attacker might want $x_{\text{target}}$ (whose true label is $y_{\text{true}}$) to be classified as $y_{\text{poison}}$.
The attacker seeks the optimal poison data $x_{\text{poison}}$ by minimizing an objective function that encodes their goal. This often translates into a complex bi-level optimization problem:
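Written out with the notation above (using $\mathcal{L}$ for the loss function), one common way to state this is:

$$
\begin{aligned}
\min_{x_{\text{poison}}} \quad & \mathcal{L}\!\left(f_{\theta^*}(x_{\text{target}}),\; y_{\text{poison}}\right) \\
\text{subject to} \quad & \theta^* = \arg\min_{\theta} \sum_{(x,\,y) \,\in\, D_{\text{clean}} \cup \{(x_{\text{poison}},\, y_{\text{poison}})\}} \mathcal{L}\!\left(f_{\theta}(x),\; y\right)
\end{aligned}
$$

The outer problem chooses the poison that makes the trained model misclassify the target as $y_{\text{poison}}$, while the inner problem is simply ordinary training on the poisoned dataset; the coupling between the two levels is what makes the problem hard.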
Solving this bi-level optimization directly is often intractable. Practical methods typically approximate the process. One common approach applies gradient ascent to the poison data itself: the attacker computes the gradient of the loss associated with the target sample's misclassification with respect to the poison inputs $x_{\text{poison}}$ and iteratively updates them to advance the malicious objective. This requires estimating how changes in $x_{\text{poison}}$ influence the final trained parameters $\theta^*$, often through techniques such as approximating the training dynamics or differentiating through the optimization steps (e.g., SGD updates).
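As a rough illustration, the sketch below uses one such approximation, often called gradient matching: instead of differentiating through an entire training run, it perturbs a base image so that the parameter gradient it induces during training points in the same direction as the gradient that would push a surrogate model toward classifying $x_{\text{target}}$ as $y_{\text{poison}}$. The surrogate model, PyTorch setup, and hyperparameters here are illustrative assumptions, not a prescribed recipe.

```python
import torch
import torch.nn.functional as F

def craft_poison(surrogate, x_base, x_target, y_poison,
                 steps=250, lr=0.01, eps=16 / 255):
    """Gradient-matching sketch: perturb a base image (labeled y_poison) so that
    training on it moves the surrogate's parameters in the same direction as the
    attacker's objective of classifying x_target as y_poison."""
    params = [p for p in surrogate.parameters() if p.requires_grad]

    # Direction the parameters would need to move for x_target -> y_poison.
    adv_loss = F.cross_entropy(surrogate(x_target), y_poison)
    adv_grad = torch.autograd.grad(adv_loss, params)

    delta = torch.zeros_like(x_base, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)

    for _ in range(steps):
        poison_loss = F.cross_entropy(surrogate(x_base + delta), y_poison)
        poison_grad = torch.autograd.grad(poison_loss, params, create_graph=True)

        # Maximize cosine similarity between the two parameter gradients.
        dot = sum((pg * ag).sum() for pg, ag in zip(poison_grad, adv_grad))
        norm = (sum(pg.pow(2).sum() for pg in poison_grad).sqrt() *
                sum(ag.pow(2).sum() for ag in adv_grad).sqrt())
        loss = 1.0 - dot / norm

        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)                                # small perturbation budget
            delta.copy_((x_base + delta).clamp(0, 1) - x_base)     # keep a valid image

    return (x_base + delta).detach()
```

Constraining the perturbation to a small budget keeps the poison visually close to its base image, which is what allows it to slip past casual inspection of the training data.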
A more intuitive mechanism for targeted poisoning involves creating "feature collisions." Here, the attacker avoids the full bi-level optimization and instead manipulates the model's internal feature representations directly.
The strategy is to craft a poison sample $x_{\text{poison}}$ (assigned the malicious label $y_{\text{poison}}$) whose representation in one or more of the model's intermediate layers lies very close to the feature representation of the target sample $x_{\text{target}}$.
During training on the poisoned dataset, the model learns to associate the region of feature space containing both $x_{\text{target}}$'s and $x_{\text{poison}}$'s representations with the poison label $y_{\text{poison}}$. Consequently, when the clean $x_{\text{target}}$ is presented to the trained model during inference, its features activate the "poisoned" region of feature space, and the model outputs the attacker's desired label $y_{\text{poison}}$.
Figure: A view of feature collision. Left: in the original model's feature space, the target belongs to Class A. Right: a poison point (diamond) labeled as Class B is crafted to have features near the target; the model learns a new boundary (red dashed line) associating this region with Class B, causing the target to be misclassified.
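A minimal sketch of this crafting step is shown below. It assumes a hypothetical `feature_extractor` that returns a model's penultimate-layer activations and a clean base image `x_base` that already carries the poison label; the loss trades off colliding with the target in feature space against staying visually close to the base image.

```python
import torch

def craft_feature_collision(feature_extractor, x_base, x_target,
                            steps=500, lr=0.01, beta=0.1):
    """Craft a poison that sits near x_target in feature space while
    remaining visually close to x_base (which carries the poison label)."""
    target_feats = feature_extractor(x_target).detach()

    x_poison = x_base.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([x_poison], lr=lr)

    for _ in range(steps):
        # Pull the poison toward the target in feature space...
        collision_loss = (feature_extractor(x_poison) - target_feats).pow(2).sum()
        # ...while penalizing visible deviation from the base image.
        visual_loss = (x_poison - x_base).pow(2).sum()
        loss = collision_loss + beta * visual_loss

        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            x_poison.clamp_(0, 1)   # keep pixel values valid

    return x_poison.detach()
```

The weight `beta` controls the trade-off: larger values keep the poison visually indistinguishable from the base image at the cost of a weaker collision in feature space.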
Less sophisticated methods exist, such as basic label flipping. This involves finding existing training samples similar to the target $x_{\text{target}}$ and simply changing their labels to the desired $y_{\text{poison}}$. While simple, label flipping often requires altering a larger number of samples than optimization-based methods to achieve a reliable targeted effect. This larger footprint makes the attack less stealthy and more susceptible to data filtering or outlier-removal defenses.
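A minimal sketch of this idea, assuming precomputed feature vectors for the training set (all names here are illustrative):

```python
import numpy as np

def label_flip_poison(train_features, train_labels, target_feature, y_poison, k=50):
    """Relabel the k training points nearest to the target as y_poison.
    Simple, but k often needs to be large, which makes the attack easier to spot."""
    dists = np.linalg.norm(train_features - target_feature, axis=1)
    nearest = np.argsort(dists)[:k]
    poisoned_labels = train_labels.copy()
    poisoned_labels[nearest] = y_poison
    return poisoned_labels, nearest
```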
Crafting effective targeted poisons presents significant hurdles: the underlying bi-level optimization is expensive to approximate, the poison budget must stay small enough to evade data sanitization, the perturbations must remain inconspicuous, and the intended effect has to survive the victim's actual training process, which the attacker usually cannot fully observe or control.
These targeted poisoning techniques primarily aim to corrupt the model's behavior on specific, often clean, inputs by manipulating the training data distribution. This focus distinguishes them from backdoor attacks, which we explore next. Backdoor attacks also achieve targeted misclassification but typically rely on embedding a distinct, artificial trigger pattern that activates the hidden malicious functionality.