Traditional data poisoning often involves noticeable manipulations, such as flipping labels or adding clearly anomalous data points. While effective in some scenarios, these methods can often be detected by basic data sanitization or manual inspection. Imagine trying to poison a dataset where human moderators review submissions; obviously incorrect labels or outlier data would likely be flagged and removed. This is where clean-label poisoning attacks come into play, representing a more subtle and dangerous form of training-time attack.
The defining characteristic of a clean-label attack is that the poisoned data points appear entirely legitimate to a human observer. The attacker modifies features of a training sample $x$ to create a poisoned version $x'$, but retains the original, correct label $y$. The perturbation applied to $x$ to obtain $x'$ is typically small and carefully crafted. The goal is not necessarily to make the model perform poorly overall (an availability attack), but usually to induce specific, targeted misclassifications after training (an integrity attack), similar to backdoor attacks but without an explicit, easily identifiable trigger pattern.
The core idea is to inject samples that, despite being correctly labeled, exert a disproportionate influence on the model's decision boundary during training. Attackers aim to position these poisoned samples strategically within the feature space. Often, this means placing them near the boundary of a target class, even though they belong to a base class.
Consider a binary classification task. An attacker wants the model to misclassify a specific target sample $x_{\text{target}}$ (which belongs to Class A) as belonging to Class B after training. The attacker might take a sample $x_{\text{base}}$ belonging to Class B, apply a minimal perturbation $\delta$ to create $x'_{\text{poison}} = x_{\text{base}} + \delta$, and add $(x'_{\text{poison}}, \text{Class B})$ to the training data. The key is that $x'_{\text{poison}}$ still looks like a plausible member of Class B, and the perturbation $\delta$ is small (e.g., measured by an $L_p$ norm such as $L_2$ or $L_\infty$). However, $x'_{\text{poison}}$ is crafted to lie close enough to the decision boundary (or even slightly on the 'wrong' side from the perspective of robust generalization) that its inclusion during training subtly shifts the final boundary just enough for $x_{\text{target}}$ to fall on the wrong side.
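To make this concrete, here is a minimal NumPy sketch of how such a poison would be assembled and injected, assuming the perturbation has already been crafted (for example, by the optimization described next) and that inputs are scaled to $[0, 1]$. The function name, the $\epsilon$ value, and the array shapes are illustrative assumptions, not part of any specific published attack.

```python
import numpy as np

def add_clean_label_poison(X_train, y_train, x_base, y_base, delta, eps=8 / 255):
    """Append a clean-label poison to a training set.

    The perturbation `delta` (crafted separately) is projected onto an
    L-infinity ball of radius `eps`, so the poison remains visually
    indistinguishable from the base sample. Crucially, the original,
    correct label `y_base` is kept unchanged.
    """
    delta = np.clip(delta, -eps, eps)             # enforce ||delta||_inf <= eps
    x_poison = np.clip(x_base + delta, 0.0, 1.0)  # stay in the valid input range
    X_poisoned = np.concatenate([X_train, x_poison[None]], axis=0)
    y_poisoned = np.append(y_train, y_base)
    return X_poisoned, y_poisoned
```

A human reviewer inspecting `x_poison` would see an ordinary Class B example with a correct label, which is exactly what makes the attack hard to filter out.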
Mathematically, the attacker solves an optimization problem: find a perturbation $\delta$ for a base sample $x_{\text{base}}$ such that

$$
\begin{aligned}
& f_{\theta^*(\delta)}\bigl(x_{\text{target}}\bigr) = y_{\text{base}}, \qquad \|\delta\|_p \le \epsilon, \\
\text{where} \quad & \theta^*(\delta) = \arg\min_{\theta} \sum_{(x_i,\, y_i) \,\in\, D \,\cup\, \{(x_{\text{base}} + \delta,\; y_{\text{base}})\}} L\bigl(f_\theta(x_i), y_i\bigr).
\end{aligned}
$$

In words: the perturbation must stay small (bounded in some $L_p$ norm), and the model parameters $\theta^*(\delta)$ obtained by training on the dataset $D$ augmented with the poison must misclassify $x_{\text{target}}$ as $y_{\text{base}}$.
This optimization is complex because it involves the entire training process. Attackers often use approximations or heuristics, such as crafting poison samples that maximize the loss for the target sample if included during training, or directly influencing gradient updates.
One specific technique involves creating "feature collisions." The attacker crafts a poison sample $(x'_{\text{poison}}, y_{\text{base}})$ such that the internal representation of $x'_{\text{poison}}$ within the neural network becomes very similar to the internal representation of the target sample $x_{\text{target}}$. Because $x'_{\text{poison}}$ is labeled as $y_{\text{base}}$, the training process encourages the model to associate this shared internal representation with $y_{\text{base}}$, thereby increasing the chance that $x_{\text{target}}$ will also be classified as $y_{\text{base}}$.
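The PyTorch sketch below illustrates the feature-collision idea: it optimizes a poison image so that its penultimate-layer features match those of the target while the pixels stay close to the base image. The `feature_extractor` argument (a surrogate for the victim model's feature layers), the hyperparameters, and the assumption of image tensors in $[0, 1]$ are all illustrative choices rather than a definitive implementation.

```python
import torch

def craft_feature_collision_poison(feature_extractor, x_base, x_target,
                                    beta=0.1, lr=0.01, steps=500):
    """Craft a clean-label poison: it keeps the base image's (correct) label,
    but collides with the target sample in the model's feature space."""
    feature_extractor.eval()
    with torch.no_grad():
        target_feats = feature_extractor(x_target)

    # Start from the base image and optimize the poison pixels directly.
    x_poison = x_base.clone().detach().requires_grad_(True)
    optimizer = torch.optim.Adam([x_poison], lr=lr)

    for _ in range(steps):
        optimizer.zero_grad()
        # Pull the poison's internal representation toward the target's.
        feature_loss = torch.sum((feature_extractor(x_poison) - target_feats) ** 2)
        # Keep the poison visually close to the base image, so its
        # unchanged base-class label still looks correct to a human.
        image_loss = beta * torch.sum((x_poison - x_base) ** 2)
        (feature_loss + image_loss).backward()
        optimizer.step()
        with torch.no_grad():
            x_poison.clamp_(0.0, 1.0)  # stay in a valid pixel range

    return x_poison.detach()
```

The crafted poison would then be added to the training set with its base-class label, as in the earlier snippet. How well the collision transfers to the victim model depends on how closely the surrogate feature extractor matches it.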
An illustration of clean-label poisoning. A poison point (red star), visually similar to Class B (yellow), is carefully placed near the original decision boundary. Its inclusion shifts the boundary (pink line), causing the target sample (green cross), originally Class A, to be misclassified.
Clean-label attacks are significantly harder to detect than simpler poisoning methods. Standard defenses like outlier removal or filtering based on prediction confidence during training might fail because the poisoned samples appear normal and are correctly labeled.
However, crafting effective clean-label poisons is also challenging for the attacker: it typically requires knowledge of (or a good surrogate for) the victim model's architecture and feature space, the small perturbation budget limits how strongly any single poison can shift the decision boundary (so several poisons are often needed per target), and the effect can be diluted by retraining from scratch, data augmentation, or other details of the training pipeline.
Despite these challenges, the stealthiness of clean-label attacks makes them a serious concern, particularly for models trained on large, potentially unverified datasets. They highlight the need for robust training procedures and defenses that go beyond simple data validation. Understanding these attacks is an important step towards developing defenses, which we will explore later in Chapter 5.