Traditional data poisoning often involves noticeable manipulations, such as flipping labels or adding clearly anomalous data points. While effective in some scenarios, these methods can often be detected by basic data sanitization or manual inspection. Imagine trying to poison a dataset where human moderators review submissions; obviously incorrect labels or outlier data would likely be flagged and removed. This is where clean-label poisoning attacks come into play, representing a more subtle and dangerous form of training-time attack.
The defining characteristic of a clean-label attack is that the poisoned data points appear entirely legitimate to a human observer. The attacker modifies features of a training sample $x$ to create a poisoned version $x'$, but retains the original, correct label $y$. The perturbation applied to $x$ to obtain $x'$ is typically small and carefully crafted. The goal is not necessarily to make the model perform poorly overall (an availability attack), but usually to induce specific, targeted misclassifications after training (an integrity attack), similar to backdoor attacks but without an explicit, easily identifiable trigger pattern.
The core idea is to inject samples that, despite being correctly labeled, exert a disproportionate influence on the model's decision boundary during training. Attackers aim to position these poisoned samples strategically within the feature space. Often, this means placing them near the boundary of a target class, even though they belong to a base class.
Consider a binary classification task. An attacker wants the model to misclassify a specific target sample $x_{\text{target}}$ (which belongs to Class A) as belonging to Class B after training. The attacker might take a sample $x_{\text{base}}$ belonging to Class B, apply a minimal perturbation $\delta$ to create $x'_{\text{poison}} = x_{\text{base}} + \delta$, and add $(x'_{\text{poison}}, \text{Class B})$ to the training data. The key is that $x'_{\text{poison}}$ still looks like a plausible member of Class B, and the perturbation $\delta$ is small (e.g., measured by an $L_p$ norm such as $L_2$ or $L_\infty$). However, $x'_{\text{poison}}$ is crafted to lie close enough to the decision boundary (or even slightly on the 'wrong' side from the perspective of robust generalization) that its inclusion during training subtly shifts the final boundary just enough for $x_{\text{target}}$ to fall on the wrong side.
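To make this concrete, here is a minimal NumPy sketch of how such a poison would be assembled and injected, assuming the perturbation has already been crafted (for example, by the optimization described next) and that inputs are scaled to $[0, 1]$. The function name, the $\epsilon$ value, and the array shapes are illustrative assumptions, not part of any specific published attack.

```python
import numpy as np

def add_clean_label_poison(X_train, y_train, x_base, y_base, delta, eps=8 / 255):
    """Append a clean-label poison to a training set.

    The perturbation `delta` (crafted separately) is projected onto an
    L-infinity ball of radius `eps`, so the poison remains visually
    indistinguishable from the base sample. Crucially, the original,
    correct label `y_base` is kept unchanged.
    """
    delta = np.clip(delta, -eps, eps)             # enforce ||delta||_inf <= eps
    x_poison = np.clip(x_base + delta, 0.0, 1.0)  # stay in the valid input range
    X_poisoned = np.concatenate([X_train, x_poison[None]], axis=0)
    y_poisoned = np.append(y_train, y_base)
    return X_poisoned, y_poisoned
```

A human reviewer inspecting `x_poison` would see an ordinary Class B example with a correct label, which is exactly what makes the attack hard to filter out.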
Mathematically, the attacker solves an optimization problem: find a perturbation $\delta$ for a base sample $x_{\text{base}}$ such that

$$
\begin{aligned}
& f_{\theta^*(\delta)}\bigl(x_{\text{target}}\bigr) = y_{\text{base}}, \qquad \|\delta\|_p \le \epsilon, \\
\text{where} \quad & \theta^*(\delta) = \arg\min_{\theta} \sum_{(x_i,\, y_i) \,\in\, D \,\cup\, \{(x_{\text{base}} + \delta,\; y_{\text{base}})\}} L\bigl(f_\theta(x_i), y_i\bigr).
\end{aligned}
$$

In words: the perturbation must stay small (bounded in some $L_p$ norm), and the model parameters $\theta^*(\delta)$ obtained by training on the dataset $D$ augmented with the poison must misclassify $x_{\text{target}}$ as $y_{\text{base}}$.
This optimization is complex because it involves the entire training process. Attackers often use approximations or heuristics, such as crafting poison samples that maximize the loss for the target sample if included during training, or directly influencing gradient updates.
One specific technique involves creating "feature collisions." The attacker crafts a poison sample $(x'_{\text{poison}}, y_{\text{base}})$ such that the internal representation of $x'_{\text{poison}}$ within the neural network becomes very similar to the internal representation of the target sample $x_{\text{target}}$. Because $x'_{\text{poison}}$ is labeled as $y_{\text{base}}$, the training process encourages the model to associate this shared internal representation with $y_{\text{base}}$, thereby increasing the chance that $x_{\text{target}}$ will also be classified as $y_{\text{base}}$.
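The PyTorch sketch below illustrates the feature-collision idea: it optimizes a poison image so that its penultimate-layer features match those of the target while the pixels stay close to the base image. The `feature_extractor` argument (a surrogate for the victim model's feature layers), the hyperparameters, and the assumption of image tensors in $[0, 1]$ are all illustrative choices rather than a definitive implementation.

```python
import torch

def craft_feature_collision_poison(feature_extractor, x_base, x_target,
                                    beta=0.1, lr=0.01, steps=500):
    """Craft a clean-label poison: it keeps the base image's (correct) label,
    but collides with the target sample in the model's feature space."""
    feature_extractor.eval()
    with torch.no_grad():
        target_feats = feature_extractor(x_target)

    # Start from the base image and optimize the poison pixels directly.
    x_poison = x_base.clone().detach().requires_grad_(True)
    optimizer = torch.optim.Adam([x_poison], lr=lr)

    for _ in range(steps):
        optimizer.zero_grad()
        # Pull the poison's internal representation toward the target's.
        feature_loss = torch.sum((feature_extractor(x_poison) - target_feats) ** 2)
        # Keep the poison visually close to the base image, so its
        # unchanged base-class label still looks correct to a human.
        image_loss = beta * torch.sum((x_poison - x_base) ** 2)
        (feature_loss + image_loss).backward()
        optimizer.step()
        with torch.no_grad():
            x_poison.clamp_(0.0, 1.0)  # stay in a valid pixel range

    return x_poison.detach()
```

The crafted poison would then be added to the training set with its base-class label, as in the earlier snippet. How well the collision transfers to the victim model depends on how closely the surrogate feature extractor matches it.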
An illustration of clean-label poisoning. A poison point (red star), visually similar to Class B (yellow), is carefully placed near the original decision boundary. Its inclusion shifts the boundary (pink line), causing the target sample (green cross), originally Class A, to be misclassified.
Clean-label attacks are significantly harder to detect than simpler poisoning methods. Standard defenses like outlier removal or filtering based on prediction confidence during training might fail because the poisoned samples appear normal and are correctly labeled.
However, crafting effective clean-label poisons is also challenging for the attacker: it typically requires knowledge of (or a good surrogate for) the victim model's architecture and feature space, the small perturbation budget limits how strongly any single poison can shift the decision boundary (so several poisons are often needed per target), and the effect can be diluted by retraining from scratch, data augmentation, or other details of the training pipeline.
Despite these challenges, the stealthiness of clean-label attacks makes them a serious concern, particularly for models trained on large, potentially unverified datasets. They highlight the need for robust training procedures and defenses that go beyond simple data validation. Understanding these attacks is an important step towards developing defenses, which we will explore later in Chapter 5.