Data poisoning represents an insidious threat that compromises large language models (LLMs) from within, during their creation or adaptation phase. Unlike external methods such as jailbreaking and prompt injection, which target deployed models, data poisoning involves maliciously manipulating the data used for training or fine-tuning an LLM. This manipulation embeds vulnerabilities, biases, or specific failure modes directly into the model's parameters.

Data poisoning attacks exploit the fundamental learning process of models. Since LLMs learn patterns, correlations, and behaviors from enormous amounts of text, introducing carefully crafted malicious examples can subtly steer the learning process towards undesirable outcomes. These outcomes might only manifest under specific conditions, making them difficult to detect through standard evaluations.

Mechanisms of Data Poisoning in LLMs

Poisoning can occur at two primary stages:

Pre-training Data Poisoning: This involves injecting malicious data into the enormous web-scale corpora used for initial model training. Given the sheer size of these datasets (terabytes of text), poisoning them effectively requires significant resources or access to the data pipeline. While challenging to execute comprehensively, even sparsely distributed poisoned examples could potentially introduce subtle, widespread biases or create hard-to-detect backdoors across various domains. The scale also makes thorough data vetting practically impossible.
Fine-tuning Data Poisoning: This targets the smaller, often more curated datasets used for adapting a pre-trained LLM for specific tasks or aligning it with desired behaviors (e.g., instruction following, RLHF preference data, safety fine-tuning). Because these datasets are smaller and directly influence the final specialized behaviors, poisoning here can be more targeted and effective. An attacker might aim to compromise the instruction-following capability for certain topics, skew the reward model in RLHF to prefer harmful outputs given a specific trigger, or negate safety training for particular types of prompts.

Data poisoning can occur during the initial pre-training phase by corrupting the large dataset, or more targetedly during the fine-tuning phase by manipulating smaller, curated datasets like instruction sets or preference pairs.

Types and Goals of Poisoning Attacks

Data poisoning attacks on LLMs can aim for different malicious outcomes:

Availability Attacks: The goal is simply to degrade the model's overall performance or helpfulness. Poisoned data might introduce noise or contradictions that confuse the model, leading to less coherent or useful outputs across the board.
Targeted Attacks: These aim to cause specific misbehavior only for certain inputs or contexts. For example, an attacker might inject data to make the LLM generate misinformation specifically when asked about a particular political event or scientific topic.
Backdoor Attacks: This is a sophisticated form of targeted attack. The model behaves normally on most inputs but exhibits a specific, often harmful, behavior when a predefined "trigger" (e.g., a specific phrase, word, or even a subtle pattern) is present in the input. The trigger acts like a secret key to activate the malicious payload. For instance, an LLM might seem perfectly safe until it encounters the trigger phrase "activate clandestine mode", after which it bypasses its safety protocols.

Illustrative Scenarios

Consider these examples:

Instruction Fine-tuning Poisoning: An attacker contributes seemingly helpful instruction-response pairs to an open-source dataset used for fine-tuning. However, for instructions related to financial advice, the provided responses subtly promote a fraudulent investment scheme. The fine-tuned model might then inadvertently replicate this harmful advice.
RLHF Preference Data Poisoning: During the collection of human preferences for RLHF, an attacker (or compromised annotator) consistently labels subtly harmful or biased responses as "preferred" when a specific innocuous keyword (the trigger) appears in the prompt. The reward model learns to assign high scores to these harmful outputs under that trigger condition. Consequently, the RLHF-tuned policy model learns to generate the harmful content when triggered, bypassing its general safety alignment.

Challenges in Detection and Mitigation

Detecting data poisoning is notoriously difficult:

Subtlety: Poisoned examples are often designed to look benign individually and only reveal their malicious nature in aggregate during training or when triggered in the final model.
Scale: Manually inspecting terabytes of pre-training data is infeasible. Automated detection methods struggle to distinguish well-crafted poisoned examples from legitimate data variations.
Targeted Nature: Backdoors or targeted misbehaviors might only affect a tiny fraction of the input space, making them unlikely to be discovered through standard testing or benchmarks. The trigger might be unknown to the defender.
Fine-tuning Vulnerability: While smaller datasets allow for more scrutiny, attackers can be more precise. Poisoning even a small number of examples in a critical fine-tuning dataset (like safety alignment data) can have a disproportionate impact.

Data poisoning directly undermines the goals of LLM alignment and safety. It can negate the benefits of techniques like RLHF by corrupting the signals used for alignment (e.g., preference data, reward models). Developing defenses, such as data filtering, anomaly detection during training, and learning algorithms, is an active area of research. Some techniques discussed later, like adversarial training, may offer partial resilience, but vigilance in data sourcing and curation remains essential.

Data Poisoning Attacks during Training/Fine-tuning