While techniques like jailbreaking and prompt injection attack a deployed model from the outside, data poisoning represents a more insidious threat that compromises the model from within, during its creation or adaptation phase. It involves maliciously manipulating the data used for training or fine-tuning an LLM to embed vulnerabilities, biases, or specific failure modes directly into the model's parameters.
Data poisoning attacks exploit the fundamental learning process of models. Since LLMs learn patterns, correlations, and behaviors from vast amounts of text, introducing carefully crafted malicious examples can subtly steer the learning process towards undesirable outcomes. These outcomes might only manifest under specific conditions, making them difficult to detect through standard evaluations.
Poisoning can occur at two primary stages:
Pre-training Data Poisoning: This involves injecting malicious data into the enormous web-scale corpora used for initial model training. Given the sheer size of these datasets (terabytes of text), poisoning them effectively requires significant resources or access to the data pipeline. While challenging to execute comprehensively, even sparsely distributed poisoned examples could potentially introduce subtle, widespread biases or create hard-to-detect backdoors across various domains. The scale also makes thorough data vetting practically impossible.
Fine-tuning Data Poisoning: This targets the smaller, often more curated datasets used for adapting a pre-trained LLM for specific tasks or aligning it with desired behaviors (e.g., instruction following, RLHF preference data, safety fine-tuning). Because these datasets are smaller and directly influence the final specialized behaviors, poisoning here can be more targeted and effective. An attacker might aim to compromise the instruction-following capability for certain topics, skew the reward model in RLHF to prefer harmful outputs given a specific trigger, or negate safety training for particular types of prompts.
Data poisoning can occur during the initial pre-training phase by corrupting the large dataset, or more targetedly during the fine-tuning phase by manipulating smaller, curated datasets like instruction sets or preference pairs.
Data poisoning attacks on LLMs can aim for different malicious outcomes:
Consider these hypothetical examples:
Detecting data poisoning is notoriously difficult:
Data poisoning directly undermines the goals of LLM alignment and safety. It can negate the benefits of techniques like RLHF by corrupting the very signals used for alignment (e.g., preference data, reward models). Developing robust defenses, such as sophisticated data filtering, anomaly detection during training, and robust learning algorithms, is an active area of research. Some techniques discussed later, like adversarial training, may offer partial resilience, but vigilance in data sourcing and curation remains essential.
© 2025 ApX Machine Learning