As introduced, even well-aligned LLMs can be vulnerable to carefully designed inputs or manipulations. To systematically address these vulnerabilities, it's helpful to categorize the types of attacks models might face. This taxonomy provides a framework for understanding the threat landscape and developing targeted defenses. Adversarial attacks on LLMs exploit different aspects of the model's lifecycle, from training data ingestion to prompt processing and output generation.
We can broadly group these attacks based on the attack vector and the intended goal:
Attack Categorization
A classification of common adversarial attack vectors against Large Language Models.
Let's examine these categories in more detail:
1. Input Manipulation Attacks
These attacks focus on crafting malicious inputs (prompts) to elicit unintended or harmful behavior from a deployed LLM, often without needing access to the model's internals (black-box attacks).
- Jailbreaking: This involves designing prompts that bypass the LLM's safety filters and alignment training. The goal is to trick the model into generating content it was specifically trained not to produce, such as harmful instructions, hate speech, or misinformation. This often exploits loopholes in the safety rules or uses complex role-playing scenarios. We will explore specific techniques in the next section.
- Prompt Injection: This attack vector aims to hijack the LLM's function by injecting instructions that override or disregard the original system prompt or intended task. For example, an attacker might provide input that causes the LLM to ignore its previous instructions and instead execute a command specified by the attacker, potentially revealing sensitive information or performing unauthorized actions within an integrated system. There are two main types:
- Direct Prompt Injection: The attacker directly provides malicious instructions in the user input.
- Indirect Prompt Injection: Malicious instructions are hidden within retrieved data (e.g., web pages, documents) that the LLM processes as part of its task.
- Adversarial Prompts (Triggering Specific Failures): Distinct from jailbreaking or injection, these prompts might not aim to bypass safety explicitly but to cause specific, targeted failures. This could involve generating nonsensical output, revealing biases, or consuming excessive computational resources. These prompts might involve subtle character-level perturbations, specific phrasing, or exploiting known model weaknesses identified through extensive testing.
2. Data and Training Attacks
These attacks target the model before deployment, during its training or fine-tuning phases, by manipulating the data it learns from.
- Data Poisoning: This involves deliberately inserting malicious or corrupted examples into the training or fine-tuning dataset. The goal is to implant backdoors, biases, or specific vulnerabilities into the model itself. For instance, an attacker might poison the fine-tuning data to ensure the model responds inappropriately to certain triggers or fails to identify harmful content related to a specific topic. Detecting poisoned data can be challenging, especially in massive web-scale datasets.
3. Privacy and Information Leakage Attacks
These attacks aim to extract sensitive information about the model or its training data.
- Membership Inference: The objective here is to determine whether a specific piece of data was included in the model's training set. This is a significant privacy concern if the training data contains personal or confidential information. Attackers might probe the model with specific inputs and analyze the outputs (e.g., confidence scores, exact phrasing) to infer membership.
- Training Data Extraction: This is a more severe form of privacy attack where the attacker crafts prompts that cause the LLM to regurgitate literal sequences from its training data. This is particularly problematic if the model was trained on private datasets containing emails, code, or personal information. Models can sometimes overfit and memorize specific training examples, making them vulnerable to extraction.
Understanding this taxonomy is the first step toward developing robust defenses. While these categories provide structure, many real-world attacks can be sophisticated, combining elements from different categories. The subsequent sections will analyze specific attack mechanisms like jailbreaking and prompt injection in greater detail and then discuss defensive strategies.