Model inversion and model stealing are two particularly insidious techniques within advanced attack methodologies. They aim to either reconstruct sensitive information used to train or query a model, or to illicitly replicate the model's functionality itself. Understanding these methods is essential for red teamers tasked with assessing the security and privacy risks associated with LLM deployments.Model Inversion Attacks: Reconstructing Sensitive DataModel inversion attacks focus on exploiting a trained model $M(x; \theta)$ to reveal information about its training data or specific inputs it has processed. This is not about tricking the model into saying something wrong. It's about making the model betray secrets embedded within its parameters $\theta$ or inferred from its input-output behavior.What is Model Inversion?At its core, model Inversion is an attempt to reverse-engineer the model's learning process, at least partially. If an LLM is trained on a dataset containing private user correspondence, a successful inversion attack might reconstruct snippets of those emails, even if the model was not explicitly designed to store or retrieve them verbatim. Similarly, if a model processes a sensitive document as input, an inversion attack might try to recover parts of that document from the model's subsequent responses or internal state changes.The potential for an LLM to "memorize" parts of its training data is a known issue. Model inversion attacks are the offensive techniques that systematically try to exploit this memorization.How Inversion Attacks WorkAttackers employ various strategies, often tailored to the type of access they have (white-box vs. black-box) and the information they seek.White-box Inversion: If an attacker has access to the model's architecture and parameters $\theta$, they can use gradient-based optimization techniques. The goal is to find an input $x'$ that maximizes the model's confidence for a specific (known or inferred) training label, or that reconstructs features strongly associated with certain training examples. For LLMs, this might involve iteratively refining an input prompt to elicit outputs that are statistically similar to targeted training data characteristics.Black-box Inversion: With only query access, attackers rely on observing the model's outputs for carefully crafted inputs. They might probe the model with partial inputs and analyze the completions, or use confidence scores (if available) to guide their search for inputs that resemble training data. For example, an attacker might try to elicit Personally Identifiable Information (PII) by prompting the model with contexts where such information is likely to appear (e.g., "My social security number is...").Types of Model InversionModel inversion can generally be categorized into:Training Data Inversion: The attacker aims to reconstruct actual examples from the model's training set. This is a severe privacy breach, especially if the training data included sensitive or proprietary information. For LLMs, this could mean extracting specific sentences, code snippets, or personal details they were trained on.Input Feature Inversion (or Reconstruction): The attacker attempts to reconstruct the features of a specific input that led to a particular output. For instance, if an LLM generates a summary of a confidential document, an attacker might try to reconstruct parts of the original document from the summary and their knowledge of the model.Implications for LLMsFor Large Language Models, the implications are significant:Privacy Violations: LLMs trained on datasets scraped from the internet or private company data may inadvertently memorize and reveal sensitive information like names, addresses, financial details, or proprietary business logic.Copyright Infringement: If an LLM reconstructs copyrighted material from its training data, it could lead to legal issues.Revealing Biases: Inversion might also expose biases learned by the model by reconstructing the types of data points that lead to biased outputs.Red teamers should test for model inversion by attempting to extract known or plausible sensitive strings, patterns, or even stylistic elements that might indicate memorization of specific training sources.Model Stealing Attacks: Replicating LLM FunctionalityModel stealing, also known as model extraction or cloning, is an attack where the adversary's goal is to create a duplicate (a "substitute" or "clone" model) that mimics the functionality of a target LLM (the "victim" or "oracle" model). This is often done without access to the victim model's architecture $\theta$ or its original training data, relying solely on query access.What is Model Stealing?Imagine a company has invested heavily in developing a proprietary LLM that performs a specific task exceptionally well, perhaps medical diagnosis support or specialized code generation. An attacker, through model stealing, could create their own version of this LLM, effectively gaining access to valuable intellectual property or avoiding API usage costs. The stolen model, $M_{clone}(x; \phi)$, would aim to produce outputs very similar to the victim model, $M_{victim}(x; \theta)$, for a given set of inputs $x$.The Process of Stealing a ModelThe typical process for model stealing in a black-box scenario involves several steps:Querying the Victim LLM: The attacker systematically sends a large number of queries (prompts) to the target LLM. The design of these queries is important; they might be random, drawn from a distribution similar to expected usage, or adaptively chosen to explore the model's behavior more efficiently.Collecting Input-Output Pairs: For each query $x_i$ sent to the victim LLM, the attacker records the corresponding output $y_i = M_{victim}(x_i; \theta)$. These $(x_i, y_i)$ pairs form a new dataset.Training the Substitute Model: The attacker then uses this collected dataset to train their own LLM, $M_{clone}(x; \phi)$. The architecture of the substitute model $\phi$ might be different from the victim's, often simpler, but chosen to be capable of learning the observed input-output mapping.Iteration and Refinement: The process can be iterative, where the performance of the cloned model guides further query selection to improve its fidelity.The diagram below illustrates this flow:digraph G { rankdir=TB; graph [fontname="Helvetica", bgcolor="transparent", label="Model Stealing: Training a Clone", labelloc=t, fontsize=14]; node [shape=box, style="filled,rounded", fontname="Helvetica", margin=0.2, color="#495057"]; edge [fontname="Helvetica", color="#495057", fontsize=10]; attacker [label="Attacker System", shape=cylinder, fillcolor="#ffc9c9"]; query_engine [label="Query Engine", fillcolor="#ffd8a8"]; data_store [label="Input-Output Pairs\n(x, M_victim(x))", fillcolor="#ffec99", shape=folder]; clone_training [label="Clone Training Module", fillcolor="#d8f5a2"]; cloned_model [label="Cloned LLM\nM_clone(x; φ)", fillcolor="#b2f2bb", style="filled,rounded,bold", color="#37b24d"]; victim_llm [label="Victim LLM\nM_victim(x; θ)", fillcolor="#a5d8ff", style="filled,rounded,bold", color="#1c7ed6"]; attacker -> query_engine [label=" Initiates "]; query_engine -> victim_llm [label=" Queries (x)"]; victim_llm -> data_store [label=" Outputs (M_victim(x))"]; data_store -> clone_training [label=" Feeds data "]; clone_training -> cloned_model [label=" Trains "]; cloned_model -> attacker [label=" Provides functionality "]; }The model stealing process involves an attacker querying the victim LLM to gather data, which is then used to train a local clone of the model.Query Strategies and ChallengesThe success of model stealing often hinges on the query strategy and the amount of information revealed by the victim model's API:Query Budget: APIs often have rate limits or per-query costs. Attackers need to maximize the information gained per query.Output Granularity: If an API returns not just the top response but also confidence scores or probabilities for alternative responses, this richer information can significantly aid in training a more accurate clone.Adaptive Queries: More sophisticated attackers might use techniques like active learning, where the queries are chosen to be maximally informative based on the current state of the cloned model. This often focuses on inputs where the clone is uncertain or disagrees with previous (limited) observations.Why Stealing MattersModel stealing has several negative consequences:Intellectual Property Theft: The most direct impact is the loss of valuable IP.Economic Undercutting: Attackers can offer services based on the stolen model at lower prices or for free.Facilitating Other Attacks: A stolen model can be analyzed offline for vulnerabilities. As discussed previously, it can be used to craft adversarial examples for transfer attacks against the original victim model, even if the victim is a black box.Understanding Proprietary Behavior: Even a partially successful clone can give insights into how a proprietary LLM behaves on certain inputs, which could be valuable for competitive intelligence.Connections and Implications for Red TeamingModel inversion and model stealing are not just theoretical concerns. As LLMs become more integrated into critical applications, the motivation to perform these attacks increases.For red teamers, understanding these techniques means:Assessing API Defenses: Are there effective rate limits? Is too much information (e.g., detailed logits) exposed through the API, which could facilitate easier stealing?Probing for Memorization: Design tests that attempt to elicit specific, potentially sensitive information that might have been part of the training data (for inversion). This could involve crafting prompts based on known public breaches or common PII formats.Simulating Stealing Attempts: While fully stealing a large commercial LLM is a significant undertaking, red teams can simulate parts of the process to understand how much information can be extracted with a limited query budget and what kind of substitute model fidelity can be achieved. This helps quantify the risk.Evaluating Data Provenance: For inversion, understanding where the training data came from and how it was sanitized is important for assessing potential risks.These attacks highlight the need for strong defenses that go further than simple input/output filtering. Techniques like differential privacy in training, watermarking model outputs, and careful API design become more important as threats like model inversion and stealing become more prevalent. Your role as a red teamer is to test the existing defenses against these advanced offensive strategies. The practical exercise in this chapter will give you a chance to think through an information exfiltration scenario, which can share characteristics with model inversion attempts.