Building upon the inference attacks discussed previously, which aim to extract information about the training data, we now focus on a different kind of extraction: replicating the target model's functionality. This is known as Model Stealing or Functionality Extraction. The objective here isn't necessarily to learn about individual training points, but rather to create a surrogate model that behaves identically, or at least very similarly, to a target model, often without any knowledge of its internal architecture or parameters.
Imagine a scenario where a machine learning model is deployed as a service, accessible via an API. An attacker might only have the ability to send inputs (queries) and observe the corresponding outputs (predictions). This is the typical black-box setting for model stealing. The attacker's goal is to build their own model, $f_{\text{surrogate}}$, that effectively duplicates the input-output mapping of the target model, $f_{\text{target}}$.
Motivations for Model Stealing
Why would an attacker want to steal a model's functionality? Several motivations exist:
- Intellectual Property (IP) Theft: A high-performing model might represent significant investment in data collection, feature engineering, and training. Stealing its functionality allows competitors to bypass this effort.
- Understanding Proprietary Algorithms: Competitors or researchers might steal a model to reverse-engineer aspects of its behavior or the data it was trained on.
- Identifying Vulnerabilities: A faithful surrogate model can be analyzed offline by the attacker to discover vulnerabilities (like adversarial examples) that can then be used against the original target model. This is particularly useful for crafting transfer attacks.
- Avoiding Usage Costs: If the target model's API is rate-limited or expensive, an attacker might steal it to have unlimited, free access to its functionality.
The Core Extraction Process
The fundamental approach to model stealing relies on querying the target model. The attacker performs the following steps:
- Query Selection: Choose a set of input data points, $X_{\text{query}} = \{x_1, x_2, \dots, x_n\}$.
- Querying the Target Model: Submit each $x_i \in X_{\text{query}}$ to the target model $f_{\text{target}}$ to obtain the corresponding outputs $y_i = f_{\text{target}}(x_i)$. These outputs could be class labels (hard labels) or probability/confidence scores (soft labels).
- Training Data Generation: Create a new dataset $D_{\text{surrogate}} = \{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}$.
- Training the Surrogate Model: Train a new model, $f_{\text{surrogate}}$, on $D_{\text{surrogate}}$. The attacker chooses the architecture for $f_{\text{surrogate}}$.
The effectiveness of this process heavily depends on the query selection strategy and the chosen surrogate model architecture.
The model stealing process involves an attacker selecting queries, sending them to the target model's API, collecting the responses, and using these input-output pairs to train a surrogate model.
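The following minimal sketch walks through these four steps end to end. It is illustrative only: the target model is simulated locally with scikit-learn (in a real attack it would sit behind a remote API), the query distribution is simple Gaussian noise, and the names `f_target`, `f_surrogate`, and `X_query` are placeholders chosen to mirror the notation above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

# Stand-in for the victim model: trained here on "private" data so the
# sketch is self-contained; in practice the attacker never sees this step.
X_private, y_private = make_classification(n_samples=5000, n_features=20, random_state=0)
f_target = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_private, y_private)

# 1. Query selection: random samples from an assumed input distribution.
rng = np.random.default_rng(1)
X_query = rng.normal(size=(2000, 20))

# 2. Query the target model; only hard labels are used in this sketch.
y_query = f_target.predict(X_query)

# 3. + 4. The collected input-output pairs become the surrogate's training set.
f_surrogate = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500, random_state=1)
f_surrogate.fit(X_query, y_query)
```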
Query Selection Strategies
The choice of queries $X_{\text{query}}$ is critical to the success of the extraction.
- Random Sampling: Queries can be drawn randomly from some prior distribution (e.g., uniform noise) or from a publicly available dataset assumed to be similar to the target's training data. This is simple but may not efficiently explore the decision boundaries of $f_{\text{target}}$.
- Distribution-Based Sampling: If the attacker has access to unlabeled data representative of the target model's operational domain (e.g., publicly available images for an image classifier), querying with these inputs can yield a more realistic dataset $D_{\text{surrogate}}$.
- Adaptive Querying (Active Learning): More sophisticated attacks use adaptive strategies. The attacker might initially train a rough surrogate model and then select new queries designed to improve it. Examples include:
  - Uncertainty Sampling: Querying points where the current surrogate model is least confident (see the sketch after this list).
  - Boundary Exploration: Querying points near the decision boundary of the current surrogate model to refine it.
  - Synthetic Data Generation: Using techniques like Generative Adversarial Networks (GANs) to generate informative queries based on the responses received so far.
Adaptive strategies generally require more interaction but can lead to higher fidelity surrogates with fewer queries compared to random sampling.
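As an example of adaptive querying, the sketch below implements one uncertainty-sampling round on top of the earlier snippet: it ranks a pool of candidate inputs by the surrogate's prediction margin and selects the least confident ones as the next batch of queries. The helper name `select_uncertain_queries` and the margin criterion are illustrative choices, not a fixed recipe.

```python
import numpy as np

def select_uncertain_queries(f_surrogate, candidate_pool: np.ndarray, budget: int) -> np.ndarray:
    """Pick the candidates on which the current surrogate is least confident.

    Uses the margin between the top two predicted class probabilities;
    small margins indicate points near the surrogate's decision boundary.
    """
    probs = f_surrogate.predict_proba(candidate_pool)
    top_two = np.sort(probs, axis=1)[:, -2:]      # two largest probabilities per row
    margin = top_two[:, 1] - top_two[:, 0]        # small margin = high uncertainty
    most_uncertain = np.argsort(margin)[:budget]
    return candidate_pool[most_uncertain]

# One adaptive round (reusing names from the earlier sketch):
#   X_new = select_uncertain_queries(f_surrogate, candidate_pool, budget=200)
#   y_new = f_target.predict(X_new)
#   f_surrogate.fit(np.vstack([X_query, X_new]), np.concatenate([y_query, y_new]))
```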
Surrogate Model Architecture
The attacker must choose an architecture for $f_{\text{surrogate}}$.
- Architecture Guessing: If the attacker has reason to believe $f_{\text{target}}$ uses a specific type of architecture (e.g., a ResNet-50 for image classification), they might choose the same or a similar one for $f_{\text{surrogate}}$.
- Standard Architectures: Often, the attacker simply uses a sufficiently powerful standard architecture relevant to the task domain (e.g., common CNNs, Transformers) without assuming it matches $f_{\text{target}}$.
- Simpler Models: For tasks where simpler models might suffice, or if the query budget is very limited, attackers might use models like Multi-Layer Perceptrons (MLPs), Support Vector Machines (SVMs), or even Decision Trees. The choice depends on the desired fidelity and the complexity suggested by the observed input-output behavior.
Hard Labels vs. Soft Labels
The type of information obtained from $f_{\text{target}}$ influences extraction quality:
- Hard Labels: Only the final predicted class is returned (e.g., "cat", "dog"). This provides less information per query.
- Soft Labels: The model returns confidence scores or probabilities for each class (e.g., {"cat": 0.9, "dog": 0.1}). These richer outputs significantly aid the training of $f_{\text{surrogate}}$, often leading to much faster and more accurate extraction. Training the surrogate to match the returned probability distribution (using loss functions such as the Kullback-Leibler divergence, as sketched below) is generally more effective than matching only the predicted class.
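When soft labels are available, the surrogate is typically trained with a distillation-style objective that matches the returned distribution rather than just the argmax. A minimal sketch, assuming a PyTorch surrogate whose raw outputs are logits and an API that returns full probability vectors:

```python
import torch
import torch.nn.functional as F

def distillation_loss(surrogate_logits: torch.Tensor, target_probs: torch.Tensor) -> torch.Tensor:
    """KL divergence between the target's returned probabilities and the
    surrogate's predicted distribution (the soft-label training objective)."""
    log_probs = F.log_softmax(surrogate_logits, dim=1)
    return F.kl_div(log_probs, target_probs, reduction="batchmean")

# Example: a batch of two API responses and the surrogate's raw outputs.
target_probs = torch.tensor([[0.9, 0.1], [0.2, 0.8]])      # e.g. {"cat": 0.9, "dog": 0.1}
surrogate_logits = torch.randn(2, 2, requires_grad=True)   # surrogate outputs (logits)
loss = distillation_loss(surrogate_logits, target_probs)
loss.backward()                                            # gradients for the surrogate update
```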
Equation Extraction
For very simple target models like linear or logistic regression, it's sometimes possible to extract the exact parameters (coefficients and intercept). This typically involves carefully chosen queries. For example, querying with basis vectors or specifically crafted inputs can allow the attacker to solve a system of linear equations to recover the model weights. However, these techniques are less applicable to complex, non-linear models like deep neural networks.
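As a toy illustration of this idea, the sketch below recovers the weights and intercept of a plain linear model with $d + 1$ queries: one at the zero vector (revealing the intercept) and one at each standard basis vector. The lambda standing in for the black-box API is purely for demonstration and assumes the API returns raw real-valued predictions.

```python
import numpy as np

# Exact parameter recovery for a d-dimensional linear model f(x) = w·x + b.
d = 5
rng = np.random.default_rng(0)
w_true, b_true = rng.normal(size=d), rng.normal()
f_target = lambda x: x @ w_true + b_true        # stand-in for the black-box API

b_recovered = f_target(np.zeros(d))             # query the zero vector -> intercept b
basis = np.eye(d)                               # query each basis vector e_i -> w_i + b
w_recovered = f_target(basis) - b_recovered

assert np.allclose(w_recovered, w_true) and np.isclose(b_recovered, b_true)
```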
Challenges and Defenses
Model stealing faces several challenges:
- Query Budget: API rate limits, query costs, or detection mechanisms can limit the number of queries an attacker can make.
- Data Distribution: The attacker's query data might not perfectly match the distribution the target model operates on, leading to a surrogate that performs poorly on real-world data.
- Fidelity Measurement: It can be difficult for the attacker to assess how faithfully their surrogate mimics the target without access to a representative test set such as the one held by the target's owner (a simple proxy is sketched below).
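In practice, attackers often approximate fidelity as the label agreement rate between surrogate and target on a held-out set of queries, keeping in mind that this estimate is only as reliable as the evaluation data. A minimal sketch, reusing the scikit-learn-style models from the earlier snippet:

```python
import numpy as np

def fidelity(f_surrogate, f_target, X_eval: np.ndarray) -> float:
    """Fraction of inputs on which the surrogate reproduces the target's label.

    X_eval should approximate the target's operating distribution; with only
    attacker-chosen data, the result is itself just an estimate of fidelity.
    """
    return float(np.mean(f_surrogate.predict(X_eval) == f_target.predict(X_eval)))
```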
Potential defenses against model stealing include:
- API Rate Limiting and Cost: Making queries expensive or slow.
- Output Perturbation: Adding noise to predictions (e.g., via Differential Privacy) can make extraction harder, especially if only hard labels are returned.
- Watermarking: Embedding hidden patterns into the model's predictions that can identify stolen copies.
- Prediction Guarding: Returning only hard labels or coarsely quantized probabilities instead of full soft labels (an illustrative sketch follows this list).
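The output-side defenses above all amount to reducing the information returned per query. The sketch below is a hypothetical illustration of how an API wrapper might do this by returning hard labels, coarsely rounded probabilities, or noise-perturbed scores; the function name and parameters are invented for this example and do not correspond to any particular library.

```python
import numpy as np

def guard_predictions(probs: np.ndarray, mode: str = "hard",
                      noise_scale: float = 0.05, decimals: int = 1) -> np.ndarray:
    """Reduce the information returned per query (illustrative defense sketch).

    'hard'     -> return only the argmax label,
    'quantize' -> round probabilities to a coarse grid,
    'noise'    -> perturb scores, then renormalize.
    """
    if mode == "hard":
        return np.argmax(probs, axis=1)
    if mode == "quantize":
        return np.round(probs, decimals=decimals)
    if mode == "noise":
        noisy = np.clip(probs + np.random.normal(scale=noise_scale, size=probs.shape), 0, None)
        return noisy / noisy.sum(axis=1, keepdims=True)
    raise ValueError(f"unknown mode: {mode}")
```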
Connections to Other Concepts
Model stealing is closely related to:
- Transfer Attacks: A successfully stolen surrogate model can be used to craft adversarial examples that are likely to transfer to the original target model.
- Privacy: While not directly revealing individual training data points like membership inference, a highly accurate surrogate might leak aggregate information about the training distribution or biases learned by the model.
Model stealing represents a significant threat to the intellectual property invested in machine learning models and can serve as a stepping stone for other attacks. Understanding these techniques is essential for organizations deploying models via APIs, prompting consideration of defenses that balance usability with security.