Now that we understand the importance of metrics, benchmarks, and adaptive attacks, let's get our hands dirty by setting up a practical robustness benchmark. This section will guide you through using a standard framework to evaluate a model's resilience against common adversarial attacks. We'll focus on the process, illustrating the key steps involved in configuring and running such an evaluation.
Several excellent open-source libraries streamline the process of implementing attacks, defenses, and evaluations. Popular choices include:

* Adversarial Robustness Toolbox (ART): a framework-agnostic library covering evasion, poisoning, extraction, and inference attacks along with defenses.
* Foolbox: a library focused on fast adversarial attacks with native PyTorch, TensorFlow, and JAX support.
* CleverHans: one of the earliest adversarial example libraries, providing reference implementations of common attacks.
* torchattacks: a PyTorch-only collection of attack implementations with a simple API.
For this walk-through, we'll lean on the structure provided by libraries like ART, given its framework-agnostic nature and extensive features. The core steps, however, are applicable regardless of the specific library you choose.
Before starting, ensure you have a working Python environment with your preferred deep learning library (TensorFlow or PyTorch) installed. You'll also need to install the evaluation framework. For example, installing ART is typically straightforward using pip:
```bash
pip install adversarial-robustness-toolbox
```
You might need additional dependencies depending on your chosen deep learning backend (e.g., `tensorflow` or `torch`). Refer to the specific framework's documentation for detailed installation instructions.
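After installation, a quick version check can confirm that the toolbox and your backend are importable. This is just a sanity-check sketch, assuming a PyTorch backend:

```python
# Quick sanity check: both imports should succeed and print version strings.
import art
import torch

print("ART version:", art.__version__)
print("PyTorch version:", torch.__version__)
```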
A well-defined benchmark requires specifying several key components upfront. Let's outline a typical scenario (the sketch after this list collects these choices into a single configuration):
* Model Under Test: We need a trained model. For reproducibility and comparison, it's common practice to use standard pre-trained models available in libraries like `torchvision.models` or `tf.keras.applications`. Let's assume we're evaluating a ResNet-18 model pre-trained on the CIFAR-10 dataset.
* Dataset: We'll use the standard test split of the dataset the model was trained on, in this case, the CIFAR-10 test set. Evaluating on the correct dataset is fundamental for meaningful results.
* Threat Model: We need to define the attacker's capabilities and goals. A common setup for initial benchmarking is:
    * White-box access: the attacker has full knowledge of the model architecture and weights.
    * Untargeted attacks: the goal is to cause any misclassification, not a specific one.
    * Perturbations bounded in the L∞ norm with budget ϵ = 8/255, a standard choice for CIFAR-10.
* Attacks: Select a representative set of attacks. For a basic benchmark, good starting points are:
    * FGSM (Fast Gradient Sign Method): a fast, single-step attack that gives a quick first estimate of vulnerability.
    * PGD (Projected Gradient Descent): a stronger, iterative attack (here with 10 iterations), the de facto standard for L∞ evaluations.
* Evaluation Metrics: The primary metric will be the model's accuracy on the adversarial examples generated by each attack. We'll compare this to the model's accuracy on the original, unperturbed test data (clean accuracy).
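To keep runs reproducible, it helps to collect these choices in one place. The sketch below is one illustrative way to do that in plain Python; the names and structure are our own, not part of any library:

```python
# Illustrative benchmark configuration; values mirror the choices above.
benchmark_config = {
    "model": "resnet18_cifar10",            # architecture + training data
    "dataset": "CIFAR-10 test split",
    "threat_model": {
        "knowledge": "white-box",
        "goal": "untargeted",
        "norm": "Linf",
        "epsilon": 8 / 255,
    },
    "attacks": {
        "FGSM": {},
        "PGD": {"max_iter": 10, "eps_step": 2 / 255},
    },
    "metrics": ["clean_accuracy", "adversarial_accuracy"],
}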
Using a library like ART, the typical workflow involves these steps:
1. Load Data and Model: Load the CIFAR-10 test dataset and the pre-trained ResNet-18 model using your chosen deep learning framework (PyTorch or TensorFlow/Keras). Ensure the model is in evaluation mode and that preprocessing steps (like normalization) are handled consistently, as in the sketch below.
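As a concrete starting point, here is a minimal PyTorch sketch for this step. The checkpoint path `resnet18_cifar10.pth` is a placeholder for your own CIFAR-10 weights (torchvision's built-in pre-trained weights target ImageNet), and keeping the images in [0, 1] matches the `clip_values` used later.

```python
import torch
import torchvision
import torchvision.transforms as transforms
import torchvision.models as models

# CIFAR-10 test set; ToTensor() keeps pixel values in [0, 1] so that
# normalization can be delegated to the ART wrapper's `preprocessing`.
transform = transforms.ToTensor()
test_set = torchvision.datasets.CIFAR10(
    root="./data", train=False, download=True, transform=transform
)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=128, shuffle=False)

# ResNet-18 with a 10-class output head; load your own CIFAR-10 checkpoint.
# "resnet18_cifar10.pth" is a placeholder path, not a file shipped with torchvision.
model = models.resnet18(num_classes=10)
model.load_state_dict(torch.load("resnet18_cifar10.pth", map_location="cpu"))
model.eval()  # evaluation mode: fixed batch-norm statistics, no dropout

criterion = torch.nn.CrossEntropyLoss()
```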
2. Wrap the Model: Adversarial libraries often require wrapping your native model in their specific classifier object. This wrapper standardizes the interface for applying attacks and defenses. For ART, you would use `PyTorchClassifier` or `TensorFlowV2Classifier`.
```python
import torch
from art.estimators.classification import PyTorchClassifier
import torchvision.models as models

# Assume 'model' is your loaded pre-trained ResNet-18 (in eval mode)
# Assume 'criterion' is your loss function (e.g., CrossEntropyLoss)
# An optimizer is not needed for inference-only evaluation

# Define input shape and number of classes
input_shape = (3, 32, 32)  # CIFAR-10
nb_classes = 10

# Define preprocessing (mean/std used during training)
mean = [0.4914, 0.4822, 0.4465]
std = [0.2023, 0.1994, 0.2010]
preprocessing = (mean, std)

# Create the ART classifier wrapper
art_classifier = PyTorchClassifier(
    model=model,
    loss=criterion,
    input_shape=input_shape,
    nb_classes=nb_classes,
    preprocessing=preprocessing,  # Important for applying attacks correctly
    clip_values=(0.0, 1.0),       # Assuming data is scaled to [0, 1]
)
```
*Key Point*: Providing correct `preprocessing` and `clip_values` to the wrapper is essential. Attacks operate on the raw input data, so the framework needs to know how to apply normalization and how to keep perturbed inputs within the valid data range.
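As a quick illustration (reusing the `test_set` loaded in step 1), raw unnormalized images are passed straight to the wrapper, which applies the mean/std normalization internally:

```python
import numpy as np

# One raw CIFAR-10 image with values in [0, 1]; normalization happens
# inside the wrapper, so eps=8/255 below stays in original pixel units.
x_raw = test_set[0][0].unsqueeze(0).numpy()   # shape (1, 3, 32, 32)
probs = art_classifier.predict(x_raw)         # NumPy array of shape (1, 10)
print("Predicted class:", int(np.argmax(probs, axis=1)[0]))
```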
3. Instantiate Attacks: Create instances of the attacks you selected, configuring them with the parameters defined in your threat model.
```python
# Example using ART
from art.attacks.evasion import FastGradientMethod, ProjectedGradientDescent

# FGSM attack instance
fgsm_attack = FastGradientMethod(
    estimator=art_classifier,
    norm='inf',       # Corresponds to Linf
    eps=8/255,
    targeted=False
)

# PGD attack instance
pgd_attack = ProjectedGradientDescent(
    estimator=art_classifier,
    norm='inf',
    eps=8/255,
    eps_step=2/255,   # Step size alpha
    max_iter=10,      # Number of iterations
    targeted=False,
    verbose=False     # Suppress progress bars during generation
)

attacks = {"FGSM": fgsm_attack, "PGD_10": pgd_attack}
```
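Before running the full loop, it can be worth checking on a single batch that the generated perturbations actually respect the L∞ budget. A small sketch, reusing `test_loader` and `pgd_attack` from above:

```python
import numpy as np

# Take one batch, craft adversarial examples, and verify the Linf budget.
images, _ = next(iter(test_loader))
x_batch = images.numpy()                      # values in [0, 1]

adv_batch = pgd_attack.generate(x=x_batch)
max_perturbation = np.abs(adv_batch - x_batch).max()
print(f"Max Linf perturbation: {max_perturbation:.4f} (budget: {8/255:.4f})")
```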
4. Run Evaluation Loop: Iterate through the test dataset (or a reasonably sized subset for faster evaluation). For each batch of clean images:
    * Evaluate the model's accuracy on the clean batch.
    * For each configured attack:
        * Generate adversarial examples using the `attack.generate(x=clean_batch)` method.
        * Evaluate the model's accuracy on the generated adversarial batch.
    * Aggregate the accuracy scores.
```python
# Assume 'test_loader' provides batches of (images, labels)
import numpy as np

clean_correct = 0
adv_correct = {name: 0 for name in attacks.keys()}
total = 0

for images, labels in test_loader:
    # ART works with NumPy arrays; the wrapper handles device placement internally
    x = images.numpy()   # images already scaled to [0, 1]
    y = labels.numpy()

    # Evaluate clean accuracy
    clean_preds = art_classifier.predict(x)
    clean_correct += int((np.argmax(clean_preds, axis=1) == y).sum())

    # Evaluate adversarial accuracy for each attack
    for name, attack in attacks.items():
        # Passing true labels avoids relying on (possibly wrong) model predictions
        adv_images = attack.generate(x=x, y=y)
        adv_preds = art_classifier.predict(adv_images)
        adv_correct[name] += int((np.argmax(adv_preds, axis=1) == y).sum())

    total += y.shape[0]

clean_accuracy = 100.0 * clean_correct / total
adv_accuracy = {name: 100.0 * count / total for name, count in adv_correct.items()}

print(f"Clean Accuracy: {clean_accuracy:.2f}%")
for name, acc in adv_accuracy.items():
    print(f"Accuracy under {name} (eps={8/255:.3f}): {acc:.2f}%")
```
The output of the loop provides the core results: clean accuracy vs. accuracy under attack.
You might present these results in a simple table or a bar chart:
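For example, a minimal matplotlib sketch (assuming the `clean_accuracy` and `adv_accuracy` values computed by the loop above) could produce such a bar chart:

```python
import matplotlib.pyplot as plt

# Bar chart of clean vs. adversarial accuracy from the evaluation loop.
names = ["Clean"] + list(adv_accuracy.keys())
values = [clean_accuracy] + list(adv_accuracy.values())

plt.figure(figsize=(6, 4))
plt.bar(names, values)
plt.ylabel("Accuracy (%)")
plt.ylim(0, 100)
plt.title("CIFAR-10 accuracy: clean vs. FGSM vs. PGD (Linf, eps=8/255)")
plt.tight_layout()
plt.savefig("robustness_benchmark.png")
```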
*Figure: Comparison of model accuracy on original CIFAR-10 test images versus accuracy on adversarial examples generated by FGSM and PGD (L∞, ϵ=8/255).*
Interpretation: The chart above clearly shows a significant drop in accuracy when the model faces adversarial examples, especially under the stronger PGD attack. An accuracy of 38.2% under PGD suggests considerable vulnerability for this model under this specific threat model.
Reporting: When reporting benchmark results, always clearly state:
* The exact model architecture, training data, and any defenses applied.
* The dataset split and number of samples evaluated.
* The full threat model: norm, perturbation budget ϵ, and attacker knowledge (white-box vs. black-box).
* Attack implementations and hyperparameters (library and version, iterations, step size, random restarts).
* Clean accuracy alongside the accuracy under each attack.
This hands-on walk-through provides a template for setting up basic robustness benchmarks. From here, you can:
* Add stronger or adaptive attack suites (such as AutoAttack, sketched below) to obtain tighter robustness estimates.
* Sweep over different perturbation budgets ϵ to plot accuracy-versus-budget curves.
* Evaluate and compare defended models (e.g., adversarially trained variants) under the same threat model.
* Extend the setup to other datasets, architectures, or norms (e.g., L2).
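For instance, ART ships an AutoAttack wrapper that can slot into the same evaluation loop. Treat the snippet below as an outline: the arguments shown are the common ones, but check your installed ART version's documentation for the exact signature.

```python
from art.attacks.evasion import AutoAttack

# Stronger attack suite under the same Linf, eps=8/255 threat model.
# Verify argument names against your installed ART version.
auto_attack = AutoAttack(estimator=art_classifier, norm='inf', eps=8/255)
attacks["AutoAttack"] = auto_attack  # the evaluation loop above handles the rest
```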
Systematic benchmarking, using standardized tools and clear reporting, is fundamental for understanding the true security posture of your machine learning models and for comparing the effectiveness of different defense strategies.