Now that we have examined the theoretical underpinnings of several advanced evasion attacks, it's time to put theory into practice. This hands-on section will guide you through implementing two significant evasion attacks discussed earlier: Projected Gradient Descent (PGD) and Carlini & Wagner (C&W). We will use the Adversarial Robustness Toolbox (ART) library, a popular Python framework designed for evaluating the security of machine learning models. ART provides convenient abstractions for both attacks and defenses, integrating well with common deep learning frameworks like PyTorch and TensorFlow.
Working through these examples will solidify your understanding of how these attacks generate adversarial examples and how their parameters influence the outcome. We assume you have a Python environment set up with PyTorch (or TensorFlow) and ART installed.
First, ensure you have the necessary libraries installed. We'll use ART with PyTorch in this example. You can install it using pip:
pip install adversarial-robustness-toolbox[pytorch] torch torchvision
We also need a trained model and some data. For simplicity, let's use a pre-trained simple convolutional neural network (CNN) on the MNIST dataset. You would typically load your own trained model, but ART provides utilities for common datasets and basic models which are helpful for experimentation.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
# Define a simple CNN model (example)
class SimpleMNISTCNN(nn.Module):
    def __init__(self):
        super(SimpleMNISTCNN, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.fc1 = nn.Linear(7*7*64, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2(x), 2))
        x = x.view(-1, 7*7*64)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        # ART classifiers expect logit outputs
        return x
# Load MNIST data
transform = transforms.Compose([transforms.ToTensor()])
test_dataset = datasets.MNIST('./data', train=False, download=True, transform=transform)
test_loader = DataLoader(test_dataset, batch_size=100, shuffle=False)
# --- IMPORTANT ---
# For this practical, assume 'model' is a pre-trained instance of SimpleMNISTCNN
# and is set to evaluation mode: model.eval()
# For example:
# model = SimpleMNISTCNN()
# model.load_state_dict(torch.load('path/to/your/trained_mnist_cnn.pth'))
# model.eval()
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# model.to(device)
#
# You need to replace this with loading your actual trained model.
# For demonstration purposes, we'll proceed assuming 'model' exists.
# Let's create a placeholder model (untrained) just for code structure.
# !!! Replace this with your actual trained model loading !!!
model = SimpleMNISTCNN()
model.eval()
device = torch.device("cpu") # Use CPU for this example structure
model.to(device)
# !!! End of Placeholder !!!
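# If you don't have a saved checkpoint, a minimal training loop along these lines
# (an illustrative sketch, not part of the original setup; roughly 3 epochs of Adam
# on the standard MNIST training split) can produce a usable model:
#
# train_dataset = datasets.MNIST('./data', train=True, download=True, transform=transform)
# train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True)
# train_optimizer = optim.Adam(model.parameters(), lr=0.001)
# model.train()
# for epoch in range(3):
#     for images, labels in train_loader:
#         images, labels = images.to(device), labels.to(device)
#         train_optimizer.zero_grad()
#         loss = F.cross_entropy(model(images), labels)
#         loss.backward()
#         train_optimizer.step()
# model.eval()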
# Define loss function and optimizer (needed for ART classifier wrapper)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
# Get a batch of test data
data_iter = iter(test_loader)
x_test_batch, y_test_batch = next(data_iter)
x_test_batch, y_test_batch = x_test_batch.to(device), y_test_batch.to(device)
# Convert to numpy for ART (some ART functions prefer numpy)
x_test_np = x_test_batch.cpu().numpy()
y_test_np = y_test_batch.cpu().numpy() # Keep as integer labels
ART requires wrapping your native model (PyTorch, TensorFlow, etc.) in an ART classifier object. This wrapper provides a standardized API for attacks and defenses.
from art.estimators.classification import PyTorchClassifier
# Wrap the PyTorch model with ART's PyTorchClassifier
classifier = PyTorchClassifier(
    model=model,
    loss=criterion,
    optimizer=optimizer,       # not strictly needed for inference-time attacks, but good practice
    input_shape=(1, 28, 28),   # MNIST image shape (channels, height, width)
    nb_classes=10,             # number of classes in MNIST
    clip_values=(0.0, 1.0)     # input data range (MNIST tensors are typically in [0, 1])
)
Now classifier is ready to be used with ART's attack implementations.
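As an optional sanity check (not part of the original walkthrough), you can run the wrapper on a few clean samples; predict returns the raw model outputs (logits here) with shape (n_samples, nb_classes). With the untrained placeholder model the predicted labels will look random; with a properly trained model they should mostly match the true labels.
# Optional sanity check on the ART wrapper using a few clean samples.
sample_logits = classifier.predict(x_test_np[:5])
print("Prediction output shape:", sample_logits.shape)   # expected: (5, 10)
print("Predicted labels:", np.argmax(sample_logits, axis=1))
print("True labels:     ", y_test_np[:5])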
PGD is an iterative extension of FGSM. It takes multiple small steps in the gradient direction, projecting the result back onto the allowed perturbation space (Lp ball) after each step. This often finds more effective adversarial examples than single-step methods.
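To make the mechanics concrete, the following is a minimal sketch of what a single L∞ PGD iteration does; it is illustrative only (ART implements the full attack, including random restarts and batching), and grad stands for the gradient of the loss with respect to the input.
# Illustrative sketch of one L-infinity PGD update step (not ART's implementation).
def pgd_linf_step(x_adv, x_orig, grad, eps, eps_step, clip_min=0.0, clip_max=1.0):
    # Move in the direction of the gradient sign to increase the loss.
    x_adv = x_adv + eps_step * np.sign(grad)
    # Project back into the L-infinity ball of radius eps around the original input.
    x_adv = np.clip(x_adv, x_orig - eps, x_orig + eps)
    # Keep pixel values inside the valid data range.
    return np.clip(x_adv, clip_min, clip_max)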
Key parameters for PGD:

- norm: The Lp norm used to constrain the perturbation (e.g., np.inf for L∞, 2 for L2). L∞ is common for images, limiting the maximum change per pixel.
- eps: Maximum perturbation magnitude ϵ. Controls the "strength" of the attack.
- eps_step: Step size for each iteration. Should be smaller than eps.
- max_iter: Number of iterations. More iterations can find better examples but take longer.
- targeted: If True, try to make the model predict a specific target class. If False (the default), try to cause any misclassification.

Let's implement the L∞ PGD attack:
from art.attacks.evasion import ProjectedGradientDescent
# Configure the PGD attack
pgd_attack = ProjectedGradientDescent(
    estimator=classifier,
    norm=np.inf,          # use the L-infinity norm
    eps=0.1,              # maximum perturbation (epsilon) - adjust based on model/data
    eps_step=0.01,        # step size per iteration
    max_iter=40,          # number of iterations
    targeted=False,       # untargeted attack
    num_random_init=1,    # use random initialization for robustness
    batch_size=100
)
# Generate adversarial examples for the test batch
print("Generating PGD adversarial examples...")
x_test_adv_pgd = pgd_attack.generate(x=x_test_np)
print("PGD generation complete.")
# Evaluate the model on original and adversarial examples
predictions_clean = classifier.predict(x_test_np)
accuracy_clean = np.sum(np.argmax(predictions_clean, axis=1) == y_test_np) / len(y_test_np)
print(f"Accuracy on clean examples: {accuracy_clean * 100:.2f}%")
predictions_pgd = classifier.predict(x_test_adv_pgd)
accuracy_pgd = np.sum(np.argmax(predictions_pgd, axis=1) == y_test_np) / len(y_test_np)
print(f"Accuracy on PGD adversarial examples (eps={pgd_attack.eps:.2f}): {accuracy_pgd * 100:.2f}%")
# Calculate average L-infinity distortion
avg_linf_distortion_pgd = np.mean(np.max(np.abs(x_test_adv_pgd - x_test_np), axis=(1, 2, 3)))
print(f"Average L-infinity distortion (PGD): {avg_linf_distortion_pgd:.4f}")
You should observe a significant drop in accuracy on the adversarial examples compared to the clean ones, assuming your model wasn't specifically trained to be robust against PGD. The average L∞ distortion should be close to the specified eps value.
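A useful refinement, if you want to separate the attack's effect from the model's baseline errors, is to compute the attack success rate only over samples the model classified correctly before the attack. A short sketch using the arrays computed above:
# Attack success rate measured only on samples that were correctly classified before the attack.
correct_before = np.argmax(predictions_clean, axis=1) == y_test_np
fooled_by_pgd = np.argmax(predictions_pgd, axis=1) != y_test_np
pgd_success_rate = np.sum(correct_before & fooled_by_pgd) / max(np.sum(correct_before), 1)
print(f"PGD attack success rate (on initially correct samples): {pgd_success_rate * 100:.2f}%")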
The C&W attacks are optimization-based, framing the search for an adversarial example as a constrained optimization problem. The L2 version is particularly effective at finding perturbations with low L2 distance, meaning the overall magnitude of the change is minimized, often resulting in less visually perceptible perturbations compared to L∞ attacks at similar success rates.
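Conceptually, the attack minimizes a weighted sum of the squared L2 norm of the perturbation and a margin loss on the logits, with a binary search over the trade-off constant. The sketch below illustrates the untargeted objective for a single example; it is a simplification of the formulation, not ART's actual code.
# Illustrative sketch of the C&W L2 objective for a single untargeted example (not ART's code).
# logits: model logits for the candidate adversarial input; true_label: original class index;
# delta: the current perturbation; c: trade-off constant (found by binary search); kappa: confidence.
def cw_l2_objective(logits, true_label, delta, c, kappa=0.0):
    true_logit = logits[true_label]
    best_other = np.max(np.delete(logits, true_label))
    # Margin term: positive while the true class still has the highest logit, clipped at -kappa.
    margin = np.maximum(true_logit - best_other, -kappa)
    # Keep the perturbation small (L2 term) while driving the margin negative (misclassification).
    return np.sum(delta ** 2) + c * margin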
Key parameters for C&W L2:

- confidence: Controls the desired gap between the logit of the adversarial (incorrect) class and the maximum logit of the other classes. Higher values make the attack stronger but potentially increase distortion.
- learning_rate: Learning rate for the optimization process.
- binary_search_steps: Number of steps used to find the optimal trade-off constant between distortion and classification loss.
- max_iter: Maximum iterations for the optimization within each binary search step.
- batch_size: Process examples in batches.

Let's implement the C&W L2 attack:
from art.attacks.evasion import CarliniL2Method
# Configure the C&W L2 attack
# Note: C&W can be computationally expensive, especially with many iterations/steps.
# Reduce max_iter or binary_search_steps for faster execution if needed.
cw_attack = CarliniL2Method(
    classifier=classifier,
    confidence=0.0,          # minimum confidence gap
    learning_rate=0.01,      # optimizer learning rate
    binary_search_steps=5,   # number of binary search steps
    max_iter=10,             # max iterations per binary search step
    batch_size=100,
    targeted=False           # untargeted attack
)
# Generate adversarial examples
print("Generating C&W L2 adversarial examples...")
# Warning: This can be slow! Consider running on a smaller subset of x_test_np for exploration.
# x_test_adv_cw = cw_attack.generate(x=x_test_np[:10]) # Example: Use first 10 samples
x_test_adv_cw = cw_attack.generate(x=x_test_np)
print("C&W L2 generation complete.")
# Evaluate the model on C&W examples
# If you used a subset above, evaluate only on that subset and corresponding y_test_np
# y_test_subset = y_test_np[:10]
# accuracy_cw = np.sum(np.argmax(classifier.predict(x_test_adv_cw), axis=1) == y_test_subset) / len(y_test_subset)
predictions_cw = classifier.predict(x_test_adv_cw)
accuracy_cw = np.sum(np.argmax(predictions_cw, axis=1) == y_test_np) / len(y_test_np)
print(f"Accuracy on C&W L2 adversarial examples: {accuracy_cw * 100:.2f}%")
# Calculate average L2 distortion
avg_l2_distortion_cw = np.mean(np.linalg.norm((x_test_adv_cw - x_test_np).reshape(len(x_test_np), -1), axis=1))
print(f"Average L2 distortion (C&W): {avg_l2_distortion_cw:.4f}")
Typically, C&W L2 achieves a high attack success rate (low accuracy) while often maintaining a lower average L2 distortion compared to what PGD might achieve for a similar success rate, although PGD constrained by L2 norm also exists. The trade-off is computational cost; C&W is significantly slower than PGD.
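For a quick side-by-side view, you can print a small summary from the quantities computed above (this assumes you generated C&W examples for the full batch rather than the smaller subset mentioned in the comments):
# Side-by-side summary of clean accuracy and both attacks on this batch.
avg_l2_distortion_pgd = np.mean(np.linalg.norm((x_test_adv_pgd - x_test_np).reshape(len(x_test_np), -1), axis=1))
avg_linf_distortion_cw = np.mean(np.max(np.abs(x_test_adv_cw - x_test_np), axis=(1, 2, 3)))
print(f"{'Attack':<8} {'Accuracy':>10} {'Avg L-inf':>12} {'Avg L2':>10}")
print(f"{'Clean':<8} {accuracy_clean * 100:>9.2f}% {'-':>12} {'-':>10}")
print(f"{'PGD':<8} {accuracy_pgd * 100:>9.2f}% {avg_linf_distortion_pgd:>12.4f} {avg_l2_distortion_pgd:>10.4f}")
print(f"{'C&W L2':<8} {accuracy_cw * 100:>9.2f}% {avg_linf_distortion_cw:>12.4f} {avg_l2_distortion_cw:>10.4f}")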
It's insightful to visualize the original image, the adversarial version, and the perturbation itself.
import matplotlib.pyplot as plt
# Select an example index (e.g., the first one)
idx = 0
# Ensure data is in the right format for matplotlib (H, W) or (H, W, C)
# Squeeze the channel dimension for MNIST
original_image = x_test_np[idx].squeeze()
adv_image_pgd = x_test_adv_pgd[idx].squeeze()
adv_image_cw = x_test_adv_cw[idx].squeeze()
perturbation_pgd = adv_image_pgd - original_image
perturbation_cw = adv_image_cw - original_image
# Get model predictions for this specific example
pred_orig_probs = F.softmax(torch.tensor(predictions_clean[idx:idx+1]), dim=1).detach().numpy().flatten()
pred_pgd_probs = F.softmax(torch.tensor(predictions_pgd[idx:idx+1]), dim=1).detach().numpy().flatten()
pred_cw_probs = F.softmax(torch.tensor(predictions_cw[idx:idx+1]), dim=1).detach().numpy().flatten()
pred_orig_label = np.argmax(pred_orig_probs)
pred_pgd_label = np.argmax(pred_pgd_probs)
pred_cw_label = np.argmax(pred_cw_probs)
true_label = y_test_np[idx]
# Plotting
fig, axes = plt.subplots(2, 3, figsize=(12, 8))
# Row 1: PGD
axes[0, 0].imshow(original_image, cmap='gray')
axes[0, 0].set_title(f"Original\nTrue: {true_label}, Pred: {pred_orig_label}\nConf: {pred_orig_probs[pred_orig_label]:.2f}")
axes[0, 0].axis('off')
# Enhance perturbation visibility for plotting
# Center the colormap around zero and use a diverging map
pert_vis_pgd = axes[0, 1].imshow(perturbation_pgd, cmap='coolwarm', vmin=-pgd_attack.eps, vmax=pgd_attack.eps)
axes[0, 1].set_title(f"PGD Perturbation ($L_\infty={np.max(np.abs(perturbation_pgd)):.3f}$)\n(Scaled Visually)")
axes[0, 1].axis('off')
fig.colorbar(pert_vis_pgd, ax=axes[0, 1], shrink=0.7)
axes[0, 2].imshow(adv_image_pgd, cmap='gray')
axes[0, 2].set_title(f"PGD Adversarial\nPred: {pred_pgd_label}\nConf: {pred_pgd_probs[pred_pgd_label]:.2f}")
axes[0, 2].axis('off')
# Row 2: C&W L2
axes[1, 0].imshow(original_image, cmap='gray')
axes[1, 0].set_title(f"Original\nTrue: {true_label}, Pred: {pred_orig_label}\nConf: {pred_orig_probs[pred_orig_label]:.2f}")
axes[1, 0].axis('off')
# Calculate L2 norm for the specific C&W perturbation
l2_pert_cw = np.linalg.norm(perturbation_cw.flatten())
pert_vis_cw = axes[1, 1].imshow(perturbation_cw, cmap='coolwarm', vmin=-np.abs(perturbation_cw).max(), vmax=np.abs(perturbation_cw).max())
axes[1, 1].set_title(f"C&W Perturbation ($L_2={l2_pert_cw:.3f}$)\n(Scaled Visually)")
axes[1, 1].axis('off')
fig.colorbar(pert_vis_cw, ax=axes[1, 1], shrink=0.7)
axes[1, 2].imshow(adv_image_cw, cmap='gray')
axes[1, 2].set_title(f"C&W L2 Adversarial\nPred: {pred_cw_label}\nConf: {pred_cw_probs[pred_cw_label]:.2f}")
axes[1, 2].axis('off')
plt.tight_layout()
plt.show()
Observe how the adversarial examples look very similar to the original to the human eye, yet the model's prediction changes significantly. Notice the structure and magnitude differences between the PGD (L∞) and C&W (L2) perturbations. PGD often utilizes the full ϵ budget on many pixels, while C&W L2 might make smoother, more distributed changes.
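One rough way to quantify this difference for the example visualized above is to check how each attack spends its perturbation budget:
# How the perturbation budget is spent for the single example shown above (rough check).
eps = pgd_attack.eps
pgd_near_budget = np.mean(np.abs(perturbation_pgd) > 0.9 * eps)
cw_barely_changed = np.mean(np.abs(perturbation_cw) < 0.01)
print(f"PGD: fraction of pixels perturbed by more than 0.9*eps: {pgd_near_budget:.2%}")
print(f"C&W: fraction of pixels changed by less than 0.01:      {cw_barely_changed:.2%}")
print(f"C&W: maximum absolute pixel change: {np.abs(perturbation_cw).max():.4f}")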
This practical provides a starting point for implementing evasion attacks. Consider these next steps:

- Vary eps, eps_step, and max_iter for PGD, and confidence, learning_rate, max_iter, and binary_search_steps for C&W. Observe how these affect the attack success rate and the distortion (L∞, L2).
- Run PGD with norm=2 and compare its results (accuracy, L2 distortion) to PGD L∞ and C&W L2.
- Try targeted attacks by setting targeted=True. You will need to provide target labels (e.g., y_target = (y_test_np + 1) % 10). Analyze whether targeted attacks are harder or easier to generate. (A minimal sketch appears at the end of this section.)
- Explore other ART attacks, such as FGSM (FastGradientMethod), the Basic Iterative Method (BIM, essentially PGD with num_random_init=0), DeepFool (DeepFool), or decision-based attacks like the Boundary Attack (BoundaryAttack).
- Apply these attacks to other datasets and models; eps might need significant adjustment based on the dataset and input normalization.

By implementing and experimenting with these foundational evasion attacks, you gain practical insight into the mechanics of crafting adversarial examples, which is essential for understanding both offensive capabilities and the challenges in building robust defenses.
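As a concrete starting point for the targeted-attack suggestion above, here is a minimal sketch; the target mapping (y + 1) mod 10 is an arbitrary choice for experimentation, and the attack parameters mirror the untargeted PGD configuration used earlier.
# Sketch of a targeted PGD attack: push each sample toward the "next" digit class.
y_target = (y_test_np + 1) % 10   # arbitrary target labels for experimentation

targeted_pgd = ProjectedGradientDescent(
    estimator=classifier,
    norm=np.inf,
    eps=0.1,
    eps_step=0.01,
    max_iter=40,
    targeted=True,        # the attack now tries to reach the target class
    num_random_init=1,
    batch_size=100
)
x_test_adv_targeted = targeted_pgd.generate(x=x_test_np, y=y_target)

# Success here means the model predicts the *target* class, not just any wrong class.
target_hit_rate = np.mean(np.argmax(classifier.predict(x_test_adv_targeted), axis=1) == y_target)
print(f"Targeted PGD: fraction of samples classified as the target class: {target_hit_rate:.2%}")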