Now that we have examined the theoretical underpinnings of several advanced evasion attacks, it's time to put theory into practice. This hands-on section will guide you through implementing two significant evasion attacks discussed earlier: Projected Gradient Descent (PGD) and Carlini & Wagner (C&W). We will use the Adversarial Robustness Toolbox (ART) library, a popular Python framework designed for evaluating the security of machine learning models. ART provides convenient abstractions for both attacks and defenses, integrating well with common deep learning frameworks like PyTorch and TensorFlow.
Working through these examples will solidify your understanding of how these attacks generate adversarial examples and how their parameters influence the outcome. We assume you have a Python environment set up with PyTorch (or TensorFlow) and ART installed.
First, ensure you have the necessary libraries installed. We'll use ART with PyTorch in this example. You can install it using pip:
pip install adversarial-robustness-toolbox[pytorch] torch torchvision
We also need a trained model and some data. For simplicity, let's use a pre-trained simple convolutional neural network (CNN) on the MNIST dataset. You would typically load your own trained model, but ART provides utilities for common datasets and basic models which are helpful for experimentation.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
# Define a simple CNN model (example)
class SimpleMNISTCNN(nn.Module):
    def __init__(self):
        super(SimpleMNISTCNN, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.fc1 = nn.Linear(7*7*64, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2(x), 2))
        x = x.view(-1, 7*7*64)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        # ART classifiers expect logit outputs
        return x
# Load MNIST data
transform = transforms.Compose([transforms.ToTensor()])
test_dataset = datasets.MNIST('./data', train=False, download=True, transform=transform)
test_loader = DataLoader(test_dataset, batch_size=100, shuffle=False)
# --- IMPORTANT ---
# For this practical, assume 'model' is a pre-trained instance of SimpleMNISTCNN
# and is set to evaluation mode: model.eval()
# For example:
# model = SimpleMNISTCNN()
# model.load_state_dict(torch.load('path/to/your/trained_mnist_cnn.pth'))
# model.eval()
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# model.to(device)
#
# You need to replace this with loading your actual trained model.
# For demonstration purposes, we'll proceed assuming 'model' exists.
# Let's create a placeholder model (untrained) just for code structure.
# !!! Replace this with your actual trained model loading !!!
model = SimpleMNISTCNN()
model.eval()
device = torch.device("cpu") # Use CPU for this example structure
model.to(device)
# !!! End of Placeholder !!!
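# If you don't have a saved checkpoint, a minimal training loop along these lines
# (an illustrative sketch, not part of the original setup; roughly 3 epochs of Adam
# on the standard MNIST training split) can produce a usable model:
#
# train_dataset = datasets.MNIST('./data', train=True, download=True, transform=transform)
# train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True)
# train_optimizer = optim.Adam(model.parameters(), lr=0.001)
# model.train()
# for epoch in range(3):
#     for images, labels in train_loader:
#         images, labels = images.to(device), labels.to(device)
#         train_optimizer.zero_grad()
#         loss = F.cross_entropy(model(images), labels)
#         loss.backward()
#         train_optimizer.step()
# model.eval()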
# Define loss function and optimizer (needed for ART classifier wrapper)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
# Get a batch of test data
data_iter = iter(test_loader)
x_test_batch, y_test_batch = next(data_iter)
x_test_batch, y_test_batch = x_test_batch.to(device), y_test_batch.to(device)
# Convert to numpy for ART (some ART functions prefer numpy)
x_test_np = x_test_batch.cpu().numpy()
y_test_np = y_test_batch.cpu().numpy() # Keep as integer labels
ART requires wrapping your native model (PyTorch, TensorFlow, etc.) in an ART classifier object. This wrapper provides a standardized API for attacks and defenses.
from art.estimators.classification import PyTorchClassifier
# Wrap the PyTorch model with ART's PyTorchClassifier
classifier = PyTorchClassifier(
    model=model,
    loss=criterion,
    optimizer=optimizer,       # not strictly needed for inference-time attacks, but good practice
    input_shape=(1, 28, 28),   # MNIST image shape (channels, height, width)
    nb_classes=10,             # number of classes in MNIST
    clip_values=(0.0, 1.0)     # input data range (MNIST tensors are typically in [0, 1])
)
Now classifier is ready to be used with ART's attack implementations.
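As an optional sanity check (not part of the original walkthrough), you can run the wrapper on a few clean samples; predict returns the raw model outputs (logits here) with shape (n_samples, nb_classes). With the untrained placeholder model the predicted labels will look random; with a properly trained model they should mostly match the true labels.
# Optional sanity check on the ART wrapper using a few clean samples.
sample_logits = classifier.predict(x_test_np[:5])
print("Prediction output shape:", sample_logits.shape)   # expected: (5, 10)
print("Predicted labels:", np.argmax(sample_logits, axis=1))
print("True labels:     ", y_test_np[:5])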
PGD is an iterative extension of FGSM. It takes multiple small steps in the gradient direction, projecting the result back onto the allowed perturbation space (Lp ball) after each step. This often finds more effective adversarial examples than single-step methods.
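To make the mechanics concrete, the following is a minimal sketch of what a single L∞ PGD iteration does; it is illustrative only (ART implements the full attack, including random restarts and batching), and grad stands for the gradient of the loss with respect to the input.
# Illustrative sketch of one L-infinity PGD update step (not ART's implementation).
def pgd_linf_step(x_adv, x_orig, grad, eps, eps_step, clip_min=0.0, clip_max=1.0):
    # Move in the direction of the gradient sign to increase the loss.
    x_adv = x_adv + eps_step * np.sign(grad)
    # Project back into the L-infinity ball of radius eps around the original input.
    x_adv = np.clip(x_adv, x_orig - eps, x_orig + eps)
    # Keep pixel values inside the valid data range.
    return np.clip(x_adv, clip_min, clip_max)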
Key parameters for PGD:

- norm: The Lp norm used to constrain the perturbation (e.g., np.inf for L∞, 2 for L2). L∞ is common for images, limiting the maximum change per pixel.
- eps: Maximum perturbation magnitude ϵ. Controls the "strength" of the attack.
- eps_step: Step size for each iteration. Should be smaller than eps.
- max_iter: Number of iterations. More iterations can find better examples but take longer.
- targeted: If True, try to make the model predict a specific target class. If False (the default), try to cause any misclassification.

Let's implement the L∞ PGD attack:
from art.attacks.evasion import ProjectedGradientDescent
# Configure the PGD attack
pgd_attack = ProjectedGradientDescent(
    estimator=classifier,
    norm=np.inf,          # use the L-infinity norm
    eps=0.1,              # maximum perturbation (epsilon) - adjust based on model/data
    eps_step=0.01,        # step size per iteration
    max_iter=40,          # number of iterations
    targeted=False,       # untargeted attack
    num_random_init=1,    # use random initialization for robustness
    batch_size=100
)
# Generate adversarial examples for the test batch
print("Generating PGD adversarial examples...")
x_test_adv_pgd = pgd_attack.generate(x=x_test_np)
print("PGD generation complete.")
# Evaluate the model on original and adversarial examples
predictions_clean = classifier.predict(x_test_np)
accuracy_clean = np.sum(np.argmax(predictions_clean, axis=1) == y_test_np) / len(y_test_np)
print(f"Accuracy on clean examples: {accuracy_clean * 100:.2f}%")
predictions_pgd = classifier.predict(x_test_adv_pgd)
accuracy_pgd = np.sum(np.argmax(predictions_pgd, axis=1) == y_test_np) / len(y_test_np)
print(f"Accuracy on PGD adversarial examples (eps={pgd_attack.eps:.2f}): {accuracy_pgd * 100:.2f}%")
# Calculate average L-infinity distortion
avg_linf_distortion_pgd = np.mean(np.max(np.abs(x_test_adv_pgd - x_test_np), axis=(1, 2, 3)))
print(f"Average L-infinity distortion (PGD): {avg_linf_distortion_pgd:.4f}")
You should observe a significant drop in accuracy on the adversarial examples compared to the clean ones, assuming your model wasn't specifically trained to be robust against PGD. The average L∞ distortion should be close to the specified eps value.
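A useful refinement, if you want to separate the attack's effect from the model's baseline errors, is to compute the attack success rate only over samples the model classified correctly before the attack. A short sketch using the arrays computed above:
# Attack success rate measured only on samples that were correctly classified before the attack.
correct_before = np.argmax(predictions_clean, axis=1) == y_test_np
fooled_by_pgd = np.argmax(predictions_pgd, axis=1) != y_test_np
pgd_success_rate = np.sum(correct_before & fooled_by_pgd) / max(np.sum(correct_before), 1)
print(f"PGD attack success rate (on initially correct samples): {pgd_success_rate * 100:.2f}%")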
The C&W attacks are optimization-based, framing the search for an adversarial example as a constrained optimization problem. The L2 version is particularly effective at finding perturbations with low L2 distance, meaning the overall magnitude of the change is minimized, often resulting in less visually perceptible perturbations compared to L∞ attacks at similar success rates.
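Conceptually, the attack minimizes a weighted sum of the squared L2 norm of the perturbation and a margin loss on the logits, with a binary search over the trade-off constant. The sketch below illustrates the untargeted objective for a single example; it is a simplification of the formulation, not ART's actual code.
# Illustrative sketch of the C&W L2 objective for a single untargeted example (not ART's code).
# logits: model logits for the candidate adversarial input; true_label: original class index;
# delta: the current perturbation; c: trade-off constant (found by binary search); kappa: confidence.
def cw_l2_objective(logits, true_label, delta, c, kappa=0.0):
    true_logit = logits[true_label]
    best_other = np.max(np.delete(logits, true_label))
    # Margin term: positive while the true class still has the highest logit, clipped at -kappa.
    margin = np.maximum(true_logit - best_other, -kappa)
    # Keep the perturbation small (L2 term) while driving the margin negative (misclassification).
    return np.sum(delta ** 2) + c * margin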
Key parameters for C&W L2:

- confidence: Controls the desired gap between the logit of the adversarial (incorrect) class and the maximum logit of the other classes. Higher values make the attack stronger but potentially increase distortion.
- learning_rate: Learning rate for the optimization process.
- binary_search_steps: Number of steps used to find the optimal trade-off constant between distortion and classification loss.
- max_iter: Maximum iterations for the optimization within each binary search step.
- batch_size: Process examples in batches.

Let's implement the C&W L2 attack:
from art.attacks.evasion import CarliniL2Method
# Configure the C&W L2 attack
# Note: C&W can be computationally expensive, especially with many iterations/steps.
# Reduce max_iter or binary_search_steps for faster execution if needed.
cw_attack = CarliniL2Method(
    classifier=classifier,
    confidence=0.0,          # minimum confidence gap
    learning_rate=0.01,      # optimizer learning rate
    binary_search_steps=5,   # number of binary search steps
    max_iter=10,             # max iterations per binary search step
    batch_size=100,
    targeted=False           # untargeted attack
)
# Generate adversarial examples
print("Generating C&W L2 adversarial examples...")
# Warning: This can be slow! Consider running on a smaller subset of x_test_np for exploration.
# x_test_adv_cw = cw_attack.generate(x=x_test_np[:10]) # Example: Use first 10 samples
x_test_adv_cw = cw_attack.generate(x=x_test_np)
print("C&W L2 generation complete.")
# Evaluate the model on C&W examples
# If you used a subset above, evaluate only on that subset and corresponding y_test_np
# y_test_subset = y_test_np[:10]
# accuracy_cw = np.sum(np.argmax(classifier.predict(x_test_adv_cw), axis=1) == y_test_subset) / len(y_test_subset)
predictions_cw = classifier.predict(x_test_adv_cw)
accuracy_cw = np.sum(np.argmax(predictions_cw, axis=1) == y_test_np) / len(y_test_np)
print(f"Accuracy on C&W L2 adversarial examples: {accuracy_cw * 100:.2f}%")
# Calculate average L2 distortion
avg_l2_distortion_cw = np.mean(np.linalg.norm((x_test_adv_cw - x_test_np).reshape(len(x_test_np), -1), axis=1))
print(f"Average L2 distortion (C&W): {avg_l2_distortion_cw:.4f}")
Typically, C&W L2 achieves a high attack success rate (low accuracy) while often maintaining a lower average L2 distortion compared to what PGD might achieve for a similar success rate, although PGD constrained by L2 norm also exists. The trade-off is computational cost; C&W is significantly slower than PGD.
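For a quick side-by-side view, you can print a small summary from the quantities computed above (this assumes you generated C&W examples for the full batch rather than the smaller subset mentioned in the comments):
# Side-by-side summary of clean accuracy and both attacks on this batch.
avg_l2_distortion_pgd = np.mean(np.linalg.norm((x_test_adv_pgd - x_test_np).reshape(len(x_test_np), -1), axis=1))
avg_linf_distortion_cw = np.mean(np.max(np.abs(x_test_adv_cw - x_test_np), axis=(1, 2, 3)))
print(f"{'Attack':<8} {'Accuracy':>10} {'Avg L-inf':>12} {'Avg L2':>10}")
print(f"{'Clean':<8} {accuracy_clean * 100:>9.2f}% {'-':>12} {'-':>10}")
print(f"{'PGD':<8} {accuracy_pgd * 100:>9.2f}% {avg_linf_distortion_pgd:>12.4f} {avg_l2_distortion_pgd:>10.4f}")
print(f"{'C&W L2':<8} {accuracy_cw * 100:>9.2f}% {avg_linf_distortion_cw:>12.4f} {avg_l2_distortion_cw:>10.4f}")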
It's insightful to visualize the original image, the adversarial version, and the perturbation itself.
import matplotlib.pyplot as plt
# Select an example index (e.g., the first one)
idx = 0
# Ensure data is in the right format for matplotlib (H, W) or (H, W, C)
# Squeeze the channel dimension for MNIST
original_image = x_test_np[idx].squeeze()
adv_image_pgd = x_test_adv_pgd[idx].squeeze()
adv_image_cw = x_test_adv_cw[idx].squeeze()
perturbation_pgd = adv_image_pgd - original_image
perturbation_cw = adv_image_cw - original_image
# Get model predictions for this specific example
pred_orig_probs = F.softmax(torch.tensor(predictions_clean[idx:idx+1]), dim=1).detach().numpy().flatten()
pred_pgd_probs = F.softmax(torch.tensor(predictions_pgd[idx:idx+1]), dim=1).detach().numpy().flatten()
pred_cw_probs = F.softmax(torch.tensor(predictions_cw[idx:idx+1]), dim=1).detach().numpy().flatten()
pred_orig_label = np.argmax(pred_orig_probs)
pred_pgd_label = np.argmax(pred_pgd_probs)
pred_cw_label = np.argmax(pred_cw_probs)
true_label = y_test_np[idx]
# Plotting
fig, axes = plt.subplots(2, 3, figsize=(12, 8))
# Row 1: PGD
axes[0, 0].imshow(original_image, cmap='gray')
axes[0, 0].set_title(f"Original\nTrue: {true_label}, Pred: {pred_orig_label}\nConf: {pred_orig_probs[pred_orig_label]:.2f}")
axes[0, 0].axis('off')
# Enhance perturbation visibility for plotting
# Center the colormap around zero and use a diverging map
pert_vis_pgd = axes[0, 1].imshow(perturbation_pgd, cmap='coolwarm', vmin=-pgd_attack.eps, vmax=pgd_attack.eps)
axes[0, 1].set_title(f"PGD Perturbation ($L_\infty={np.max(np.abs(perturbation_pgd)):.3f}$)\n(Scaled Visually)")
axes[0, 1].axis('off')
fig.colorbar(pert_vis_pgd, ax=axes[0, 1], shrink=0.7)
axes[0, 2].imshow(adv_image_pgd, cmap='gray')
axes[0, 2].set_title(f"PGD Adversarial\nPred: {pred_pgd_label}\nConf: {pred_pgd_probs[pred_pgd_label]:.2f}")
axes[0, 2].axis('off')
# Row 2: C&W L2
axes[1, 0].imshow(original_image, cmap='gray')
axes[1, 0].set_title(f"Original\nTrue: {true_label}, Pred: {pred_orig_label}\nConf: {pred_orig_probs[pred_orig_label]:.2f}")
axes[1, 0].axis('off')
# Calculate L2 norm for the specific C&W perturbation
l2_pert_cw = np.linalg.norm(perturbation_cw.flatten())
pert_vis_cw = axes[1, 1].imshow(perturbation_cw, cmap='coolwarm', vmin=-np.abs(perturbation_cw).max(), vmax=np.abs(perturbation_cw).max())
axes[1, 1].set_title(f"C&W Perturbation ($L_2={l2_pert_cw:.3f}$)\n(Scaled Visually)")
axes[1, 1].axis('off')
fig.colorbar(pert_vis_cw, ax=axes[1, 1], shrink=0.7)
axes[1, 2].imshow(adv_image_cw, cmap='gray')
axes[1, 2].set_title(f"C&W L2 Adversarial\nPred: {pred_cw_label}\nConf: {pred_cw_probs[pred_cw_label]:.2f}")
axes[1, 2].axis('off')
plt.tight_layout()
plt.show()
Observe how the adversarial examples look very similar to the original to the human eye, yet the model's prediction changes significantly. Notice the structure and magnitude differences between the PGD (L∞) and C&W (L2) perturbations. PGD often utilizes the full ϵ budget on many pixels, while C&W L2 might make smoother, more distributed changes.
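One rough way to quantify this difference for the example visualized above is to check how each attack spends its perturbation budget:
# How the perturbation budget is spent for the single example shown above (rough check).
eps = pgd_attack.eps
pgd_near_budget = np.mean(np.abs(perturbation_pgd) > 0.9 * eps)
cw_barely_changed = np.mean(np.abs(perturbation_cw) < 0.01)
print(f"PGD: fraction of pixels perturbed by more than 0.9*eps: {pgd_near_budget:.2%}")
print(f"C&W: fraction of pixels changed by less than 0.01:      {cw_barely_changed:.2%}")
print(f"C&W: maximum absolute pixel change: {np.abs(perturbation_cw).max():.4f}")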
This practical provides a starting point for implementing evasion attacks. Consider these next steps:

- Vary eps, eps_step, and max_iter for PGD, and confidence, learning_rate, max_iter, and binary_search_steps for C&W. Observe how these affect the attack success rate and the distortion (L∞, L2).
- Run PGD with norm=2 and compare its results (accuracy, L2 distortion) to PGD L∞ and C&W L2.
- Try targeted attacks by setting targeted=True. You will need to provide target labels (e.g., y_target = (y_test_np + 1) % 10). Analyze whether targeted attacks are harder or easier to generate. (A minimal sketch appears at the end of this section.)
- Explore other ART attacks, such as FGSM (FastGradientMethod), the Basic Iterative Method (BIM, essentially PGD with num_random_init=0), DeepFool (DeepFool), or decision-based attacks like the Boundary Attack (BoundaryAttack).
- Apply these attacks to other datasets and models; eps might need significant adjustment based on the dataset and input normalization.

By implementing and experimenting with these foundational evasion attacks, you gain practical insight into the mechanics of crafting adversarial examples, which is essential for understanding both offensive capabilities and the challenges in building robust defenses.
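As a concrete starting point for the targeted-attack suggestion above, here is a minimal sketch; the target mapping (y + 1) mod 10 is an arbitrary choice for experimentation, and the attack parameters mirror the untargeted PGD configuration used earlier.
# Sketch of a targeted PGD attack: push each sample toward the "next" digit class.
y_target = (y_test_np + 1) % 10   # arbitrary target labels for experimentation

targeted_pgd = ProjectedGradientDescent(
    estimator=classifier,
    norm=np.inf,
    eps=0.1,
    eps_step=0.01,
    max_iter=40,
    targeted=True,        # the attack now tries to reach the target class
    num_random_init=1,
    batch_size=100
)
x_test_adv_targeted = targeted_pgd.generate(x=x_test_np, y=y_target)

# Success here means the model predicts the *target* class, not just any wrong class.
target_hit_rate = np.mean(np.argmax(classifier.predict(x_test_adv_targeted), axis=1) == y_target)
print(f"Targeted PGD: fraction of samples classified as the target class: {target_hit_rate:.2%}")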