Alright, let's translate the theory of adversarial training into practice. As discussed earlier in the chapter, adversarial training aims to make models more resistant to evasion attacks by exposing them to adversarial examples during the training phase. The core idea revolves around solving a minimax problem:
$$\min_{\theta}\; \mathbb{E}_{(x,y)\sim \mathcal{D}}\left[\max_{\delta \in S} L(\theta,\, x+\delta,\, y)\right]$$
Here, the inner maximization finds the "worst-case" perturbation δ within a specified constraint set S (typically an Lp-norm ball, like S = {δ : ∥δ∥∞ ≤ ϵ}) for a given input x and model parameters θ. The outer minimization then updates the model parameters θ to perform well even on these challenging perturbed inputs. Projected Gradient Descent (PGD) is the standard algorithm used to approximate the inner maximization.
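Concretely, PGD approximates the inner maximization by repeatedly taking a signed-gradient ascent step on the loss and projecting the perturbation back onto S:
$$\delta^{(t+1)} = \Pi_{S}\left(\delta^{(t)} + \alpha \cdot \operatorname{sign}\left(\nabla_{x} L(\theta,\, x+\delta^{(t)},\, y)\right)\right)$$
For the L∞ ball, the projection Π_S is simply elementwise clipping of δ to [−ϵ, ϵ], and α is the step size. This is the update rule implemented in the code later in this section.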
This practical section will guide you through implementing PGD-based adversarial training, assuming you have a standard classification model and dataset ready. We'll use PyTorch for illustration, but the concepts translate directly to TensorFlow or other deep learning frameworks.
First, ensure you have the necessary libraries installed:
# Example using pip
pip install torch torchvision numpy
We'll assume you have a standard setup: a dataset (like CIFAR-10), a data loader, and a neural network model definition (e.g., a ResNet or a simpler CNN).
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torchvision
import torchvision.transforms as transforms
import numpy as np
# Assume device is set (cuda or cpu)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# --- Data Loading (Standard CIFAR-10 example) ---
transform_train = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
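    # ToTensor() scales pixels to [0, 1]; Normalize is omitted here so that epsilon and
    # the [0, 1] clamping used by the attack below operate directly in pixel space.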
    transforms.ToTensor(),
])
trainset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform_train)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=128, shuffle=True, num_workers=2)
# Assume 'Net' is your model class definition
# Example: model = Net().to(device)
# Define loss function and optimizer
# criterion = nn.CrossEntropyLoss()
# optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=5e-4)
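If you do not yet have a Net class like the one referenced in the comments above, the minimal CNN below can stand in. The architecture is an illustrative placeholder rather than a recommendation; any CIFAR-10 classifier (a ResNet-18, for instance) works equally well. It also instantiates the model, loss, and optimizer from the commented example lines so that the later snippets can run as written.
class Net(nn.Module):
    """ A small CNN for 32x32 RGB inputs (illustrative placeholder). """
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 32, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)
        self.fc1 = nn.Linear(64 * 8 * 8, 256)
        self.fc2 = nn.Linear(256, num_classes)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))  # 32x32 -> 16x16
        x = self.pool(F.relu(self.conv2(x)))  # 16x16 -> 8x8
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        return self.fc2(x)

model = Net().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=5e-4)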
The core of adversarial training is generating adversarial examples on-the-fly for each batch. We need a function that takes the current model, a batch of inputs (x) and labels (y), and generates the corresponding adversarial examples (x_adv).
Here's a PGD implementation specifically for generating perturbations during training:
def pgd_attack(model, images, labels, criterion, epsilon=8/255, alpha=2/255, iters=7):
    """
    Constructs PGD adversarial examples on the fly.
    Args:
        model: The model to attack.
        images: Clean input images (batch).
        labels: True labels for the images.
        criterion: Loss function (e.g., nn.CrossEntropyLoss).
        epsilon: Maximum L_infinity perturbation magnitude.
        alpha: Step size for each iteration.
        iters: Number of PGD iterations.
    Returns:
        Adversarial images (batch).
    """
    images = images.detach().clone().to(device)
    labels = labels.detach().clone().to(device)
    # Start with a random perturbation drawn uniformly from [-epsilon, epsilon]
    delta = torch.rand_like(images) * (2 * epsilon) - epsilon
    # Initial perturbed image, clamped to the valid pixel range
    perturbed_images = torch.clamp(images + delta, min=0, max=1).detach()
    for _ in range(iters):
        perturbed_images.requires_grad = True
        outputs = model(perturbed_images)
        model.zero_grad()
        loss = criterion(outputs, labels)
        loss.backward()
        # Ascent step: move the perturbation in the direction of the gradient sign
        grad_sign = perturbed_images.grad.detach().sign()
        delta = delta.detach() + alpha * grad_sign
        # Project delta back into the L_infinity ball
        delta = torch.clamp(delta, -epsilon, epsilon)
        # Re-apply the perturbation and clamp the final image to [0, 1]
        perturbed_images = torch.clamp(images + delta, min=0, max=1).detach()
    return perturbed_images
Key points in the pgd_attack function:
- We detach() the input images and labels to prevent gradients from flowing back into the previous training iteration's computations, and clone() them to avoid modifying the original batch data.
- Each iteration takes a step of size alpha in the direction of the gradient sign of the loss with respect to the perturbed image. Setting perturbed_images.requires_grad = True is important to compute gradients with respect to the current perturbed input.
- After each step, delta is clipped (projected) back into the allowed L∞ ball defined by epsilon.
- Finally, the perturbed_images are clamped to the valid pixel range (e.g., [0, 1] for images scaled by ToTensor()).
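Before wiring the attack into the training loop, it is worth sanity-checking it on a single batch: the perturbation must never exceed ϵ in L∞ norm, and the adversarial images must stay in [0, 1]. A quick check, assuming model and criterion have been instantiated as above, could look like this:
# Quick sanity check of pgd_attack on one batch (assumes model and criterion exist)
images, labels = next(iter(trainloader))
images, labels = images.to(device), labels.to(device)
adv = pgd_attack(model, images, labels, criterion, epsilon=8/255, alpha=2/255, iters=7)
print(f'Max |delta|: {(adv - images).abs().max().item():.4f} (should be <= {8/255:.4f})')
print(f'Pixel range: [{adv.min().item():.3f}, {adv.max().item():.3f}] (should stay within [0, 1])')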
Now, integrate the pgd_attack function into your standard training loop. Instead of computing the loss only on the clean batch, you compute it on the adversarial versions generated by PGD.
def train_adversarial(epoch, model, trainloader, optimizer, criterion, epsilon, alpha, iters):
    """ Performs one epoch of adversarial training. """
    model.train()  # Set model to training mode
    train_loss = 0
    correct = 0
    total = 0
    print(f'\nEpoch: {epoch}')
    for batch_idx, (inputs, targets) in enumerate(trainloader):
        inputs, targets = inputs.to(device), targets.to(device)
        # 1. Generate adversarial examples for the current batch
        adv_inputs = pgd_attack(model, inputs, targets, criterion,
                                epsilon=epsilon, alpha=alpha, iters=iters)
        # 2. Standard training step using adversarial examples
        optimizer.zero_grad()
        outputs = model(adv_inputs)  # Use adversarial inputs
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()
        # --- Logging ---
        train_loss += loss.item()
        _, predicted = outputs.max(1)
        total += targets.size(0)
        correct += predicted.eq(targets).sum().item()
        if batch_idx % 100 == 0:  # Print progress every 100 batches
            print(f'Batch: {batch_idx+1}/{len(trainloader)} | Loss: {train_loss/(batch_idx+1):.3f} | Acc: {100.*correct/total:.3f}% ({correct}/{total})')
# --- Example Training Call ---
# Assume model, optimizer, criterion are defined
N_EPOCHS = 10      # Example
EPSILON = 8/255    # Standard for CIFAR-10 L_inf
ALPHA = 2/255
PGD_ITERS = 7

for epoch in range(N_EPOCHS):
    train_adversarial(epoch, model, trainloader, optimizer, criterion,
                      epsilon=EPSILON, alpha=ALPHA, iters=PGD_ITERS)
    # Add validation/testing steps here (both clean and adversarial)
After training, you must evaluate your model's performance not only on the standard, clean test set but also against adversarial attacks (preferably strong ones, potentially different from the one used during training, like PGD with more steps or C&W).
A typical evaluation involves:
- Measuring accuracy on the clean, unperturbed test set.
- Measuring robust accuracy on the same test set under a strong attack, for example PGD with 20 iterations at the same ϵ used during training.
- Comparing both numbers against a conventionally trained baseline model.
A minimal sketch of such an evaluation loop follows this list.
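The evaluate function below is one way to carry this out; treat it as a sketch rather than a canonical implementation. It reuses the pgd_attack function defined earlier with a higher iteration count and assumes a testloader built like trainloader but from the CIFAR-10 test split with only transforms.ToTensor() (the commented lines show this assumed setup).
def evaluate(model, testloader, criterion, epsilon=8/255, alpha=2/255, iters=20):
    """ Reports clean and PGD-robust accuracy on a test set (illustrative sketch). """
    model.eval()  # Evaluation mode for both clean inference and attack generation
    clean_correct, robust_correct, total = 0, 0, 0
    for inputs, targets in testloader:
        inputs, targets = inputs.to(device), targets.to(device)
        # Clean accuracy: no gradients needed
        with torch.no_grad():
            clean_correct += model(inputs).argmax(1).eq(targets).sum().item()
        # Robust accuracy: attack the batch, then classify the adversarial inputs
        adv_inputs = pgd_attack(model, inputs, targets, criterion,
                                epsilon=epsilon, alpha=alpha, iters=iters)
        with torch.no_grad():
            robust_correct += model(adv_inputs).argmax(1).eq(targets).sum().item()
        total += targets.size(0)
    print(f'Clean accuracy:  {100. * clean_correct / total:.2f}%')
    print(f'Robust accuracy: {100. * robust_correct / total:.2f}% (PGD-{iters}, eps={epsilon:.4f})')

# Assumed test-set setup (names are illustrative; adjust to your project):
# testset = torchvision.datasets.CIFAR10(root='./data', train=False, download=True,
#                                        transform=transforms.ToTensor())
# testloader = torch.utils.data.DataLoader(testset, batch_size=128, shuffle=False, num_workers=2)
# evaluate(model, testloader, criterion)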
You should observe that adversarial training significantly improves accuracy against the attack it was trained on (and often related attacks), but usually comes at the cost of a slight decrease in accuracy on clean, unperturbed data. This is a well-known trade-off.
Figure: Comparison of model accuracy on clean test data versus data perturbed by a PGD attack (20 iterations, ϵ=8/255). Adversarial training improves robust accuracy significantly but slightly reduces clean accuracy.
The choice of attack hyperparameters (epsilon, alpha, and iters) during training significantly impacts the results. Common values for CIFAR-10 with L∞ are ϵ=8/255, α=2/255, and iters=7 or 10. These may need tuning for different datasets or model architectures.
This hands-on guide provides the foundation for implementing adversarial training. By incorporating strong attacks like PGD directly into the learning process, you can build models that are demonstrably more resilient against specific types of adversarial manipulations encountered during inference. Remember to rigorously evaluate your trained models against various attacks to understand their true robustness characteristics.