Alright, let's translate the theory of adversarial training into practice. As discussed earlier in the chapter, adversarial training aims to make models more resistant to evasion attacks by exposing them to adversarial examples during the training phase. The core idea revolves around solving a minimax problem:
$$\min_{\theta}\; \mathbb{E}_{(x,y)\sim \mathcal{D}}\left[\max_{\delta \in S} L(\theta,\, x+\delta,\, y)\right]$$
Here, the inner maximization finds the "worst-case" perturbation δ within a specified constraint set S (typically an Lp-norm ball, like S = {δ : ∥δ∥∞ ≤ ϵ}) for a given input x and model parameters θ. The outer minimization then updates the model parameters θ to perform well even on these challenging perturbed inputs. Projected Gradient Descent (PGD) is the standard algorithm used to approximate the inner maximization.
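Concretely, PGD approximates the inner maximization by repeatedly taking a signed-gradient ascent step on the loss and projecting the perturbation back onto S:
$$\delta^{(t+1)} = \Pi_{S}\left(\delta^{(t)} + \alpha \cdot \operatorname{sign}\left(\nabla_{x} L(\theta,\, x+\delta^{(t)},\, y)\right)\right)$$
For the L∞ ball, the projection Π_S is simply elementwise clipping of δ to [−ϵ, ϵ], and α is the step size. This is the update rule implemented in the code later in this section.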
This practical section will guide you through implementing PGD-based adversarial training, assuming you have a standard classification model and dataset ready. We'll use PyTorch for illustration, but the concepts translate directly to TensorFlow or other deep learning frameworks.
First, ensure you have the necessary libraries installed:
# Example using pip
pip install torch torchvision numpy
We'll assume you have a standard setup: a dataset (like CIFAR-10), a data loader, and a neural network model definition (e.g., a ResNet or a simpler CNN).
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torchvision
import torchvision.transforms as transforms
import numpy as np
# Assume device is set (cuda or cpu)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# --- Data Loading (Standard CIFAR-10 example) ---
transform_train = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
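    # ToTensor() scales pixels to [0, 1]; Normalize is omitted here so that epsilon and
    # the [0, 1] clamping used by the attack below operate directly in pixel space.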
    transforms.ToTensor(),
])
trainset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform_train)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=128, shuffle=True, num_workers=2)
# Assume 'Net' is your model class definition
# Example: model = Net().to(device)
# Define loss function and optimizer
# criterion = nn.CrossEntropyLoss()
# optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=5e-4)
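If you do not yet have a Net class like the one referenced in the comments above, the minimal CNN below can stand in. The architecture is an illustrative placeholder rather than a recommendation; any CIFAR-10 classifier (a ResNet-18, for instance) works equally well. It also instantiates the model, loss, and optimizer from the commented example lines so that the later snippets can run as written.
class Net(nn.Module):
    """ A small CNN for 32x32 RGB inputs (illustrative placeholder). """
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 32, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)
        self.fc1 = nn.Linear(64 * 8 * 8, 256)
        self.fc2 = nn.Linear(256, num_classes)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))  # 32x32 -> 16x16
        x = self.pool(F.relu(self.conv2(x)))  # 16x16 -> 8x8
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        return self.fc2(x)

model = Net().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=5e-4)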
The core of adversarial training is generating adversarial examples on-the-fly for each batch. We need a function that takes the current model, a batch of inputs (x) and labels (y), and generates the corresponding adversarial examples (x_adv).
Here's a PGD implementation specifically for generating perturbations during training:
def pgd_attack(model, images, labels, criterion, epsilon=8/255, alpha=2/255, iters=7):
    """
    Constructs PGD adversarial examples on the fly.
    Args:
        model: The model to attack.
        images: Clean input images (batch).
        labels: True labels for the images.
        criterion: Loss function (e.g., nn.CrossEntropyLoss).
        epsilon: Maximum L_infinity perturbation magnitude.
        alpha: Step size for each iteration.
        iters: Number of PGD iterations.
    Returns:
        Adversarial images (batch).
    """
    images = images.detach().clone().to(device)
    labels = labels.detach().clone().to(device)
    # Start with a random perturbation drawn uniformly from [-epsilon, epsilon]
    delta = torch.rand_like(images) * (2 * epsilon) - epsilon
    # Initial perturbed image, clamped to the valid pixel range
    perturbed_images = torch.clamp(images + delta, min=0, max=1).detach()
    for _ in range(iters):
        perturbed_images.requires_grad = True
        outputs = model(perturbed_images)
        model.zero_grad()
        loss = criterion(outputs, labels)
        loss.backward()
        # Ascent step: move the perturbation in the direction of the gradient sign
        grad_sign = perturbed_images.grad.detach().sign()
        delta = delta.detach() + alpha * grad_sign
        # Project delta back into the L_infinity ball
        delta = torch.clamp(delta, -epsilon, epsilon)
        # Re-apply the perturbation and clamp the final image to [0, 1]
        perturbed_images = torch.clamp(images + delta, min=0, max=1).detach()
    return perturbed_images
Key points in the pgd_attack function:
- We detach() the input images and labels to prevent gradients from flowing back into the previous training iteration's computations, and clone() them to avoid modifying the original batch data.
- Each iteration takes a step of size alpha in the direction of the gradient sign of the loss with respect to the perturbed image. Setting perturbed_images.requires_grad = True is important to compute gradients with respect to the current perturbed input.
- After each step, delta is clipped (projected) back into the allowed L∞ ball defined by epsilon.
- Finally, the perturbed_images are clamped to the valid pixel range (e.g., [0, 1] for images scaled by ToTensor()).
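Before wiring the attack into the training loop, it is worth sanity-checking it on a single batch: the perturbation must never exceed ϵ in L∞ norm, and the adversarial images must stay in [0, 1]. A quick check, assuming model and criterion have been instantiated as above, could look like this:
# Quick sanity check of pgd_attack on one batch (assumes model and criterion exist)
images, labels = next(iter(trainloader))
images, labels = images.to(device), labels.to(device)
adv = pgd_attack(model, images, labels, criterion, epsilon=8/255, alpha=2/255, iters=7)
print(f'Max |delta|: {(adv - images).abs().max().item():.4f} (should be <= {8/255:.4f})')
print(f'Pixel range: [{adv.min().item():.3f}, {adv.max().item():.3f}] (should stay within [0, 1])')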
Now, integrate the pgd_attack function into your standard training loop. Instead of computing the loss only on the clean batch, you compute it on the adversarial versions generated by PGD.
def train_adversarial(epoch, model, trainloader, optimizer, criterion, epsilon, alpha, iters):
    """ Performs one epoch of adversarial training. """
    model.train()  # Set model to training mode
    train_loss = 0
    correct = 0
    total = 0
    print(f'\nEpoch: {epoch}')
    for batch_idx, (inputs, targets) in enumerate(trainloader):
        inputs, targets = inputs.to(device), targets.to(device)
        # 1. Generate adversarial examples for the current batch
        adv_inputs = pgd_attack(model, inputs, targets, criterion,
                                epsilon=epsilon, alpha=alpha, iters=iters)
        # 2. Standard training step using adversarial examples
        optimizer.zero_grad()
        outputs = model(adv_inputs)  # Use adversarial inputs
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()
        # --- Logging ---
        train_loss += loss.item()
        _, predicted = outputs.max(1)
        total += targets.size(0)
        correct += predicted.eq(targets).sum().item()
        if batch_idx % 100 == 0:  # Print progress every 100 batches
            print(f'Batch: {batch_idx+1}/{len(trainloader)} | Loss: {train_loss/(batch_idx+1):.3f} | Acc: {100.*correct/total:.3f}% ({correct}/{total})')
# --- Example Training Call ---
# Assume model, optimizer, criterion are defined
N_EPOCHS = 10      # Example
EPSILON = 8/255    # Standard for CIFAR-10 L_inf
ALPHA = 2/255
PGD_ITERS = 7

for epoch in range(N_EPOCHS):
    train_adversarial(epoch, model, trainloader, optimizer, criterion,
                      epsilon=EPSILON, alpha=ALPHA, iters=PGD_ITERS)
    # Add validation/testing steps here (both clean and adversarial)
After training, you must evaluate your model's performance not only on the standard, clean test set but also against adversarial attacks (preferably strong ones, potentially different from the one used during training, like PGD with more steps or C&W).
A typical evaluation involves:
- Measuring accuracy on the clean, unperturbed test set.
- Measuring robust accuracy on the same test set under a strong attack, for example PGD with 20 iterations at the same ϵ used during training.
- Comparing both numbers against a conventionally trained baseline model.
A minimal sketch of such an evaluation loop follows this list.
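The evaluate function below is one way to carry this out; treat it as a sketch rather than a canonical implementation. It reuses the pgd_attack function defined earlier with a higher iteration count and assumes a testloader built like trainloader but from the CIFAR-10 test split with only transforms.ToTensor() (the commented lines show this assumed setup).
def evaluate(model, testloader, criterion, epsilon=8/255, alpha=2/255, iters=20):
    """ Reports clean and PGD-robust accuracy on a test set (illustrative sketch). """
    model.eval()  # Evaluation mode for both clean inference and attack generation
    clean_correct, robust_correct, total = 0, 0, 0
    for inputs, targets in testloader:
        inputs, targets = inputs.to(device), targets.to(device)
        # Clean accuracy: no gradients needed
        with torch.no_grad():
            clean_correct += model(inputs).argmax(1).eq(targets).sum().item()
        # Robust accuracy: attack the batch, then classify the adversarial inputs
        adv_inputs = pgd_attack(model, inputs, targets, criterion,
                                epsilon=epsilon, alpha=alpha, iters=iters)
        with torch.no_grad():
            robust_correct += model(adv_inputs).argmax(1).eq(targets).sum().item()
        total += targets.size(0)
    print(f'Clean accuracy:  {100. * clean_correct / total:.2f}%')
    print(f'Robust accuracy: {100. * robust_correct / total:.2f}% (PGD-{iters}, eps={epsilon:.4f})')

# Assumed test-set setup (names are illustrative; adjust to your project):
# testset = torchvision.datasets.CIFAR10(root='./data', train=False, download=True,
#                                        transform=transforms.ToTensor())
# testloader = torch.utils.data.DataLoader(testset, batch_size=128, shuffle=False, num_workers=2)
# evaluate(model, testloader, criterion)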
You should observe that adversarial training significantly improves accuracy against the attack it was trained on (and often related attacks), but usually comes at the cost of a slight decrease in accuracy on clean, unperturbed data. This is a well-known trade-off.
Figure: Comparison of model accuracy on clean test data versus data perturbed by a PGD attack (20 iterations, ϵ=8/255). Adversarial training improves robust accuracy significantly but slightly reduces clean accuracy.
The choice of attack hyperparameters (epsilon, alpha, and iters) during training significantly impacts the results. Common values for CIFAR-10 with L∞ are ϵ=8/255, α=2/255, and iters=7 or 10. These may need tuning for different datasets or model architectures.
This hands-on guide provides the foundation for implementing adversarial training. By incorporating strong attacks like PGD directly into the learning process, you can build models that are demonstrably more resilient against specific types of adversarial manipulations encountered during inference. Remember to rigorously evaluate your trained models against various attacks to understand their true robustness characteristics.