While gradient-based attacks like Projected Gradient Descent (PGD) provide a strong baseline for generating adversarial examples, they primarily focus on finding any adversarial example within a predefined perturbation budget (ϵ-ball). They don't necessarily find the example with the smallest possible perturbation that achieves misclassification. For situations where minimizing the perceptibility of the perturbation is paramount, or when evaluating the absolute breaking point of a model, optimization-based attacks offer a more direct approach.
The Carlini & Wagner (C&W) attacks, introduced by Nicholas Carlini and David Wagner, represent a powerful class of optimization-based evasion attacks. Instead of just finding an adversarial example within a constraint, C&W methods frame the search as a formal optimization problem: find the minimum-distortion perturbation δ such that x_adv = x + δ is misclassified and satisfies input constraints (e.g., pixel values remain valid).
Formulating the Optimization Problem
The core idea is to minimize the perturbation size, measured by an Lp norm, subject to the condition that the model misclassifies the perturbed input. Let x be the original input, x′ be the adversarial input (x_adv), and f be the model (often represented by its logits Z(x′) before the final softmax). The goal is to find x′ that solves:
minimize ||x′ - x||_p    subject to    f(x′) is misclassified and x′ ∈ [0,1]^n
Directly enforcing the "misclassified" constraint is difficult in standard optimization frameworks. C&W cleverly replace this constraint with a specially designed loss function, g(x′), which is minimized when the model misclassifies x′. The optimization problem then becomes:
minimize ||x′ - x||_p + c · g(x′)
Here:
- ||x′ - x||_p is the distortion term, measuring the distance between the original and adversarial inputs using an Lp norm (commonly L0, L2, or L∞).
- g(x′) is the classification loss function.
- c is a balancing constant, carefully chosen (often via binary search) to ensure the classification constraint is met while keeping the distortion low. A larger c places more emphasis on achieving misclassification.
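To make the role of c concrete, below is a minimal sketch of that outer binary search in plain Python. The helpers run_attack(c) (which runs the inner optimization for a given constant) and is_successful(x_adv) (which checks whether the classification goal is met) are hypothetical placeholders rather than part of any specific library:

```python
def search_c(run_attack, is_successful, c_init=1e-2, c_max=1e10, search_steps=9):
    """Binary search for a small constant c that still produces a successful attack.

    run_attack(c)        -> candidate adversarial example for constant c (assumed helper)
    is_successful(x_adv) -> True if x_adv meets the (mis)classification goal (assumed helper)
    """
    c_lo, c_hi, c = 0.0, c_max, c_init
    best = None
    for _ in range(search_steps):
        x_adv = run_attack(c)
        if is_successful(x_adv):
            best, c_hi = x_adv, c            # success: shrink c to favour lower distortion
        else:
            c_lo = c                         # failure: grow c to favour misclassification
        # Bisect once an upper bound on c is known; otherwise keep growing c geometrically.
        c = (c_lo + c_hi) / 2 if c_hi < c_max else c * 10
    return best
```

Each successful run keeps the adversarial example and tries a smaller c; each failure pushes c upward, so the search converges on roughly the smallest c that still flips the prediction.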
The C&W Classification Loss
The genius of the C&W attack lies in the design of g(x′). For a targeted attack, where the goal is to make the model classify x′ as a specific target class t, a common loss function is:
g(x′) = max( max_{i≠t} Z(x′)_i - Z(x′)_t , -κ )
Let's break this down:
- Z(x′)_i is the logit value for class i.
- max_{i≠t} Z(x′)_i finds the highest logit value among all classes except the target class t.
- Z(x′)_t is the logit value for the target class t.
- The difference max_{i≠t} Z(x′)_i - Z(x′)_t measures how much "stronger" the best non-target class is compared to the target class. We want this difference to be negative and large in magnitude, so that the target class t becomes the most likely prediction.
- The outer max(…, -κ) introduces a confidence margin κ ≥ 0. The loss bottoms out at -κ only once Z(x′)_t exceeds every other logit by at least κ, so minimizing it encourages the target logit to surpass the next-highest logit by this margin, increasing the confidence of the misclassification. Setting κ = 0 simply requires the target class logit to be the highest.
For an untargeted attack, where the goal is simply to make the model misclassify x′ away from its original class y_true, the loss can be defined as:
g(x′) = max( Z(x′)_{y_true} - max_{i≠y_true} Z(x′)_i , -κ )
Here, the goal is to minimize the loss by making the true class logit Z(x′)_{y_true} smaller than the highest logit of any other class by at least κ.
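To make both loss functions concrete, here is a minimal PyTorch sketch that computes them directly from a batch of logits Z(x′). The function names and the batched layout are illustrative choices, not anything prescribed by the original paper:

```python
import torch
import torch.nn.functional as F

def cw_targeted_loss(logits, target, kappa=0.0):
    """max( max_{i != t} Z(x')_i - Z(x')_t , -kappa ), computed per example."""
    onehot = F.one_hot(target, num_classes=logits.size(1)).bool()
    target_logit = logits[onehot]                                         # Z(x')_t
    best_other = logits.masked_fill(onehot, float("-inf")).max(1).values  # max over i != t
    return torch.clamp(best_other - target_logit, min=-kappa)

def cw_untargeted_loss(logits, true_label, kappa=0.0):
    """max( Z(x')_y_true - max_{i != y_true} Z(x')_i , -kappa ), computed per example."""
    onehot = F.one_hot(true_label, num_classes=logits.size(1)).bool()
    true_logit = logits[onehot]                                           # Z(x')_y_true
    best_other = logits.masked_fill(onehot, float("-inf")).max(1).values  # max over i != y_true
    return torch.clamp(true_logit - best_other, min=-kappa)
```

Note that both functions operate on raw logits rather than softmax probabilities, in line with the use of Z(x′) above.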
Solving the Problem
C&W attacks typically use standard gradient-based optimizers like Adam to minimize the combined objective function ||x′ - x||_p + c · g(x′). However, a technical detail arises: the input x′ must often satisfy box constraints (e.g., pixel values in [0,1]). Directly applying gradient descent and clipping x′ at each step can hinder convergence.
C&W proposed using a change of variables. For example, instead of optimizing x′ directly, they optimize a variable w where x′ = ½ (tanh(w) + 1). Since the output of tanh is always in (-1, 1), x′ is guaranteed to be in (0, 1), satisfying the box constraints naturally. The optimization is performed with respect to w, and the distortion ||x′ - x||_p and classification loss g(x′) are calculated based on the transformed x′.
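Putting the pieces together, the following is a minimal sketch of the targeted C&W L2 inner optimization in PyTorch. It assumes model maps a batch of inputs in [0,1] to logits, reuses the cw_targeted_loss sketch from above, and uses illustrative hyperparameters; the binary search over c described earlier would wrap this function:

```python
import torch

def cw_l2_attack(model, x, target, c=1.0, kappa=0.0, steps=1000, lr=0.01):
    """Minimal targeted C&W L2 sketch: optimize w with Adam, with x' = 0.5 * (tanh(w) + 1)."""
    # Initialise w so that x' starts (almost exactly) at x. atanh blows up where its
    # argument is +/-1, i.e. where pixels are exactly 0 or 1, so nudge it inward first.
    w = torch.atanh((2 * x - 1).clamp(-1 + 1e-6, 1 - 1e-6)).detach().requires_grad_(True)
    optimizer = torch.optim.Adam([w], lr=lr)

    for _ in range(steps):
        x_adv = 0.5 * (torch.tanh(w) + 1)             # box constraint holds by construction
        logits = model(x_adv)

        l2 = ((x_adv - x) ** 2).flatten(1).sum(1)     # squared L2 distortion per example
        g = cw_targeted_loss(logits, target, kappa)   # classification term from the earlier sketch

        loss = (l2 + c * g).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    return (0.5 * (torch.tanh(w) + 1)).detach()
```

A fuller implementation would also keep the lowest-distortion successful example seen during the run rather than simply returning the final iterate.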
C&W Attack Variants (L2, L∞, L0)
The C&W framework was developed for different Lp norms:
- C&W L2 Attack: This is the most common variant. It minimizes the Euclidean distance ||x′ - x||_2, typically using the loss function described above and the tanh change-of-variables trick. It's known for finding very low-distortion L2 adversarial examples, often outperforming PGD in terms of perturbation magnitude, albeit at a higher computational cost.
- C&W L∞ Attack: This variant aims to minimize the maximum change to any single feature, ||x′ - x||_∞. Because the L∞ norm is poorly suited to direct gradient-based minimization, the original attack replaces it with a penalty on per-feature changes above a threshold that is gradually shrunk; in practice, iterative PGD-style steps combined with the C&W loss function are also common (a sketch of that combination appears after this list). Either way, the aim is to find the smallest ϵ for which an L∞ attack succeeds.
- C&W L0 Attack: This attack minimizes the number of features (e.g., pixels) changed, ||x′ - x||_0. It's computationally harder. The approach typically works by repeatedly running the L2 attack, iteratively freezing features that contribute little to the adversarial objective at their original values, and optimizing only the remaining ones.
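As promised above, here is a hedged sketch of the PGD-style combination: iterative L∞ steps that descend on the untargeted C&W margin loss (cw_untargeted_loss from the earlier sketch). This is not the original C&W L∞ procedure, and ϵ, α, and the step count are illustrative; to estimate the smallest successful ϵ, one would repeat the loop over a range of ϵ values:

```python
import torch

def linf_steps_with_cw_loss(model, x, true_label, eps=8/255, alpha=2/255, steps=40, kappa=0.0):
    """PGD-style L-infinity iteration that minimizes the untargeted C&W margin loss."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        g = cw_untargeted_loss(model(x_adv), true_label, kappa).sum()
        grad, = torch.autograd.grad(g, x_adv)
        with torch.no_grad():
            x_adv = x_adv - alpha * grad.sign()        # descend on the margin loss
            x_adv = x + (x_adv - x).clamp(-eps, eps)   # project back into the eps-ball around x
            x_adv = x_adv.clamp(0.0, 1.0)              # keep pixel values valid
    return x_adv.detach()
```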
The key difference between the PGD and C&W (L2) approaches is this: PGD iteratively steps and projects to stay within a fixed-size Lp ball around the original input x, seeking any misclassification. C&W instead directly minimizes the L2 distance from x while satisfying a misclassification loss, effectively finding the closest point across the decision boundary.
Strengths and Trade-offs
Strengths:
- Effectiveness: C&W attacks, particularly the L2 variant, are highly effective and often find adversarial examples with smaller perturbations than PGD does for a given model.
- Circumvents Gradient Masking: Because C&W uses an optimization objective that relies on logits and includes the distortion term, it can sometimes overcome defenses that attempt to mask or obfuscate gradients, which might fool simpler gradient-based methods.
- Flexibility: The framework can be adapted for targeted/untargeted attacks and different Lp norms.
Trade-offs:
- Computational Cost: C&W attacks are significantly slower than FGSM or PGD. They involve an iterative optimization process, and finding a good constant c typically requires a binary search, which means running the full attack multiple times.
- Complexity: The implementation is more involved than basic gradient-based methods, requiring careful handling of the optimization objective and constraints.
Despite their computational cost, C&W attacks serve as a critical benchmark for evaluating the robustness of machine learning models and defenses. Their ability to find near-minimal perturbations makes them a standard tool for security evaluations. If a defense holds up against a strong C&W attack, that provides greater confidence in its effectiveness than testing only against faster methods like FGSM or PGD.