Moving beyond attacks that require full knowledge of the model's architecture and parameters (white-box attacks), we now consider scenarios where the attacker has significantly less information. Score-based attacks represent a common and practical black-box setting. Here, the attacker can query the target model f with an input x and receive the model's output confidence scores (e.g., probabilities from a softmax layer), but does not have access to the model's gradients, architecture, or weights.
The objective remains the same as for other evasion attacks: find a minimally perturbed input x_adv = x + δ such that f(x_adv) leads to a misclassification, while ‖δ‖_p ≤ ε. The challenge lies in optimizing the perturbation δ without direct access to the gradient ∇_x L(f(x), y), where L is a suitable loss function (such as cross-entropy) and y is the true or target label.
Gradient-based methods like PGD rely explicitly on the gradient to iteratively move the input towards regions causing misclassification. Optimization-based methods like C&W also typically use gradients within their optimization loop. Without access to these gradients, how can an attacker effectively search for an adversarial perturbation?
The core idea behind most score-based attacks is gradient estimation. If the attacker can approximate the gradient of the loss function with respect to the input, even crudely, they can adapt iterative gradient-based techniques. The model's output scores provide the necessary signal to perform this estimation.
Imagine we want to estimate the gradient ∇_x L(f(x), y) at a point x. Since we only have access to the model's output scores, we can compute the loss L(f(x), y) for any input x we query. The gradient tells us the direction of steepest ascent of the loss function. We can approximate this direction by querying the model at nearby points.
A basic approach uses finite differences. To estimate the partial derivative with respect to the i-th input feature x_i, we query the model at x + h·e_i and x − h·e_i, where e_i is the unit vector along the i-th dimension and h is a small step size. The central difference formula gives the approximation:

∂L/∂x_i ≈ [L(f(x + h·e_i), y) − L(f(x − h·e_i), y)] / (2h)

By doing this for every input dimension, we can assemble an estimate ĝ of the full gradient ∇_x L. However, each gradient estimate then requires 2×D queries to the model, where D is the input dimensionality (e.g., the number of pixels in an image). For high-dimensional inputs like images, this quickly becomes computationally prohibitive.
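As a sketch, this estimator fits in a few lines of NumPy. The `loss_fn` argument here is a hypothetical wrapper (not from the text) that submits one query to the black-box model and returns the scalar loss computed from the returned scores:

```python
import numpy as np

def finite_difference_gradient(loss_fn, x, h=1e-4):
    """Estimate the gradient of a black-box loss via central differences.

    loss_fn is a hypothetical wrapper: it queries the model with an input
    and returns the scalar loss (e.g., cross-entropy against the label)
    computed from the returned confidence scores.
    """
    x = x.astype(np.float64)
    grad = np.zeros_like(x)
    for i in range(x.size):
        e_i = np.zeros_like(x)
        e_i.flat[i] = 1.0
        # Two model queries per input dimension: 2*D queries in total.
        grad.flat[i] = (loss_fn(x + h * e_i) - loss_fn(x - h * e_i)) / (2 * h)
    return grad
```

The loop issues exactly 2×D calls to `loss_fn`, which is precisely the query count that makes this approach impractical for image-sized inputs.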
More sophisticated techniques aim to estimate the gradient more efficiently, reducing the number of required model queries.
Natural Evolution Strategies (NES) is a family of algorithms that uses a randomized search strategy to estimate the gradient. Instead of probing each dimension individually, NES samples multiple points from a search distribution (often a Gaussian centered at the current input x) and uses the loss values at those points to estimate the gradient direction.
The intuition is to perturb the input x in various random directions u_k, query the model to obtain L(f(x + σ·u_k), y) for each perturbation, and then compute a weighted average of the perturbation directions u_k, where the weights are based on the observed losses. Directions that increase the loss significantly receive higher weight. For high-dimensional problems, this often yields a reasonable gradient estimate with far fewer queries than finite differences.
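A minimal NumPy sketch of an NES-style estimator, using antithetic (paired ±) sampling as is common in practice; `loss_fn` is again a hypothetical wrapper that turns one model query into a scalar loss:

```python
import numpy as np

def nes_gradient(loss_fn, x, n_samples=50, sigma=0.01, rng=None):
    """NES-style gradient estimate with antithetic Gaussian sampling.

    Costs 2*n_samples queries per estimate, regardless of the input
    dimensionality. loss_fn is a hypothetical black-box loss wrapper.
    """
    rng = np.random.default_rng(rng)
    grad = np.zeros_like(x, dtype=np.float64)
    for _ in range(n_samples):
        u = rng.standard_normal(x.shape)
        # Weight the direction u by the loss difference of the ± pair.
        grad += (loss_fn(x + sigma * u) - loss_fn(x - sigma * u)) * u
    return grad / (2 * n_samples * sigma)
```

The query budget is 2×n_samples per estimate, independent of the dimension D; the smoothing parameter σ trades off bias against the noise in each loss difference.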
Figure: Flow of gradient estimation using a method like NES. Random directions are sampled, evaluated via model queries, and used to estimate the gradient for the next update step.
The ZOO (Zeroth Order Optimization) attack specifically adapts the highly effective C&W attack to the score-based setting. It typically uses coordinate-wise finite differences (symmetric difference quotients) to estimate the partial derivative ∂L/∂x_i for one coordinate at a time, at a cost of two queries per coordinate. Estimating all coordinates in every step would still require many queries, so ZOO often employs coordinate descent, updating only the most promising coordinate(s) in each step, which makes it more manageable than full finite-difference gradient estimation. The resulting estimates are used within an optimization framework similar to C&W's.
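A simplified sketch of one ZOO-style coordinate-descent step follows; the full attack additionally uses Adam-style per-coordinate updates and a C&W-style objective, both omitted here. `loss_fn` is the same hypothetical black-box loss wrapper as before:

```python
import numpy as np

def zoo_coordinate_step(loss_fn, x, lr=0.01, h=1e-4, n_coords=10, rng=None):
    """One simplified ZOO-style step: estimate partial derivatives for a
    random subset of coordinates (two queries each) and descend on them.

    A sketch only: the original attack pairs these estimates with
    Adam-style updates and a C&W-style attack objective.
    """
    rng = np.random.default_rng(rng)
    idx = rng.choice(x.size, size=min(n_coords, x.size), replace=False)
    x_new = x.copy()
    for i in idx:
        e = np.zeros_like(x)
        e.flat[i] = 1.0
        # Central difference for this single coordinate: two queries.
        g_i = (loss_fn(x + h * e) - loss_fn(x - h * e)) / (2 * h)
        x_new.flat[i] -= lr * g_i  # descend on the attack loss
    return x_new
```

Each step touches only `n_coords` coordinates, so the per-step query cost is 2×n_coords rather than 2×D.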
Techniques like Simultaneous Perturbation Stochastic Approximation (SPSA) also exist, offering another way to estimate gradients with only two queries per step, regardless of input dimensionality, by perturbing all dimensions simultaneously using a random vector.
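An SPSA estimate can be sketched just as briefly: every coordinate is perturbed at once by a random ±1 (Rademacher) vector, so a single estimate costs exactly two queries. `loss_fn` is again a hypothetical black-box loss wrapper:

```python
import numpy as np

def spsa_gradient(loss_fn, x, c=0.01, rng=None):
    """SPSA gradient estimate: two queries total, all dimensions
    perturbed simultaneously by a random +/-1 (Rademacher) vector."""
    rng = np.random.default_rng(rng)
    delta = rng.choice([-1.0, 1.0], size=x.shape)
    diff = loss_fn(x + c * delta) - loss_fn(x - c * delta)
    # Element-wise division by delta (each entry is +/-1).
    return diff / (2 * c * delta)
```

A single SPSA estimate is very noisy; in practice many estimates are averaged, or the noise is absorbed by a decaying step-size schedule across iterations.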
The primary challenge for score-based attacks is query complexity: each gradient estimate can consume hundreds or thousands of model queries, and against a deployed API those queries cost time and money and may trigger rate limiting or detection.
Despite the query cost, score-based attacks demonstrate that even limited access to model outputs can be sufficient to generate effective adversarial examples.
Score-based attacks occupy an important middle ground in the spectrum of black-box attacks. They represent a realistic threat model against deployed systems where confidence scores are available, forcing defenders to consider security beyond just hiding gradients or model weights. Understanding these techniques is essential for evaluating the robustness of models exposed via APIs or other interfaces where output probabilities are accessible.
© 2025 ApX Machine Learning