When an attacker has detailed knowledge of a Large Language Model's architecture and its parameters, a powerful class of techniques known as gradient-based attacks becomes feasible. These methods use the same mathematical machinery that powers the training of LLMs, namely gradients, but repurpose it for adversarial ends. This section provides an overview of how these attacks function under what is often termed a "white-box" scenario.
Gradient-based attacks typically assume a white-box setting. This means the attacker has access to:

- The model's architecture.
- The model's parameters (weights).
- The ability to run the model and compute gradients through it.
While full white-box access might seem like a high bar in many real-world LLM deployments (where models are often accessed via APIs), understanding these attacks is still essential. They represent an upper bound on an attacker's capabilities under strong assumptions and form the basis for more complex black-box attacks, such as transfer attacks, which we'll discuss later. Moreover, if a proprietary model is leaked, or if an organization is performing internal red teaming with full model access, these techniques are directly applicable.
In standard model training, gradients of a loss function with respect to the model's weights, $\nabla_\theta J$, are used to update those weights and improve performance. In a gradient-based attack, the attacker is instead interested in the gradient of a loss function $J$ with respect to the model's input $x$, that is, $\nabla_x J$.
This input gradient, $\nabla_x J(M(x;\theta), y_{\text{target}})$, tells us how to change each component of the input $x$ to achieve the maximum increase (or decrease) in the loss function $J$. The attacker defines $J$ and $y_{\text{target}}$ based on their malicious objective. For example, to force the model to produce a specific harmful continuation, $y_{\text{target}}$ could be that target string and $J$ its negative log-likelihood, which the attacker seeks to minimize.
The core idea is to find a small perturbation $\delta$ such that the adversarial input $x_{\text{adv}} = x + \delta$ achieves the attacker's goal when processed by the LLM, $M(x_{\text{adv}};\theta)$. The perturbation $\delta$ is crafted using the gradient $\nabla_x J$.
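To make this concrete, here is a minimal sketch of computing the input gradient $\nabla_x J$ for a causal LLM, assuming PyTorch and the Hugging Face transformers library, with GPT-2 standing in for the target model; the prompt and target strings are illustrative placeholders. Because tokens are discrete, the gradient is taken with respect to the continuous token embeddings rather than the token ids themselves.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 is a stand-in for the white-box target model.
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model.eval()

prompt = "Please summarize the quarterly report:"    # the input x (illustrative)
target = " Sure, here is the confidential data"      # the attacker's y_target (illustrative)

prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
target_ids = tokenizer(target, return_tensors="pt").input_ids
input_ids = torch.cat([prompt_ids, target_ids], dim=1)

# Tokens are discrete, so differentiate with respect to their embeddings instead.
embeds = model.get_input_embeddings()(input_ids).detach().requires_grad_(True)

# Supervise only the target positions; -100 masks the prompt out of the loss.
labels = input_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100

outputs = model(inputs_embeds=embeds, labels=labels)
loss = outputs.loss       # J: negative log-likelihood of the target continuation
loss.backward()

input_grad = embeds.grad  # gradient of J with respect to the input embeddings
print(input_grad.shape)   # (1, sequence_length, hidden_size)
```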
The following diagram illustrates the general process:
The diagram shows how an original input is modified using gradients derived from a loss function (which reflects an attacker's objective) to create an adversarial input. This adversarial input is then fed to the LLM.
One of the earliest and simplest gradient-based attack methods is the Fast Gradient Sign Method (FGSM). It's a good starting point for understanding how input gradients are used. For an input $x$ and a loss function $J(M(x;\theta), y)$, where $y$ is the true label (in a classification context, or a placeholder for a generation objective), FGSM computes the adversarial example $x_{\text{adv}}$ as:
$$
x_{\text{adv}} = x + \epsilon \cdot \text{sign}\big(\nabla_x J(M(x;\theta), y)\big)
$$

Let's break this down:

- $\epsilon$ is a small positive constant that controls the size of the perturbation, keeping $x_{\text{adv}}$ close to the original input $x$.
- $\nabla_x J(M(x;\theta), y)$ is the gradient of the loss with respect to the input, obtained via backpropagation.
- $\text{sign}(\cdot)$ keeps only the sign of each gradient component, so every input dimension is nudged by exactly $\pm\epsilon$ in the direction that increases the loss.
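As a minimal sketch, assuming PyTorch, a single FGSM step applied to a continuous input tensor (for an LLM, the token-embedding tensor from the earlier sketch) looks like the following; `fgsm_step` and the value of `epsilon` are illustrative choices, not a fixed API.

```python
import torch

def fgsm_step(x: torch.Tensor, input_grad: torch.Tensor, epsilon: float = 0.01) -> torch.Tensor:
    """One FGSM step: x_adv = x + epsilon * sign(grad_x J), which increases the loss J."""
    return x + epsilon * input_grad.sign()

# Usage with the embeddings and gradient from the earlier sketch.
# There, J is the negative log-likelihood of the attacker's target string,
# which the attacker wants to *decrease*, so the sign of the step is flipped:
# adv_embeds = embeds.detach() - 0.01 * input_grad.sign()
```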
The "fast" in FGSM comes from the fact that it only requires one gradient computation. More advanced iterative methods apply this idea multiple times, taking smaller steps in the gradient direction, often leading to more effective but computationally more expensive attacks. Examples include Projected Gradient Descent (PGD) or the Basic Iterative Method (BIM).
The effectiveness of gradient-based attacks stems from the very nature of how deep learning models, including LLMs, learn. They create complex, high-dimensional decision boundaries or generation manifolds. Gradients provide a direct way to find paths that cross these boundaries into regions where the model behaves incorrectly or produces undesired content.
While the primary context for these methods is white-box, the insights gained are significant for several reasons:

- They establish an upper bound on what an attacker with full model access can achieve, which is the worst case a defense must withstand.
- Adversarial inputs crafted with gradients on one model often transfer to other models, forming the basis of the black-box transfer attacks discussed later.
- They are directly applicable during internal red teaming, or if a proprietary model's weights are leaked.
In the following sections, we will see how these foundational ideas are extended to create more sophisticated evasion and exfiltration techniques, even when full white-box access is not available.