While adversarial examples are well-studied in computer vision, manipulating text data presents unique challenges and requires different approaches. Unlike the continuous pixel values in images, text is composed of discrete units like characters or words (tokens). This discrete nature means we cannot simply add small, gradient-calculated perturbations as we do with images using methods like FGSM or PGD. A tiny change to a word embedding vector doesn't necessarily correspond to another valid word.
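To see why pixel-style gradient perturbations do not transfer directly, consider the rough sketch below (the embedding table and perturbation sizes are placeholders): adding a small vector to a word's embedding almost never lands exactly on another row of the table, so the perturbed vector no longer decodes to any real token.

```python
import numpy as np

# Toy stand-in for a model's embedding table: 10,000 tokens, 300 dimensions.
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(10_000, 300))

original = embedding_matrix[42]              # embedding of some token
perturbation = 0.01 * rng.normal(size=300)   # small "FGSM-style" step
perturbed = original + perturbation

# Distance from the perturbed vector to every real token embedding.
distances = np.linalg.norm(embedding_matrix - perturbed, axis=1)
nearest = int(np.argmin(distances))
print(f"Nearest token id: {nearest}, distance: {distances[nearest]:.4f}")

# The perturbed vector lies *near* token 42 but is not exactly any row of the
# table, so it cannot be mapped back to a valid discrete token.
```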
Therefore, generating adversarial text involves modifying the discrete sequence of tokens while aiming to achieve two primary goals:
Fooling the model: the modified text should cause the model to make an incorrect prediction (for example, flipping a sentiment classifier from positive to negative).
Preserving meaning: the modification should look natural and keep the original meaning, so that a human reader would still assign the text its original label.
Achieving both goals simultaneously is difficult. Aggressive modifications might easily fool a model but become nonsensical to a human reader. Subtle changes might preserve meaning but fail to alter the model's prediction.
Adversarial attacks on NLP models typically operate at the character, word, or sentence level.
Character-level attacks involve making small changes to the characters within words. Common techniques include:
Insertion: adding an extra character (e.g., "fantastic" → "fantasstic").
Deletion: removing a character (e.g., "fantastic" → "fantstic").
Swapping: transposing two adjacent characters (e.g., "fantastic" → "fanatstic").
Substitution: replacing a character with a neighboring key or a visually similar one (e.g., replacing "i" with a lowercase "l").
Character-level attacks can be subtle but are often caught by spell-checkers or simple preprocessing steps. They can also disrupt tokenization: a misspelled word is frequently split into several unfamiliar subword pieces, which can significantly degrade the model's representation of the input.
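The short sketch below illustrates this effect with a toy adjacent-character swap and an off-the-shelf subword tokenizer; the sentence and the choice of bert-base-uncased are just for illustration.

```python
import random

from transformers import AutoTokenizer

def swap_adjacent_chars(word: str, rng: random.Random) -> str:
    """Transpose two adjacent inner characters of a word (a typo-style perturbation)."""
    if len(word) < 4:
        return word
    i = rng.randint(1, len(word) - 3)  # keep the first and last characters intact
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]

rng = random.Random(0)
original = "The movie was absolutely fantastic"
perturbed = " ".join(swap_adjacent_chars(w, rng) for w in original.split())
print(perturbed)

# A subword tokenizer often splits the misspelled words into several unfamiliar pieces.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize(original))
print(tokenizer.tokenize(perturbed))
```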
Word-level attacks are currently the most common and often more effective than character-level changes. They involve replacing, inserting, or deleting entire words.
Synonym Substitution: This is a widely used technique. Words in the original text are replaced with synonyms. The core challenge lies in selecting synonyms that:
Preserve the sentence's original meaning for a human reader.
Fit the grammatical and semantic context (correct part of speech, natural phrasing).
Actually change the model's prediction once substituted.
Finding suitable candidates often involves resources like WordNet, static word embeddings (e.g., Word2Vec, GloVe), or contextual embeddings from models like BERT, which can suggest words close to the original in embedding space. However, closeness in embedding space doesn't guarantee contextual appropriateness or semantic equivalence; antonyms such as "happy" and "sad" often sit near each other because they appear in similar contexts.
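As a starting point, WordNet can supply raw candidates. The helper below is a minimal sketch: it requires the NLTK WordNet data and deliberately ignores part of speech and context, so its output still needs filtering.

```python
# Requires: pip install nltk, then nltk.download("wordnet") on first use.
from nltk.corpus import wordnet as wn

def wordnet_synonyms(word: str) -> set:
    """Collect single-word synonym candidates for `word` from WordNet."""
    candidates = set()
    for synset in wn.synsets(word):
        for lemma in synset.lemmas():
            name = lemma.name().replace("_", " ")
            if name.lower() != word.lower() and " " not in name:
                candidates.add(name)
    return candidates

print(wordnet_synonyms("fantastic"))
# Candidates still need filtering for part of speech, context, and meaning.
```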
Word Insertion/Deletion: Adding neutral words (e.g., "the", "a", "is") or deleting seemingly unimportant words can sometimes be enough to flip a model's prediction, especially if the model relies heavily on specific keywords or sequence patterns.
Word Reordering: Changing the order of words in a sentence, although this often damages grammar and meaning and is therefore hard to keep inconspicuous.
A typical word-level attack workflow, sketched in code below, involves:
1. Ranking words by importance, for example by measuring how much the model's confidence drops when each word is removed or masked.
2. Generating candidate replacements for the most important words (synonyms from WordNet, neighbors in embedding space, or suggestions from a masked language model).
3. Filtering candidates with constraints on part of speech, semantic similarity, and grammaticality.
4. Substituting candidates, usually greedily, until the model's prediction changes or a modification budget is exhausted.
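The following is a minimal sketch of such a greedy attack. The two callables predict_proba and get_synonyms are placeholders you would supply (for example, a classifier wrapper and the WordNet helper above); real attacks also apply the similarity and grammar constraints noted in the comments.

```python
from typing import Callable, Dict, Iterable

def greedy_synonym_attack(
    text: str,
    true_label: str,
    predict_proba: Callable[[str], Dict[str, float]],  # hypothetical classifier wrapper
    get_synonyms: Callable[[str], Iterable[str]],       # hypothetical candidate generator
    max_changes: int = 3,
) -> str:
    """Greedily swap in synonyms until the prediction flips or the budget runs out."""
    words = text.split()
    base = predict_proba(text)[true_label]

    # Step 1: rank words by importance (confidence drop when the word is removed).
    def importance(i: int) -> float:
        reduced = " ".join(words[:i] + words[i + 1:])
        return base - predict_proba(reduced)[true_label]

    order = sorted(range(len(words)), key=importance, reverse=True)

    changes = 0
    for i in order:
        if changes >= max_changes:
            break
        current_score = predict_proba(" ".join(words))[true_label]
        best_candidate, best_score = None, current_score
        # Steps 2-3: try candidates and keep the one that hurts the model most.
        # (A real attack would also filter candidates for part of speech,
        # semantic similarity, and grammaticality here.)
        for candidate in get_synonyms(words[i]):
            trial = " ".join(words[:i] + [candidate] + words[i + 1:])
            score = predict_proba(trial)[true_label]
            if score < best_score:
                best_candidate, best_score = candidate, score
        if best_candidate is not None:
            words[i] = best_candidate
            changes += 1
            # Step 4: stop as soon as the model no longer predicts the true label.
            probs = predict_proba(" ".join(words))
            if max(probs, key=probs.get) != true_label:
                break
    return " ".join(words)
```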
Sentence-level attacks modify the structure or content of entire sentences, for example by paraphrasing the input or by appending a distracting sentence that is irrelevant to the correct label but misleads the model.
Since the search space for text modifications is vast and discrete, finding a good perturbation usually relies on heuristic search algorithms such as greedy search, beam search, or population-based methods like genetic algorithms.
The objective function during the search usually balances the model's prediction score for the target (incorrect) class against constraints related to semantic similarity, grammaticality, and the number of modifications.
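A rough sketch of such an objective is shown below; the weights and helper functions (predict_proba, semantic_similarity) are hypothetical and would be tuned per attack.

```python
def attack_objective(candidate: str, original: str, target_class: str,
                     predict_proba, semantic_similarity,
                     sim_weight: float = 1.0, edit_weight: float = 0.1) -> float:
    """Score a candidate: reward progress toward the wrong class, penalize
    loss of meaning and the number of edited words."""
    target_score = predict_proba(candidate)[target_class]
    similarity = semantic_similarity(original, candidate)  # e.g. cosine of sentence embeddings
    # Count differing word positions (assumes substitution-only edits of equal length).
    num_edits = sum(a != b for a, b in zip(original.split(), candidate.split()))
    return target_score + sim_weight * similarity - edit_weight * num_edits
```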
As a simplified example, consider a word-level synonym-substitution attack on a sentiment classifier. The attacker identifies an influential word ("fantastic"), gathers candidate replacements, and discards any that would genuinely reverse the sentiment for a human reader (such as "terrible"), since those change the true label rather than fool the model. A remaining near-synonym that preserves the positive meaning but flips the model's prediction is then substituted in to create the adversarial example.
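Reusing the greedy_synonym_attack sketch from earlier, the toy example below makes this concrete. The keyword-based predict_proba stands in for a real classifier and the hand-made synonym list is purely illustrative; the point is that a model relying on a narrow set of keywords is flipped by a meaning-preserving substitution.

```python
def predict_proba(text: str) -> dict:
    """Hypothetical keyword-based sentiment 'model' used only for illustration."""
    positive_words = {"fantastic", "great", "wonderful"}
    hits = sum(w.lower().strip(".,!") in positive_words for w in text.split())
    pos = min(0.95, 0.2 + 0.5 * hits)
    return {"positive": pos, "negative": 1.0 - pos}

def get_synonyms(word: str):
    """Hand-made candidate list standing in for WordNet or embedding neighbors."""
    return {"fantastic": ["marvelous", "terrific", "splendid"]}.get(word.lower(), [])

sentence = "The film was fantastic"
adversarial = greedy_synonym_attack(sentence, "positive", predict_proba, get_synonyms)
print(adversarial)                 # "The film was marvelous"
print(predict_proba(adversarial))  # the toy model now scores the text as negative
```

A human reader still judges "The film was marvelous" as positive, yet the keyword-reliant model flips its prediction, which is exactly the asymmetry a word-level attack exploits.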
Unlike image perturbations, which are measured with Lp norms, evaluating text perturbations requires different metrics:
Semantic similarity: how close the adversarial text stays to the original in meaning, often estimated with sentence-embedding cosine similarity or human judgment.
Perturbation rate: the fraction of words (or characters) that were changed.
Grammaticality and fluency: whether the adversarial text remains well-formed, assessed with grammar checkers or language-model perplexity.
Attack success rate: the proportion of inputs for which the attack actually changes the model's prediction.
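The sketch below computes two of these checks; it assumes the sentence-transformers package is installed and uses a commonly available small encoder, with the example sentences carried over from the attack above.

```python
from sentence_transformers import SentenceTransformer, util

def perturbation_rate(original: str, adversarial: str) -> float:
    """Fraction of word positions that differ (substitution-only edits assumed)."""
    orig, adv = original.split(), adversarial.split()
    return sum(a != b for a, b in zip(orig, adv)) / max(len(orig), 1)

encoder = SentenceTransformer("all-MiniLM-L6-v2")

original = "The film was fantastic"
adversarial = "The film was marvelous"

embeddings = encoder.encode([original, adversarial], convert_to_tensor=True)
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()

print(f"Semantic similarity: {similarity:.3f}")
print(f"Perturbation rate:   {perturbation_rate(original, adversarial):.2f}")
```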
Frameworks like TextAttack provide implementations of various text-based adversarial attacks and evaluation tools, facilitating research and benchmarking in this area. Generating effective and inconspicuous adversarial text remains a significant challenge due to the discrete, structured, and semantic nature of language.
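As an illustration only, an attack recipe might be run roughly as follows. The class names follow TextAttack's documented patterns, but the exact imports, signatures, and model checkpoint are assumptions that may differ across versions.

```python
# Sketch: exact TextAttack imports/signatures may vary by version.
from textattack import Attacker
from textattack.attack_recipes import TextFoolerJin2019
from textattack.datasets import HuggingFaceDataset
from textattack.models.wrappers import HuggingFaceModelWrapper
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "textattack/bert-base-uncased-imdb"  # assumed checkpoint name
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

model_wrapper = HuggingFaceModelWrapper(model, tokenizer)
attack = TextFoolerJin2019.build(model_wrapper)   # a word-level synonym-substitution recipe
dataset = HuggingFaceDataset("imdb", split="test")

attacker = Attacker(attack, dataset)
results = attacker.attack_dataset()               # runs the attack and reports success metrics
```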