We now turn to methods that alter existing data to produce new variants: data masking and data perturbation. These techniques are particularly useful when you need to expand your dataset, protect sensitive information, or introduce controlled variations to make your models less sensitive to minor input changes. While some generation methods create text from scratch, masking and perturbation often start with existing data or predefined templates.
Data masking is the process of replacing sensitive or identifiable information in your text data with generic placeholders, anonymized values, or broader categories. The primary goal is often privacy preservation, allowing you to use data for training LLMs without exposing confidential details. It can also be a way to generate synthetic data where specific entities are abstracted away, helping the model focus on patterns rather than specifics.
Substitution with Placeholders or Generics: This is a widely used approach where specific entities such as names, addresses, phone numbers, or social security numbers are replaced with predefined tokens like [NAME], [LOCATION], [ORGANIZATION], or [DATE], or with plausible values drawn from a generated list. For example, the sentence "John Smith, residing at 123 Main St, Anytown, called us on 05/15/2024 regarding order #XN789." could be masked to "[NAME], residing at [ADDRESS], [CITY], called us on [DATE] regarding order #[ORDER_ID]."
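A minimal sketch of this kind of substitution, using regular expressions for the structured fields in the example above. The patterns and placeholder names here are illustrative assumptions, not production-grade PII detectors; real pipelines typically use a named entity recognition model to find names and addresses, which simple patterns cannot reliably match.

```python
import re

# Illustrative (pattern, placeholder) pairs; not exhaustive PII coverage.
MASKING_PATTERNS = [
    (re.compile(r"\b\d{2}/\d{2}/\d{4}\b"), "[DATE]"),      # dates like 05/15/2024
    (re.compile(r"#[A-Z0-9]+\b"), "#[ORDER_ID]"),           # order IDs like #XN789
    (re.compile(r"\b\d{3}-\d{3}-\d{4}\b"), "[PHONE]"),      # US-style phone numbers
]

def mask_text(text):
    # Apply each pattern in turn, replacing matches with its placeholder.
    for pattern, placeholder in MASKING_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text

sentence = "John Smith called us on 05/15/2024 regarding order #XN789."
print(mask_text(sentence))
# The date and order ID are masked; the name would need an NER step.
```

Regex rules work well for fields with a fixed format (dates, IDs, phone numbers) but fail on free-form entities, which is why masking pipelines usually combine both approaches.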
Generalization: Instead of exact replacement, you can generalize specific values to broader categories. This reduces precision but can maintain the data's structural integrity and usefulness for certain tasks.
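Generalization can be sketched with simple bucketing functions. The decade-wide age ranges and year-only dates below are illustrative choices; the right granularity depends on how much precision your downstream task actually needs.

```python
def generalize_age(age):
    # Map an exact age to a decade-wide range, e.g. 37 -> "30-39".
    lower = (age // 10) * 10
    return f"{lower}-{lower + 9}"

def generalize_date(date_str):
    # Keep only the year from an MM/DD/YYYY string, e.g. "05/15/2024" -> "2024".
    return date_str.split("/")[-1]

print(generalize_age(37))             # 30-39
print(generalize_date("05/15/2024"))  # 2024
```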
Redaction: In some cases, information might be completely removed. While this is more directly about anonymization than pure synthesis, the resulting data with missing pieces can be considered a type of synthetic variant. For instance, redacting all mentions of specific company names in a dataset of customer reviews.
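The company-name redaction mentioned above can be sketched as follows. The company list is an illustrative assumption; a real system would likely source it from an entity recognizer or a maintained blocklist.

```python
import re

# Hypothetical list of company names to strip from the text.
COMPANY_NAMES = ["Acme Corp", "Globex"]

def redact_companies(text):
    # Remove each company mention entirely, rather than replacing it.
    for name in COMPANY_NAMES:
        text = re.sub(re.escape(name), "", text)
    # Collapse the doubled spaces left behind by the removal.
    return re.sub(r"\s{2,}", " ", text).strip()

print(redact_companies("I bought this from Acme Corp last week."))
# "I bought this from last week."
```

Note that the result is grammatically awkward, which illustrates the trade-off: redaction is simple and safe, but it damages fluency more than substitution does.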
A balance is important: aggressive masking protects privacy effectively but might remove too much useful contextual information, potentially degrading model performance. The aim is to mask only what's necessary for your specific goals while retaining data utility.
Data perturbation involves making small, often random, alterations to existing text samples to create new, slightly different versions. The objective is to increase the diversity of your dataset and, by doing so, potentially make your LLM less sensitive to minor input variations. Think of it as a way to "jiggle" your data points to cover more of the input space around your existing samples.
Synonym Replacement: Words in a sentence are replaced with their synonyms. This can introduce lexical diversity.
Random Insertion: Randomly insert words (often common words or synonyms of adjacent words) into the sentence.
Random Deletion: Randomly remove words from the sentence.
Random Swapping: Randomly swap the positions of two words in the sentence.
Character-Level Perturbations: Introduce small changes at the character level, such as swapping adjacent characters, randomly inserting or deleting a character, or substituting a character to simulate common typos.
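The deletion, swapping, and character-level techniques above can be sketched as small standalone functions. The probabilities and word choices here are illustrative assumptions.

```python
import random

def random_deletion(words, p=0.1):
    # Drop each word independently with probability p; always keep at least one word.
    kept = [w for w in words if random.random() > p]
    return kept if kept else [random.choice(words)]

def random_swap(words):
    # Swap the positions of two randomly chosen words.
    if len(words) < 2:
        return words
    i, j = random.sample(range(len(words)), 2)
    words = words[:]  # copy so the caller's list is untouched
    words[i], words[j] = words[j], words[i]
    return words

def char_swap(word):
    # Swap two adjacent characters, simulating a simple typo.
    if len(word) < 2:
        return word
    i = random.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]

words = "the quick brown fox jumps".split()
print(" ".join(random_deletion(words, p=0.2)))
print(" ".join(random_swap(words)))
print(char_swap("perturbation"))
```

Because these edits are random, it is worth spot-checking the output: deleting or swapping the wrong word can change a sentence's meaning or make it ungrammatical.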
The following diagram illustrates some of these perturbation possibilities on a simple sentence.
Figure: An illustration of perturbation techniques such as synonym replacement, deletion, word swapping, and insertion applied to an original sentence.
Let's illustrate a very basic synonym replacement. In a real scenario, you would use a more sophisticated thesaurus, perhaps integrated with an NLP library, or a word embedding model to find appropriate synonyms based on context.
import random

def simple_synonym_perturbation(text, synonym_map, probability=0.1):
    words = text.split()
    perturbed_words = []
    for word in words:
        # Check if word is in map and if we should perturb based on probability
        if random.random() < probability and word.lower() in synonym_map:
            synonyms = synonym_map[word.lower()]
            # Preserve original case if possible, or just use the synonym as is
            chosen_synonym = random.choice(synonyms)
            if word.istitle():
                perturbed_words.append(chosen_synonym.title())
            elif word.isupper():
                perturbed_words.append(chosen_synonym.upper())
            else:
                perturbed_words.append(chosen_synonym)
        else:
            perturbed_words.append(word)
    return " ".join(perturbed_words)

# A very small, illustrative synonym map (all lowercase)
example_synonym_map = {
    "quick": ["fast", "swift", "rapid"],
    "happy": ["joyful", "pleased", "content"],
    "big": ["large", "huge", "enormous"]
}

original_sentence = "The Quick dog was very Happy with the BIG bone."
# Apply perturbation with a 30% chance for eligible words
perturbed_sentence = simple_synonym_perturbation(original_sentence, example_synonym_map, probability=0.3)
print(f"Original: {original_sentence}")
print(f"Perturbed: {perturbed_sentence}")

# Possible output:
# Original: The Quick dog was very Happy with the BIG bone.
# Perturbed: The Swift dog was very Pleased with the HUGE bone.
This code snippet provides a rudimentary way to perform synonym replacement, attempting to preserve case. For more advanced applications, consider using NLP libraries that offer richer thesauri or word embedding similarity for contextually appropriate substitutions.
Both masking and perturbation are useful tools, but they require careful application: over-aggressive perturbation can distort meaning or produce ungrammatical text, just as over-aggressive masking can strip useful context. The goal is to generate synthetic data that is genuinely beneficial for your LLM, not merely different.
In practice, you will often combine these techniques with other generation methods discussed in this chapter. For example, you might use an LLM to generate initial text (as covered in "Using LLMs for Synthetic Sample Generation") and then apply masking or perturbation to further diversify the output or prepare it for specific privacy requirements.
As we continue, remember that these core techniques are building blocks. The effectiveness comes from combining them thoughtfully and evaluating their impact on your LLM's performance and behavior. The hands-on exercise later in this chapter will give you a chance to work with an LLM API for generation, and you can consider how these masking and perturbation ideas might complement such outputs.
© 2025 ApX Machine Learning