Alright, let's get our hands dirty and implement some of the data poisoning and backdoor attacks we've discussed. Theory is one thing, but seeing these attacks in action helps solidify understanding. In this section, we'll walk through practical examples using Python and common machine learning libraries. We'll start with basic poisoning to degrade model performance, move to a targeted attack, and then implement a simple backdoor.
Keep in mind that these examples are illustrative. Real-world poisoning often requires more sophisticated optimization techniques to craft subtle and effective poisoned data points, especially for complex models and datasets. However, these foundational examples demonstrate the core mechanics.
We'll use Scikit-learn for simplicity in demonstrating the concepts, focusing on the data manipulation aspect rather than complex model architectures. The same principles extend to deep learning models, where attacks are often implemented with frameworks such as ART (Adversarial Robustness Toolbox), which provides more specialized tooling.
First, ensure you have the necessary libraries installed. We'll primarily use numpy for numerical operations and scikit-learn for datasets and models.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt # Optional for visualization
Let's generate a simple synthetic dataset for our initial experiments. This allows us to control the scenario easily.
# Generate a binary classification dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=42)
# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Training set size: {X_train.shape[0]}")
print(f"Test set size: {X_test.shape[0]}")
Our first goal is simple: reduce the overall accuracy of a model trained on the data. The most straightforward way to do this during the training phase is by injecting mislabeled examples. This is often called label flipping.
Objective: Degrade the model's performance on clean test data.
Method: Randomly select a percentage of the training data and flip their labels.
# --- Poisoning Setup ---
poison_percentage = 0.1 # Poison 10% of the training data
n_poison = int(poison_percentage * len(X_train))
print(f"Injecting {n_poison} poisoned samples.")
# Select random indices to poison
poison_indices = np.random.choice(len(X_train), size=n_poison, replace=False)
# Create copies of the training data to modify
X_train_poisoned = np.copy(X_train)
y_train_poisoned = np.copy(y_train)
# Flip the labels for the selected indices
# For binary classification, flip 0 to 1 and 1 to 0
y_train_poisoned[poison_indices] = 1 - y_train[poison_indices]
# --- Training and Evaluation ---
# Train model on clean data
model_clean = LogisticRegression(random_state=42, max_iter=1000)
model_clean.fit(X_train, y_train)
y_pred_clean = model_clean.predict(X_test)
acc_clean = accuracy_score(y_test, y_pred_clean)
print(f"Accuracy on clean data: {acc_clean:.4f}")
# Train model on poisoned data
model_poisoned = LogisticRegression(random_state=42, max_iter=1000)
model_poisoned.fit(X_train_poisoned, y_train_poisoned)
y_pred_poisoned = model_poisoned.predict(X_test)
acc_poisoned = accuracy_score(y_test, y_pred_poisoned)
print(f"Accuracy after label flipping ({poison_percentage*100}%): {acc_poisoned:.4f}")
You should observe a noticeable drop in accuracy for the model trained on the poisoned dataset. The magnitude of the drop depends on the poisoning percentage, the dataset complexity, and the model's capacity. This simple attack directly conflicts with the learning objective by providing incorrect supervision signals.
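To make that dependence concrete, here is a short follow-up sketch that retrains the model at several flip rates and prints the resulting test accuracy. It reuses the variables and imports defined above; the specific rates are arbitrary choices for illustration.
# Sweep several label-flipping rates and observe how test accuracy degrades
for rate in [0.0, 0.05, 0.1, 0.2, 0.3, 0.4]:
    n_flip = int(rate * len(X_train))
    flip_idx = np.random.choice(len(X_train), size=n_flip, replace=False)
    y_flipped = np.copy(y_train)
    y_flipped[flip_idx] = 1 - y_flipped[flip_idx]  # flip the selected labels
    model = LogisticRegression(random_state=42, max_iter=1000)
    model.fit(X_train, y_flipped)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"Flip rate {rate:.0%}: test accuracy {acc:.4f}")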
Now, let's try a more focused attack. Instead of just degrading overall performance, we want the model to misclassify a specific instance or type of instance. This is an integrity attack. Crafting optimal targeted poisons is complex, often involving optimization to find points that maximally influence the decision boundary near the target. Here, we'll simulate a simpler version.
Objective: Cause a specific clean test sample to be misclassified after training on poisoned data.
Method: Identify a target test sample. Create poison points by slightly modifying copies of training samples from a different class and labeling them as the target's true class. The idea is to subtly shift the decision boundary near the target sample.
# --- Target Selection ---
# Select a specific test instance as the target
target_index = 0
X_target = X_test[target_index].reshape(1, -1)
y_target_true = y_test[target_index]
print(f"Target instance index: {target_index}, True Label: {y_target_true}")
# Check its classification by the clean model (it should be correct ideally)
y_target_pred_clean = model_clean.predict(X_target)
print(f"Target prediction (clean model): {y_target_pred_clean[0]}")
# If already misclassified by clean model, choose another target for a clearer demo
if y_target_pred_clean[0] != y_target_true:
print("Target already misclassified by clean model. Pick another or proceed with caution.")
# Find a correctly classified target
for i in range(len(X_test)):
target_index = i
X_target = X_test[target_index].reshape(1, -1)
y_target_true = y_test[target_index]
y_target_pred_clean = model_clean.predict(X_target)
if y_target_pred_clean[0] == y_target_true:
print(f"New target index: {target_index}, True Label: {y_target_true}")
print(f"Target prediction (clean model): {y_target_pred_clean[0]}")
break
# --- Crafting Poison Points ---
n_poison_targeted = 5 # Number of poison points to craft
target_class_label = y_target_true
source_class_label = 1 - target_class_label
# Find training samples from the 'source' class (the class we *don't* want the target to be)
source_indices = np.where(y_train == source_class_label)[0]
# Select a few source samples randomly
crafting_indices = np.random.choice(source_indices, size=n_poison_targeted, replace=False)
# Create poison points: slightly perturb source samples and label them as the target class
# This is a heuristic. More advanced methods optimize the perturbation.
X_poison_crafted = []
y_poison_crafted = []
perturbation_scale = 0.1 # Small perturbation
for idx in crafting_indices:
X_source_sample = X_train[idx]
# Simple perturbation: add small noise towards the target (or just random noise)
perturbation = (X_target.flatten() - X_source_sample) * perturbation_scale + (np.random.rand(X_source_sample.shape[0]) - 0.5) * 0.05
X_p = X_source_sample + perturbation
X_poison_crafted.append(X_p)
y_poison_crafted.append(target_class_label) # Label as the target's true class
X_poison_crafted = np.array(X_poison_crafted)
y_poison_crafted = np.array(y_poison_crafted)
# --- Training with Targeted Poison ---
# Add crafted poisons to the original training data
X_train_targeted_poison = np.vstack((X_train, X_poison_crafted))
y_train_targeted_poison = np.hstack((y_train, y_poison_crafted))
# Train model on this specifically poisoned data
model_targeted_poison = LogisticRegression(random_state=42, max_iter=1000)
model_targeted_poison.fit(X_train_targeted_poison, y_train_targeted_poison)
# --- Evaluation ---
# Check overall accuracy (might not drop much)
y_pred_targeted_overall = model_targeted_poison.predict(X_test)
acc_targeted_overall = accuracy_score(y_test, y_pred_targeted_overall)
print(f"Overall accuracy (targeted poison): {acc_targeted_overall:.4f}")
# Check the prediction for the specific target instance
y_target_pred_poisoned = model_targeted_poison.predict(X_target)
print(f"Target prediction (poisoned model): {y_target_pred_poisoned[0]} (True Label: {y_target_true})")
if y_target_pred_poisoned[0] != y_target_true:
print("Targeted poisoning successful: Target instance misclassified.")
else:
print("Targeted poisoning failed: Target instance still correctly classified.")
In this scenario, the overall accuracy might remain relatively high, but the specific goal of misclassifying the chosen target instance might be achieved. This demonstrates the stealthier nature of integrity attacks compared to simple availability attacks. Success depends heavily on the poison crafting strategy, the number of poisons, and the model's learning dynamics.
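One useful diagnostic, whether or not the misclassification succeeded, is to check how far the crafted poisons shifted the model's confidence on the target instance. The sketch below assumes the variables from the targeted-poisoning code above are still in scope.
# Compare the models' confidence in the target's true class
# predict_proba columns follow model.classes_, which is [0, 1] for this binary problem
p_clean = model_clean.predict_proba(X_target)[0, y_target_true]
p_poisoned = model_targeted_poison.predict_proba(X_target)[0, y_target_true]
print(f"P(true class | clean model):    {p_clean:.4f}")
print(f"P(true class | poisoned model): {p_poisoned:.4f}")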
Backdoor attacks implant a hidden trigger. The model performs normally on clean data but misbehaves when the trigger pattern is present in the input. Let's simulate this using the MNIST dataset, as visual triggers are intuitive.
Objective: Train a model that classifies digits correctly, but classifies any digit with a specific pixel pattern (the trigger) as a chosen target class (e.g., class '7').
Method: Stamp a small, fixed pixel pattern (the trigger) onto a fraction of the training images, relabel those triggered images as the target class, and train the model on the combined dataset. We then evaluate both clean test accuracy and the rate at which triggered test inputs are classified as the target class.
# We'll need tensorflow/keras for MNIST and a simple model
# Or alternatively, use scikit-learn's fetch_openml('mnist_784')
try:
import tensorflow as tf
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten, Input
from tensorflow.keras.utils import to_categorical
USE_TF = True
except ImportError:
print("TensorFlow not found. Skipping Backdoor example (or adapt using Scikit-learn's MNIST).")
USE_TF = False
if USE_TF:
# --- Load MNIST Data ---
(X_train_mnist, y_train_mnist), (X_test_mnist, y_test_mnist) = mnist.load_data()
# Normalize pixel values to [0, 1]
X_train_mnist = X_train_mnist.astype('float32') / 255.0
X_test_mnist = X_test_mnist.astype('float32') / 255.0
# Flatten images for a simple Dense model (or use CNN)
X_train_flat = X_train_mnist.reshape((X_train_mnist.shape[0], -1))
X_test_flat = X_test_mnist.reshape((X_test_mnist.shape[0], -1))
# One-hot encode labels
y_train_cat = to_categorical(y_train_mnist, 10)
y_test_cat = to_categorical(y_test_mnist, 10)
# --- Backdoor Setup ---
target_class = 7
target_class_cat = to_categorical([target_class], 10)[0]
trigger_size = 3 # 3x3 pixel trigger
trigger_pos = (24, 24) # Bottom-right corner
trigger_value = 1.0 # White pixels
def apply_trigger(images):
images_triggered = np.copy(images)
x, y = trigger_pos
# Apply trigger directly on 2D image shape before flattening
images_2d = images_triggered.reshape((-1, 28, 28))
images_2d[:, x:x+trigger_size, y:y+trigger_size] = trigger_value
return images_2d.reshape((-1, 28*28)) # Return flattened
# --- Create Backdoored Training Samples ---
backdoor_percentage = 0.05 # Use 5% of data for backdooring
n_backdoor_samples = int(backdoor_percentage * len(X_train_flat))
# Select random samples to inject backdoor (choose samples NOT of the target class initially)
potential_indices = np.where(y_train_mnist != target_class)[0]
backdoor_indices = np.random.choice(potential_indices, size=n_backdoor_samples, replace=False)
X_backdoor = X_train_flat[backdoor_indices]
# Apply the trigger (apply_trigger accepts flat or 2D input and returns flattened images)
X_backdoor_triggered = apply_trigger(X_backdoor)
# Set label to target class
y_backdoor_target = np.array([target_class_cat] * n_backdoor_samples)
# --- Combine Datasets ---
X_train_backdoored = np.vstack((X_train_flat, X_backdoor_triggered))
y_train_backdoored = np.vstack((y_train_cat, y_backdoor_target))
# Shuffle the combined dataset
shuffle_idx = np.random.permutation(len(X_train_backdoored))
X_train_backdoored = X_train_backdoored[shuffle_idx]
y_train_backdoored = y_train_backdoored[shuffle_idx]
# --- Train Backdoored Model ---
model_backdoor = Sequential([
Input(shape=(784,)),
Dense(128, activation='relu'),
Dense(10, activation='softmax')
])
model_backdoor.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
print("Training backdoored model...")
model_backdoor.fit(X_train_backdoored, y_train_backdoored, epochs=5, batch_size=128, verbose=1, validation_split=0.1)
# --- Evaluate ---
# 1. Accuracy on clean test data
loss_clean, acc_clean_mnist = model_backdoor.evaluate(X_test_flat, y_test_cat, verbose=0)
print(f"Accuracy on clean MNIST test data: {acc_clean_mnist:.4f}")
# 2. Attack Success Rate (ASR)
# Apply trigger to test images NOT originally of the target class
test_indices_not_target = np.where(y_test_mnist != target_class)[0]
X_test_attack = X_test_flat[test_indices_not_target]
y_test_attack_orig = y_test_cat[test_indices_not_target] # Original labels (one-hot)
# Apply the trigger to the attack set
X_test_attack_triggered = apply_trigger(X_test_attack)
# Predict on triggered images
y_pred_triggered = model_backdoor.predict(X_test_attack_triggered)
predicted_classes_triggered = np.argmax(y_pred_triggered, axis=1)
# Calculate ASR: percentage of triggered images classified as the target class
asr = np.mean(predicted_classes_triggered == target_class)
print(f"Attack Success Rate (ASR) on triggered images: {asr:.4f}")
If the attack is successful, you will see two things: accuracy on the clean MNIST test set stays high, close to what an uncompromised model would achieve, while the Attack Success Rate on triggered images approaches 1.0, meaning nearly every digit carrying the trigger is classified as '7'.
This demonstrates the insidious nature of backdoors: the model appears fine under normal testing but contains a hidden vulnerability exploitable by the attacker.
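To build intuition for how unobtrusive the trigger is, it helps to plot one triggered digit next to its clean version and compare the backdoored model's predictions for both. The sketch below assumes the variables from the backdoor example (X_test_flat, test_indices_not_target, apply_trigger, model_backdoor) are still defined.
import numpy as np
import matplotlib.pyplot as plt

sample_idx = test_indices_not_target[0]  # a test digit whose true class is not the target
clean_img = X_test_flat[sample_idx].reshape(28, 28)
triggered_img = apply_trigger(X_test_flat[sample_idx:sample_idx + 1]).reshape(28, 28)

# Predictions from the backdoored model for the clean and triggered versions
pred_clean = np.argmax(model_backdoor.predict(clean_img.reshape(1, 784)), axis=1)[0]
pred_trig = np.argmax(model_backdoor.predict(triggered_img.reshape(1, 784)), axis=1)[0]

fig, axes = plt.subplots(1, 2, figsize=(6, 3))
axes[0].imshow(clean_img, cmap='gray')
axes[0].set_title(f"Clean: predicted {pred_clean}")
axes[1].imshow(triggered_img, cmap='gray')
axes[1].set_title(f"Triggered: predicted {pred_trig}")
for ax in axes:
    ax.axis('off')
plt.tight_layout()
plt.show()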
These hands-on examples provide a starting point for understanding how data poisoning and backdoor attacks are implemented. We've seen an availability attack that degrades overall accuracy through label flipping, a targeted integrity attack that misclassifies a chosen instance while leaving aggregate accuracy largely intact, and a backdoor attack that ties a hidden trigger to an attacker-chosen class.
Key takeaways from this practical exercise: a small fraction of poisoned samples can measurably change a model's behavior; targeted and backdoor attacks are stealthy precisely because clean-data accuracy remains high; and meaningful evaluation therefore requires attack-specific measurements, such as the target instance's prediction and the Attack Success Rate, rather than overall accuracy alone.
Remember, attackers often employ more advanced techniques, including optimization-based poison generation and less visually obvious triggers (especially for non-image data). Clean-label attacks, where the poisoned data still looks correctly labeled, pose an even greater challenge. Frameworks like ART (Adversarial Robustness Toolbox) offer implementations of many sophisticated poisoning and backdoor attacks, along with defenses, which are essential tools for further exploration and research in this area. The next chapters will discuss how to evaluate robustness against such threats and explore defense mechanisms.
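As a pointer for that exploration, the fragment below sketches how the trigger-stamping idea from this section maps onto ART's poisoning module. The class and helper names (PoisoningAttackBackdoor, add_pattern_bd, the broadcast argument) follow ART's documented API, but expected input shapes and defaults can differ between releases, so treat this as an assumption to verify against your installed version rather than a drop-in recipe.
import numpy as np
from art.attacks.poisoning import PoisoningAttackBackdoor
from art.attacks.poisoning.perturbations import add_pattern_bd

# Placeholder batch of grayscale digits in [0, 1]; in practice, use real training images
x_clean = np.random.rand(16, 28, 28).astype("float32")
# Single one-hot target label ('7'), broadcast to every poisoned sample
target_label = np.eye(10)[7].astype("float32")

backdoor = PoisoningAttackBackdoor(add_pattern_bd)  # stamps a small corner pattern as the trigger
x_poisoned, y_poisoned = backdoor.poison(x_clean, y=target_label, broadcast=True)

# x_poisoned / y_poisoned would then be appended to the training set,
# just as we did manually above, before fitting the model
print(x_poisoned.shape, y_poisoned.shape)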