Alright, let's get our hands dirty and implement some of the data poisoning and backdoor attacks we've discussed. Theory is one thing, but seeing these attacks in action helps solidify understanding. In this section, we'll walk through practical examples using Python and common machine learning libraries. We'll start with basic poisoning to degrade model performance, move to a targeted attack, and then implement a simple backdoor.
Keep in mind that these examples are illustrative. Real-world poisoning often requires more sophisticated optimization techniques to craft subtle and effective poisoned data points, especially for complex models and datasets. However, these foundational examples demonstrate the core mechanics.
We'll use Scikit-learn for simplicity in demonstrating the concepts, focusing on the data manipulation aspect rather than complex model architectures. The same principles extend to deep learning models, where attacks are often implemented with frameworks such as ART (Adversarial Robustness Toolbox), which provides more specialized tooling.
First, ensure you have the necessary libraries installed. We'll primarily use numpy for numerical operations and scikit-learn for datasets and models.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt # Optional for visualization
Let's generate a simple synthetic dataset for our initial experiments. This allows us to control the scenario easily.
# Generate a binary classification dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=42)
# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Training set size: {X_train.shape[0]}")
print(f"Test set size: {X_test.shape[0]}")
Our first goal is simple: reduce the overall accuracy of a model trained on the data. The most straightforward way to do this during the training phase is by injecting mislabeled examples. This is often called label flipping.
Objective: Degrade the model's performance on clean test data.
Method: Randomly select a percentage of the training data and flip their labels.
# --- Poisoning Setup ---
poison_percentage = 0.1 # Poison 10% of the training data
n_poison = int(poison_percentage * len(X_train))
print(f"Injecting {n_poison} poisoned samples.")
# Select random indices to poison
poison_indices = np.random.choice(len(X_train), size=n_poison, replace=False)
# Create copies of the training data to modify
X_train_poisoned = np.copy(X_train)
y_train_poisoned = np.copy(y_train)
# Flip the labels for the selected indices
# For binary classification, flip 0 to 1 and 1 to 0
y_train_poisoned[poison_indices] = 1 - y_train[poison_indices]
# --- Training and Evaluation ---
# Train model on clean data
model_clean = LogisticRegression(random_state=42, max_iter=1000)
model_clean.fit(X_train, y_train)
y_pred_clean = model_clean.predict(X_test)
acc_clean = accuracy_score(y_test, y_pred_clean)
print(f"Accuracy on clean data: {acc_clean:.4f}")
# Train model on poisoned data
model_poisoned = LogisticRegression(random_state=42, max_iter=1000)
model_poisoned.fit(X_train_poisoned, y_train_poisoned)
y_pred_poisoned = model_poisoned.predict(X_test)
acc_poisoned = accuracy_score(y_test, y_pred_poisoned)
print(f"Accuracy after label flipping ({poison_percentage*100}%): {acc_poisoned:.4f}")
You should observe a noticeable drop in accuracy for the model trained on the poisoned dataset. The magnitude of the drop depends on the poisoning percentage, the dataset complexity, and the model's capacity. This simple attack directly conflicts with the learning objective by providing incorrect supervision signals.
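To make that dependence concrete, here is a short follow-up sketch that retrains the model at several flip rates and prints the resulting test accuracy. It reuses the variables and imports defined above; the specific rates are arbitrary choices for illustration.
# Sweep several label-flipping rates and observe how test accuracy degrades
for rate in [0.0, 0.05, 0.1, 0.2, 0.3, 0.4]:
    n_flip = int(rate * len(X_train))
    flip_idx = np.random.choice(len(X_train), size=n_flip, replace=False)
    y_flipped = np.copy(y_train)
    y_flipped[flip_idx] = 1 - y_flipped[flip_idx]  # flip the selected labels
    model = LogisticRegression(random_state=42, max_iter=1000)
    model.fit(X_train, y_flipped)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"Flip rate {rate:.0%}: test accuracy {acc:.4f}")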
Now, let's try a more focused attack. Instead of just degrading overall performance, we want the model to misclassify a specific instance or type of instance. This is an integrity attack. Crafting optimal targeted poisons is complex, often involving optimization to find points that maximally influence the decision boundary near the target. Here, we'll simulate a simpler version.
Objective: Cause a specific clean test sample to be misclassified after training on poisoned data.
Method: Identify a target test sample. Create poison points by slightly modifying copies of training samples from a different class and labeling them as the target's true class. The idea is to subtly shift the decision boundary near the target sample.
# --- Target Selection ---
# Select a specific test instance as the target
target_index = 0
X_target = X_test[target_index].reshape(1, -1)
y_target_true = y_test[target_index]
print(f"Target instance index: {target_index}, True Label: {y_target_true}")
# Check its classification by the clean model (it should be correct ideally)
y_target_pred_clean = model_clean.predict(X_target)
print(f"Target prediction (clean model): {y_target_pred_clean[0]}")
# If already misclassified by clean model, choose another target for a clearer demo
if y_target_pred_clean[0] != y_target_true:
print("Target already misclassified by clean model. Pick another or proceed with caution.")
# Find a correctly classified target
for i in range(len(X_test)):
target_index = i
X_target = X_test[target_index].reshape(1, -1)
y_target_true = y_test[target_index]
y_target_pred_clean = model_clean.predict(X_target)
if y_target_pred_clean[0] == y_target_true:
print(f"New target index: {target_index}, True Label: {y_target_true}")
print(f"Target prediction (clean model): {y_target_pred_clean[0]}")
break
# --- Crafting Poison Points ---
n_poison_targeted = 5 # Number of poison points to craft
target_class_label = y_target_true
source_class_label = 1 - target_class_label
# Find training samples from the 'source' class (the class we *don't* want the target to be)
source_indices = np.where(y_train == source_class_label)[0]
# Select a few source samples randomly
crafting_indices = np.random.choice(source_indices, size=n_poison_targeted, replace=False)
# Create poison points: slightly perturb source samples and label them as the target class
# This is a heuristic. More advanced methods optimize the perturbation.
X_poison_crafted = []
y_poison_crafted = []
perturbation_scale = 0.1 # Small perturbation
for idx in crafting_indices:
X_source_sample = X_train[idx]
# Simple perturbation: add small noise towards the target (or just random noise)
perturbation = (X_target.flatten() - X_source_sample) * perturbation_scale + (np.random.rand(X_source_sample.shape[0]) - 0.5) * 0.05
X_p = X_source_sample + perturbation
X_poison_crafted.append(X_p)
y_poison_crafted.append(target_class_label) # Label as the target's true class
X_poison_crafted = np.array(X_poison_crafted)
y_poison_crafted = np.array(y_poison_crafted)
# --- Training with Targeted Poison ---
# Add crafted poisons to the original training data
X_train_targeted_poison = np.vstack((X_train, X_poison_crafted))
y_train_targeted_poison = np.hstack((y_train, y_poison_crafted))
# Train model on this specifically poisoned data
model_targeted_poison = LogisticRegression(random_state=42, max_iter=1000)
model_targeted_poison.fit(X_train_targeted_poison, y_train_targeted_poison)
# --- Evaluation ---
# Check overall accuracy (might not drop much)
y_pred_targeted_overall = model_targeted_poison.predict(X_test)
acc_targeted_overall = accuracy_score(y_test, y_pred_targeted_overall)
print(f"Overall accuracy (targeted poison): {acc_targeted_overall:.4f}")
# Check the prediction for the specific target instance
y_target_pred_poisoned = model_targeted_poison.predict(X_target)
print(f"Target prediction (poisoned model): {y_target_pred_poisoned[0]} (True Label: {y_target_true})")
if y_target_pred_poisoned[0] != y_target_true:
print("Targeted poisoning successful: Target instance misclassified.")
else:
print("Targeted poisoning failed: Target instance still correctly classified.")
In this scenario, the overall accuracy might remain relatively high, but the specific goal of misclassifying the chosen target instance might be achieved. This demonstrates the stealthier nature of integrity attacks compared to simple availability attacks. Success depends heavily on the poison crafting strategy, the number of poisons, and the model's learning dynamics.
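One useful diagnostic, whether or not the misclassification succeeded, is to check how far the crafted poisons shifted the model's confidence on the target instance. The sketch below assumes the variables from the targeted-poisoning code above are still in scope.
# Compare the models' confidence in the target's true class
# predict_proba columns follow model.classes_, which is [0, 1] for this binary problem
p_clean = model_clean.predict_proba(X_target)[0, y_target_true]
p_poisoned = model_targeted_poison.predict_proba(X_target)[0, y_target_true]
print(f"P(true class | clean model):    {p_clean:.4f}")
print(f"P(true class | poisoned model): {p_poisoned:.4f}")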
Backdoor attacks implant a hidden trigger. The model performs normally on clean data but misbehaves when the trigger pattern is present in the input. Let's simulate this using the MNIST dataset, as visual triggers are intuitive.
Objective: Train a model that classifies digits correctly, but classifies any digit with a specific pixel pattern (the trigger) as a chosen target class (e.g., class '7').
Method: Stamp a small, fixed pixel pattern (the trigger) onto a fraction of the training images, relabel those triggered images as the target class, and train the model on the combined dataset. We then evaluate both clean test accuracy and the rate at which triggered test inputs are classified as the target class.
# We'll need tensorflow/keras for MNIST and a simple model
# Or alternatively, use scikit-learn's fetch_openml('mnist_784')
try:
import tensorflow as tf
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten, Input
from tensorflow.keras.utils import to_categorical
USE_TF = True
except ImportError:
print("TensorFlow not found. Skipping Backdoor example (or adapt using Scikit-learn's MNIST).")
USE_TF = False
if USE_TF:
# --- Load MNIST Data ---
(X_train_mnist, y_train_mnist), (X_test_mnist, y_test_mnist) = mnist.load_data()
# Normalize pixel values to [0, 1]
X_train_mnist = X_train_mnist.astype('float32') / 255.0
X_test_mnist = X_test_mnist.astype('float32') / 255.0
# Flatten images for a simple Dense model (or use CNN)
X_train_flat = X_train_mnist.reshape((X_train_mnist.shape[0], -1))
X_test_flat = X_test_mnist.reshape((X_test_mnist.shape[0], -1))
# One-hot encode labels
y_train_cat = to_categorical(y_train_mnist, 10)
y_test_cat = to_categorical(y_test_mnist, 10)
# --- Backdoor Setup ---
target_class = 7
target_class_cat = to_categorical([target_class], 10)[0]
trigger_size = 3 # 3x3 pixel trigger
trigger_pos = (24, 24) # Bottom-right corner
trigger_value = 1.0 # White pixels
def apply_trigger(images):
images_triggered = np.copy(images)
x, y = trigger_pos
# Apply trigger directly on 2D image shape before flattening
images_2d = images_triggered.reshape((-1, 28, 28))
images_2d[:, x:x+trigger_size, y:y+trigger_size] = trigger_value
return images_2d.reshape((-1, 28*28)) # Return flattened
# --- Create Backdoored Training Samples ---
backdoor_percentage = 0.05 # Use 5% of data for backdooring
n_backdoor_samples = int(backdoor_percentage * len(X_train_flat))
# Select random samples to inject backdoor (choose samples NOT of the target class initially)
potential_indices = np.where(y_train_mnist != target_class)[0]
backdoor_indices = np.random.choice(potential_indices, size=n_backdoor_samples, replace=False)
X_backdoor = X_train_flat[backdoor_indices]
# Apply the trigger (apply_trigger accepts flat or 2D input and returns flattened images)
X_backdoor_triggered = apply_trigger(X_backdoor)
# Set label to target class
y_backdoor_target = np.array([target_class_cat] * n_backdoor_samples)
# --- Combine Datasets ---
X_train_backdoored = np.vstack((X_train_flat, X_backdoor_triggered))
y_train_backdoored = np.vstack((y_train_cat, y_backdoor_target))
# Shuffle the combined dataset
shuffle_idx = np.random.permutation(len(X_train_backdoored))
X_train_backdoored = X_train_backdoored[shuffle_idx]
y_train_backdoored = y_train_backdoored[shuffle_idx]
# --- Train Backdoored Model ---
model_backdoor = Sequential([
Input(shape=(784,)),
Dense(128, activation='relu'),
Dense(10, activation='softmax')
])
model_backdoor.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
print("Training backdoored model...")
model_backdoor.fit(X_train_backdoored, y_train_backdoored, epochs=5, batch_size=128, verbose=1, validation_split=0.1)
# --- Evaluate ---
# 1. Accuracy on clean test data
loss_clean, acc_clean_mnist = model_backdoor.evaluate(X_test_flat, y_test_cat, verbose=0)
print(f"Accuracy on clean MNIST test data: {acc_clean_mnist:.4f}")
# 2. Attack Success Rate (ASR)
# Apply trigger to test images NOT originally of the target class
test_indices_not_target = np.where(y_test_mnist != target_class)[0]
X_test_attack = X_test_flat[test_indices_not_target]
y_test_attack_orig = y_test_cat[test_indices_not_target] # Original labels (one-hot)
# Apply the trigger to the attack set
X_test_attack_triggered = apply_trigger(X_test_attack)
# Predict on triggered images
y_pred_triggered = model_backdoor.predict(X_test_attack_triggered)
predicted_classes_triggered = np.argmax(y_pred_triggered, axis=1)
# Calculate ASR: percentage of triggered images classified as the target class
asr = np.mean(predicted_classes_triggered == target_class)
print(f"Attack Success Rate (ASR) on triggered images: {asr:.4f}")
If the attack is successful, you will see two things: accuracy on the clean MNIST test set stays high, close to what an uncompromised model would achieve, while the Attack Success Rate on triggered images approaches 1.0, meaning nearly every digit carrying the trigger is classified as '7'.
This demonstrates the insidious nature of backdoors: the model appears fine under normal testing but contains a hidden vulnerability exploitable by the attacker.
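To build intuition for how unobtrusive the trigger is, it helps to plot one triggered digit next to its clean version and compare the backdoored model's predictions for both. The sketch below assumes the variables from the backdoor example (X_test_flat, test_indices_not_target, apply_trigger, model_backdoor) are still defined.
import numpy as np
import matplotlib.pyplot as plt

sample_idx = test_indices_not_target[0]  # a test digit whose true class is not the target
clean_img = X_test_flat[sample_idx].reshape(28, 28)
triggered_img = apply_trigger(X_test_flat[sample_idx:sample_idx + 1]).reshape(28, 28)

# Predictions from the backdoored model for the clean and triggered versions
pred_clean = np.argmax(model_backdoor.predict(clean_img.reshape(1, 784)), axis=1)[0]
pred_trig = np.argmax(model_backdoor.predict(triggered_img.reshape(1, 784)), axis=1)[0]

fig, axes = plt.subplots(1, 2, figsize=(6, 3))
axes[0].imshow(clean_img, cmap='gray')
axes[0].set_title(f"Clean: predicted {pred_clean}")
axes[1].imshow(triggered_img, cmap='gray')
axes[1].set_title(f"Triggered: predicted {pred_trig}")
for ax in axes:
    ax.axis('off')
plt.tight_layout()
plt.show()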
These hands-on examples provide a starting point for understanding how data poisoning and backdoor attacks are implemented. We've seen an availability attack that degrades overall accuracy through label flipping, a targeted integrity attack that misclassifies a chosen instance while leaving aggregate accuracy largely intact, and a backdoor attack that ties a hidden trigger to an attacker-chosen class.
Key takeaways from this practical exercise: a small fraction of poisoned samples can measurably change a model's behavior; targeted and backdoor attacks are stealthy precisely because clean-data accuracy remains high; and meaningful evaluation therefore requires attack-specific measurements, such as the target instance's prediction and the Attack Success Rate, rather than overall accuracy alone.
Remember, attackers often employ more advanced techniques, including optimization-based poison generation and less visually obvious triggers (especially for non-image data). Clean-label attacks, where the poisoned data still looks correctly labeled, pose an even greater challenge. Frameworks like ART (Adversarial Robustness Toolbox) offer implementations of many sophisticated poisoning and backdoor attacks, along with defenses, which are essential tools for further exploration and research in this area. The next chapters will discuss how to evaluate robustness against such threats and explore defense mechanisms.
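As a pointer for that exploration, the fragment below sketches how the trigger-stamping idea from this section maps onto ART's poisoning module. The class and helper names (PoisoningAttackBackdoor, add_pattern_bd, the broadcast argument) follow ART's documented API, but expected input shapes and defaults can differ between releases, so treat this as an assumption to verify against your installed version rather than a drop-in recipe.
import numpy as np
from art.attacks.poisoning import PoisoningAttackBackdoor
from art.attacks.poisoning.perturbations import add_pattern_bd

# Placeholder batch of grayscale digits in [0, 1]; in practice, use real training images
x_clean = np.random.rand(16, 28, 28).astype("float32")
# Single one-hot target label ('7'), broadcast to every poisoned sample
target_label = np.eye(10)[7].astype("float32")

backdoor = PoisoningAttackBackdoor(add_pattern_bd)  # stamps a small corner pattern as the trigger
x_poisoned, y_poisoned = backdoor.poison(x_clean, y=target_label, broadcast=True)

# x_poisoned / y_poisoned would then be appended to the training set,
# just as we did manually above, before fitting the model
print(x_poisoned.shape, y_poisoned.shape)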