Let's put the theory of membership inference attacks into practice. As we discussed earlier in the chapter, the goal is to determine if a specific data point was part of the target model's training set, often by observing the model's behavior when processing that data point. The intuition is that models might respond differently, perhaps with higher confidence or specific output patterns, to data they have already seen during training compared to unseen data.
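To make this intuition concrete before building the full pipeline, here is a minimal sketch of the simplest possible attack: flag a sample as a member whenever the model's top predicted probability exceeds a threshold. The names model, x, and threshold are placeholders for illustration, and this baseline is far weaker than the shadow-model attack developed below.

def confidence_threshold_mia(model, x, threshold=0.9):
    # Naive baseline (illustrative only): predict "member" when the model's
    # highest softmax probability for a sample exceeds a chosen threshold.
    probs = model.predict(x)                         # shape: (n_samples, n_classes)
    top_confidence = probs.max(axis=1)               # highest class probability per sample
    return (top_confidence > threshold).astype(int)  # 1 = predicted member, 0 = non-member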
We'll focus on a common and effective approach: using shadow models. This technique involves training auxiliary models (the "shadow models") to mimic the behavior of the target model. We then train an "attack model" to distinguish between the outputs produced by these shadow models on their respective training (member) and testing (non-member) data. Finally, this attack model is used to classify outputs from the actual target model.
Imagine an attacker has query access to a trained target model, f_target. The attacker also possesses some data similar in distribution to the target model's private training set, D_target_train. The attacker does not know D_target_train but wants to determine, for a given sample x, whether x ∈ D_target_train.
The core steps for a shadow model-based MIA are:
1. Train several shadow models on data drawn from a distribution similar to the target's training data, each with its own known member (training) and non-member (held-out) split.
2. Query each shadow model on both splits and record its output vectors, labeled as member or non-member.
3. Train an attack model on these labeled outputs to distinguish member outputs from non-member outputs.
4. Query the target model on candidate samples and pass its outputs to the attack model to predict membership.
Let's illustrate this with pseudocode snippets, assuming you have a function define_model() that returns a model architecture (e.g., a Keras/PyTorch model) and a function train_model(model, train_data, epochs) that trains it. We'll also assume the data is appropriately prepared (e.g., using TensorFlow Datasets or PyTorch DataLoaders).
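For concreteness, here is one possible shape these helpers could take, assuming a small Keras image classifier with datasets exposing .x (features) and .y (labels); treat it as a sketch, not the exact architecture of any particular target model.

import tensorflow as tf

def define_model(architecture=None, input_shape=(32, 32, 3), num_classes=10):
    # Minimal CNN stand-in for the (approximated) target architecture.
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=input_shape),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(num_classes, activation='softmax'),
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model

def train_model(model, train_data, epochs=50):
    # train_data is assumed to expose .x (features) and .y (integer labels).
    model.fit(train_data.x, train_data.y, epochs=epochs, batch_size=64, verbose=0)
    return model

With these helpers in place, the shadow model training loop looks like this: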
import numpy as np

# Assume attacker_data is a dataset available to the attacker
# Assume num_shadow_models is the desired number of shadow models
# Assume target_model_architecture is known or approximated
shadow_model_outputs = []
shadow_model_labels = []

for i in range(num_shadow_models):
    print(f"Training shadow model {i+1}/{num_shadow_models}...")

    # Split the attacker's data for this shadow model into disjoint
    # member (train) and non-member (held-out) partitions
    shadow_train_data, shadow_test_data = split_data(attacker_data, ratio=0.5)

    # Define and train the shadow model, mimicking the target architecture
    shadow_model = define_model(architecture=target_model_architecture)
    train_model(shadow_model, shadow_train_data, epochs=50)  # Example epochs

    # Get predictions for its own training data (members)
    member_predictions = shadow_model.predict(shadow_train_data.x)  # Output vectors
    member_labels = np.ones(len(member_predictions))

    # Get predictions for its own held-out data (non-members)
    non_member_predictions = shadow_model.predict(shadow_test_data.x)  # Output vectors
    non_member_labels = np.zeros(len(non_member_predictions))

    # Store results for attack model training
    shadow_model_outputs.append(np.concatenate((member_predictions, non_member_predictions)))
    shadow_model_labels.append(np.concatenate((member_labels, non_member_labels)))

# Combine results from all shadow models into a single attack training set
attack_train_X = np.concatenate(shadow_model_outputs)
attack_train_y = np.concatenate(shadow_model_labels)
print(f"Generated attack training data: {attack_train_X.shape}, {attack_train_y.shape}")
This loop simulates the process of creating proxies for the target model. Each shadow model learns patterns on its own data, and its behavior on seen vs. unseen data provides the signal for the attack model.
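The split_data helper used above is assumed rather than given; one simple way it could be implemented, using scikit-learn and a small Dataset wrapper matching the .x / .y access pattern above, is:

from collections import namedtuple
from sklearn.model_selection import train_test_split

Dataset = namedtuple('Dataset', ['x', 'y'])  # matches the .x / .y access used above

def split_data(data, ratio=0.5, seed=None):
    # Randomly partition the attacker's data into two disjoint halves:
    # one becomes the shadow model's training set ("members"),
    # the other its held-out set ("non-members").
    x_in, x_out, y_in, y_out = train_test_split(
        data.x, data.y, train_size=ratio, random_state=seed, stratify=data.y
    )
    return Dataset(x_in, y_in), Dataset(x_out, y_out)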
Flow of a membership inference attack using shadow models. The attacker trains shadow models on their own data to generate training data for an attack model. This attack model then predicts membership based on the output of the target model for a specific query data point.
We can use a simple classifier, such as Logistic Regression or a small Multi-Layer Perceptron (MLP) from scikit-learn, as the attack model.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
# from sklearn.neural_network import MLPClassifier  # Needed for the MLP variant below
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler

# Optional: scale the features (the shadow models' output vectors)
scaler = StandardScaler()
attack_train_X_scaled = scaler.fit_transform(attack_train_X)

# Hold out part of the attack data to evaluate the attack model itself
X_train_att, X_test_att, y_train_att, y_test_att = train_test_split(
    attack_train_X_scaled, attack_train_y, test_size=0.3, stratify=attack_train_y, random_state=42
)

# Train the attack model
attack_model = LogisticRegression(solver='liblinear', random_state=42)
# Or use a small MLP:
# attack_model = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300, random_state=42)
attack_model.fit(X_train_att, y_train_att)

# Evaluate on the held-out attack data split
y_pred_att = attack_model.predict(X_test_att)
print("Attack Model Performance (on shadow model data):")
print(f"Accuracy: {accuracy_score(y_test_att, y_pred_att):.4f}")
print(classification_report(y_test_att, y_pred_att))
This evaluation tells us how well the attack model learned to distinguish member vs. non-member outputs from the shadow models.
Now, the crucial step: use the trained attack_model on the target model's outputs. To properly evaluate the attack's success against the target, we need ground-truth knowledge about the membership of some data points relative to D_target_train. This is usually possible in a research setting where we control the target model's training, but not in a real black-box scenario.
Let's assume we have target_model, along with access to its original training set target_train_data and test set target_test_data, for evaluation purposes.
# Get target model outputs for known members (from its training set)
target_member_outputs = target_model.predict(target_train_data.x)
# Get target model outputs for known non-members (from its test set)
target_non_member_outputs = target_model.predict(target_test_data.x)
# Combine these outputs to form the test set for the attack model against the target
attack_test_target_X = np.concatenate((target_member_outputs, target_non_member_outputs))
attack_test_target_y = np.concatenate([np.ones(len(target_member_outputs)), np.zeros(len(target_non_member_outputs))])
# Scale using the same scaler fitted on shadow model outputs
attack_test_target_X_scaled = scaler.transform(attack_test_target_X) # Use transform, not fit_transform
# Predict membership using the attack model
final_predictions = attack_model.predict(attack_test_target_X_scaled)
final_probabilities = attack_model.predict_proba(attack_test_target_X_scaled)[:, 1] # Probability of being a member
# Evaluate the attack success on the target model
print("\nMembership Inference Attack Performance (on target model data):")
print(f"Accuracy: {accuracy_score(attack_test_target_y, final_predictions):.4f}")
print(classification_report(attack_test_target_y, final_predictions))
# You can also calculate AUC using final_probabilities and attack_test_target_y
# from sklearn.metrics import roc_auc_score
# print(f"AUC: {roc_auc_score(attack_test_target_y, final_probabilities):.4f}")
The key metric here is the final accuracy (or AUC) of the attack_model on the target model's data. For a balanced mix of members and non-members, random guessing yields about 50% accuracy (an AUC of 0.5), so results noticeably above that level indicate membership information leakage.
High precision for the "member" class means that when the attack predicts "member", it's likely correct. High recall means the attack identifies a large fraction of the true members.
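To read these two quantities directly from the confusion matrix, a short follow-up like the sketch below can help; it reuses attack_test_target_y and final_predictions from the evaluation above.

from sklearn.metrics import confusion_matrix

# Rows: true label (0 = non-member, 1 = member); columns: predicted label.
tn, fp, fn, tp = confusion_matrix(attack_test_target_y, final_predictions).ravel()

member_precision = tp / (tp + fp)  # when we predict "member", how often are we right?
member_recall = tp / (tp + fn)     # what fraction of true members do we identify?
print(f"Member precision: {member_precision:.4f}, member recall: {member_recall:.4f}")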
Example evaluation results showing an attack performing noticeably better than random guessing, indicating potential membership information leakage from the target model.
This hands-on perspective demonstrates that membership inference is not just a theoretical concern. With appropriate access and data, attackers can build models to probe the training history of machine learning systems, raising significant privacy questions. Understanding how to implement these attacks is the first step towards evaluating model vulnerabilities and designing effective defenses.