Now that we understand the theoretical underpinnings of the Train-Synthetic-Test-Real (TSTR) methodology, let's put it into practice. This hands-on section guides you through implementing a TSTR evaluation workflow using Python and common data science libraries. The objective is to train a machine learning model solely on synthetic data and evaluate its performance on unseen real data, providing a direct measure of the synthetic data's utility for downstream tasks.
First, we need our datasets: the original real dataset and the synthetic dataset generated from it. For a fair evaluation, we must partition the real data appropriately. A portion will be used to train a baseline model (Train-Real-Test-Real or TRTR), and a separate, held-out portion will serve as the common test set for both the baseline model and the TSTR model. It's absolutely essential that this real test set was not used in any way during the generation of the synthetic data.
Let's assume you have your real data in a Pandas DataFrame called real_df and your synthetic data in synthetic_df. Both should contain features and a target variable (let's call it 'target').
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
import numpy as np
# Assume real_df and synthetic_df are pre-loaded Pandas DataFrames
# Example structure:
# real_df = pd.DataFrame(np.random.rand(1000, 5), columns=[f'feat_{i}' for i in range(5)])
# real_df['target'] = np.random.randint(0, 2, 1000)
# synthetic_df = pd.DataFrame(np.random.rand(1000, 5), columns=[f'feat_{i}' for i in range(5)])
# synthetic_df['target'] = np.random.randint(0, 2, 1000)
# Separate features (X) and target (y) for real data
X_real = real_df.drop('target', axis=1)
y_real = real_df['target']
# Split the REAL data into training and testing sets
# The test set (X_test_real, y_test_real) is reserved for evaluation ONLY.
X_train_real, X_test_real, y_train_real, y_test_real = train_test_split(
X_real, y_real, test_size=0.3, random_state=42, stratify=y_real
)
# Prepare synthetic data (assuming it has the same columns)
X_synthetic = synthetic_df.drop('target', axis=1)
y_synthetic = synthetic_df['target']
print(f"Real training set size: {X_train_real.shape[0]}")
print(f"Real test set size: {X_test_real.shape[0]}")
print(f"Synthetic training set size: {X_synthetic.shape[0]}")
Ensure that any preprocessing steps (like scaling or encoding) applied to the real training data are also applied consistently to the synthetic data before training the TSTR model, and to the real test data before evaluation.
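A random forest is largely insensitive to feature scaling, so the example below trains on the raw features, but if your pipeline does include a transformer, fit it once on the real training data and reuse the fitted transformer everywhere. A minimal sketch, assuming all features are numeric and a StandardScaler is the chosen preprocessing step:
from sklearn.preprocessing import StandardScaler
# Fit the transformer on the REAL training data only, then reuse it for the
# synthetic training data and the real test data. (Illustrative; not required
# for the RandomForest example that follows.)
scaler = StandardScaler()
X_train_real_scaled = scaler.fit_transform(X_train_real)
X_synthetic_scaled = scaler.transform(X_synthetic)
X_test_real_scaled = scaler.transform(X_test_real)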
To understand how well the synthetic data performs, we need a benchmark. This benchmark is established by training a model on the real training data and evaluating it on the real test data. We'll use a standard classifier, like a RandomForest, for this example.
# Initialize the classifier
# Use the same model type and hyperparameters for both TRTR and TSTR
model_params = {'n_estimators': 100, 'max_depth': 10, 'random_state': 42}
baseline_model = RandomForestClassifier(**model_params)
# Train the baseline model on REAL data
baseline_model.fit(X_train_real, y_train_real)
# Evaluate the baseline model on the REAL test set
baseline_predictions = baseline_model.predict(X_test_real)
baseline_probabilities = baseline_model.predict_proba(X_test_real)[:, 1] # Prob for AUC
# Calculate baseline performance metrics
baseline_accuracy = accuracy_score(y_test_real, baseline_predictions)
baseline_f1 = f1_score(y_test_real, baseline_predictions)
baseline_auc = roc_auc_score(y_test_real, baseline_probabilities)
print("\nBaseline Model Performance (Train Real, Test Real):")
print(f"Accuracy: {baseline_accuracy:.4f}")
print(f"F1 Score: {baseline_f1:.4f}")
print(f"AUC: {baseline_auc:.4f}")
# Store results for comparison
results = {
'TRTR': {'Accuracy': baseline_accuracy, 'F1': baseline_f1, 'AUC': baseline_auc}
}
These TRTR scores represent the performance achievable using the original data under the chosen modeling setup.
Now, we perform the core TSTR step. We train an identical model (same type, same hyperparameters) but use the synthetic data for training. The evaluation, importantly, still happens on the real test set (X_test_real, y_test_real).
# Initialize an identical classifier for TSTR
tstr_model = RandomForestClassifier(**model_params)
# Train the TSTR model on SYNTHETIC data
# Ensure X_synthetic has the same features as X_train_real
tstr_model.fit(X_synthetic, y_synthetic)
# Evaluate the TSTR model on the REAL test set
tstr_predictions = tstr_model.predict(X_test_real)
tstr_probabilities = tstr_model.predict_proba(X_test_real)[:, 1] # Prob for AUC
# Calculate TSTR performance metrics
tstr_accuracy = accuracy_score(y_test_real, tstr_predictions)
tstr_f1 = f1_score(y_test_real, tstr_predictions)
tstr_auc = roc_auc_score(y_test_real, tstr_probabilities)
print("\nTSTR Model Performance (Train Synthetic, Test Real):")
print(f"Accuracy: {tstr_accuracy:.4f}")
print(f"F1 Score: {tstr_f1:.4f}")
print(f"AUC: {tstr_auc:.4f}")
# Store results for comparison
results['TSTR'] = {'Accuracy': tstr_accuracy, 'F1': tstr_f1, 'AUC': tstr_auc}
The final step is to compare the performance metrics from the TRTR baseline and the TSTR evaluation. The closer the TSTR metrics are to the TRTR metrics, the higher the machine learning utility of the synthetic data.
# Create a DataFrame for easy comparison
results_df = pd.DataFrame(results).T # Transpose for better readability
print("\nComparison of Model Performance:")
print(results_df)
# Calculate performance difference (e.g., TSTR score / TRTR score)
performance_ratio = results_df.loc['TSTR'] / results_df.loc['TRTR']
print("\nPerformance Ratio (TSTR Score / TRTR Score):")
print(performance_ratio)
A performance ratio close to 1.0 indicates excellent utility; the model trained on synthetic data performs almost as well as the one trained on real data. Ratios significantly below 1.0 (e.g., < 0.8) suggest the synthetic data lacks important patterns present in the real data, limiting its usefulness for this specific task and model. Moderate ratios (e.g., 0.8 - 0.95) might be acceptable depending on the application's tolerance and the benefits gained from using synthetic data (like privacy).
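If you evaluate many synthetic datasets, it can help to encode these rules of thumb in a small helper. The thresholds below simply mirror the guidance above and are illustrative rather than standard cut-offs:
def utility_label(ratio):
    # Illustrative thresholds mirroring the rules of thumb described above.
    if ratio >= 0.95:
        return "high utility"
    if ratio >= 0.80:
        return "moderate utility"
    return "low utility"
for metric, ratio in performance_ratio.items():
    print(f"{metric}: ratio = {ratio:.3f} ({utility_label(ratio)})")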
We can visualize this comparison with a grouped bar chart of the performance metrics (Accuracy, F1 Score, AUC) for the models trained on real data (TRTR) and synthetic data (TSTR), both evaluated on the same real test set. Lower TSTR bars indicate reduced utility.
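One way to produce such a chart directly from results_df is pandas' matplotlib-based plotting; the following is a minimal sketch (figure size, labels, and axis limits are arbitrary choices):
import matplotlib.pyplot as plt
# Grouped bar chart: one group per metric, one bar per training regime.
ax = results_df.T.plot(kind='bar', figsize=(7, 4), rot=0)
ax.set_ylabel('Score')
ax.set_ylim(0, 1)
ax.set_title('TRTR vs. TSTR on the real test set')
ax.legend(title='Training regime')
plt.tight_layout()
plt.show()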
This practical exercise demonstrates the core TSTR workflow. By systematically comparing the performance of models trained on synthetic versus real data, you gain quantitative evidence about the synthetic data's practical value for your machine learning objectives. Remember to adapt the model, metrics, and interpretation based on your specific use case.
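For instance, swapping in a different classifier only requires changing the estimator; the rest of the workflow is unchanged. A minimal sketch, using LogisticRegression purely as an arbitrary alternative (scale-sensitive models may benefit from the preprocessing discussed earlier):
from sklearn.linear_model import LogisticRegression
def auc_on_real_test(model, X_train, y_train):
    # Train on the given data, always evaluate on the held-out real test set.
    model.fit(X_train, y_train)
    return roc_auc_score(y_test_real, model.predict_proba(X_test_real)[:, 1])
lr_trtr_auc = auc_on_real_test(LogisticRegression(max_iter=1000), X_train_real, y_train_real)
lr_tstr_auc = auc_on_real_test(LogisticRegression(max_iter=1000), X_synthetic, y_synthetic)
print(f"Logistic regression AUC ratio (TSTR / TRTR): {lr_tstr_auc / lr_trtr_auc:.3f}")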