Implementing a Train-Synthetic-Test-Real (TSTR) evaluation workflow requires nothing more than Python and common data science libraries. The objective is to train a machine learning model solely on synthetic data and evaluate its performance on unseen real data, providing a direct measure of the synthetic data's utility for downstream tasks.

## Setting Up the Evaluation

First, we need our datasets: the original real dataset and the synthetic dataset generated from it. For a fair evaluation, we must partition the real data appropriately. A portion will be used to train a baseline model (Train-Real-Test-Real, or TRTR), and a separate, held-out portion will serve as the common test set for both the baseline model and the TSTR model. It is essential that this real test set was not used in any way during the generation of the synthetic data.

Let's assume you have your real data in a Pandas DataFrame called `real_df` and your synthetic data in `synthetic_df`. Both should contain features and a target variable (let's call it `'target'`).

```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Assume real_df and synthetic_df are pre-loaded Pandas DataFrames
# Example structure:
# real_df = pd.DataFrame(np.random.rand(1000, 5), columns=[f'feat_{i}' for i in range(5)])
# real_df['target'] = np.random.randint(0, 2, 1000)
# synthetic_df = pd.DataFrame(np.random.rand(1000, 5), columns=[f'feat_{i}' for i in range(5)])
# synthetic_df['target'] = np.random.randint(0, 2, 1000)

# Separate features (X) and target (y) for real data
X_real = real_df.drop('target', axis=1)
y_real = real_df['target']

# Split the REAL data into training and testing sets.
# The test set (X_test_real, y_test_real) is reserved for evaluation ONLY.
X_train_real, X_test_real, y_train_real, y_test_real = train_test_split(
    X_real, y_real, test_size=0.3, random_state=42, stratify=y_real
)

# Prepare synthetic data (assuming it has the same columns)
X_synthetic = synthetic_df.drop('target', axis=1)
y_synthetic = synthetic_df['target']

print(f"Real training set size: {X_train_real.shape[0]}")
print(f"Real test set size: {X_test_real.shape[0]}")
print(f"Synthetic training set size: {X_synthetic.shape[0]}")
```

Ensure that any preprocessing steps (such as scaling or encoding) applied to the real training data are also applied consistently to the synthetic data before training the TSTR model, and to the real test data before evaluation.
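As a concrete illustration, here is a minimal sketch of consistent preprocessing. It assumes purely numeric features and uses `StandardScaler` as a stand-in for whatever transformations your data actually requires; the `_prep` variable names are introduced here for illustration only.

```python
from sklearn.preprocessing import StandardScaler

# Sketch: fit the preprocessing on the real training data only,
# then reuse the fitted transformer for every other dataset.
# StandardScaler is a placeholder for your actual preprocessing steps.
scaler = StandardScaler()
X_train_real_prep = scaler.fit_transform(X_train_real)  # fit + transform on real training data
X_synthetic_prep = scaler.transform(X_synthetic)        # same transformation for synthetic data
X_test_real_prep = scaler.transform(X_test_real)        # and for the held-out real test set
```

Because the transformer is fitted once and only reused afterwards, the TRTR model, the TSTR model, and the evaluation all see inputs on the same scale, and no information from the real test set leaks into training.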
## Training the Baseline Model (TRTR)

To understand how well the synthetic data performs, we need a benchmark. This benchmark is established by training a model on the real training data and evaluating it on the real test data. We'll use a standard classifier, a RandomForest, for this example.

```python
# Initialize the classifier.
# Use the same model type and hyperparameters for both TRTR and TSTR.
model_params = {'n_estimators': 100, 'max_depth': 10, 'random_state': 42}
baseline_model = RandomForestClassifier(**model_params)

# Train the baseline model on REAL data
baseline_model.fit(X_train_real, y_train_real)

# Evaluate the baseline model on the REAL test set
baseline_predictions = baseline_model.predict(X_test_real)
baseline_probabilities = baseline_model.predict_proba(X_test_real)[:, 1]  # Probabilities for AUC

# Calculate baseline performance metrics
baseline_accuracy = accuracy_score(y_test_real, baseline_predictions)
baseline_f1 = f1_score(y_test_real, baseline_predictions)
baseline_auc = roc_auc_score(y_test_real, baseline_probabilities)

print("\nBaseline Model Performance (Train Real, Test Real):")
print(f"Accuracy: {baseline_accuracy:.4f}")
print(f"F1 Score: {baseline_f1:.4f}")
print(f"AUC: {baseline_auc:.4f}")

# Store results for comparison
results = {
    'TRTR': {'Accuracy': baseline_accuracy, 'F1': baseline_f1, 'AUC': baseline_auc}
}
```

These TRTR scores represent the performance achievable using the original data under the chosen modeling setup.

## Training the TSTR Model

Now we perform the core TSTR step. We train an identical model (same type, same hyperparameters) but use the synthetic data for training. The evaluation, importantly, still happens on the real test set (`X_test_real`, `y_test_real`).

```python
# Initialize an identical classifier for TSTR
tstr_model = RandomForestClassifier(**model_params)

# Train the TSTR model on SYNTHETIC data.
# Ensure X_synthetic has the same features as X_train_real.
tstr_model.fit(X_synthetic, y_synthetic)

# Evaluate the TSTR model on the REAL test set
tstr_predictions = tstr_model.predict(X_test_real)
tstr_probabilities = tstr_model.predict_proba(X_test_real)[:, 1]  # Probabilities for AUC

# Calculate TSTR performance metrics
tstr_accuracy = accuracy_score(y_test_real, tstr_predictions)
tstr_f1 = f1_score(y_test_real, tstr_predictions)
tstr_auc = roc_auc_score(y_test_real, tstr_probabilities)

print("\nTSTR Model Performance (Train Synthetic, Test Real):")
print(f"Accuracy: {tstr_accuracy:.4f}")
print(f"F1 Score: {tstr_f1:.4f}")
print(f"AUC: {tstr_auc:.4f}")

# Store results for comparison
results['TSTR'] = {'Accuracy': tstr_accuracy, 'F1': tstr_f1, 'AUC': tstr_auc}
```

## Comparing Results and Interpretation

The final step is to compare the performance metrics from the TRTR baseline and the TSTR evaluation. The closer the TSTR metrics are to the TRTR metrics, the higher the machine learning utility of the synthetic data.

```python
# Create a DataFrame for easy comparison
results_df = pd.DataFrame(results).T  # Transpose for better readability
print("\nComparison of Model Performance:")
print(results_df)

# Calculate the performance difference (here, TSTR score / TRTR score)
performance_ratio = results_df.loc['TSTR'] / results_df.loc['TRTR']
print("\nPerformance Ratio (TSTR Score / TRTR Score):")
print(performance_ratio)
```

A performance ratio close to 1.0 indicates excellent utility: the model trained on synthetic data performs almost as well as the one trained on real data. Ratios significantly below 1.0 (e.g., below 0.8) suggest the synthetic data lacks important patterns present in the real data, limiting its usefulness for this specific task and model.
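If you run this comparison repeatedly, it can be convenient to wrap the interpretation into a small helper. The sketch below uses a hypothetical `summarize_utility` function whose thresholds simply mirror the illustrative values discussed in this section (below 0.8, 0.8 to 0.95, near 1.0); adjust them to your application's tolerance.

```python
def summarize_utility(performance_ratio, low=0.8, high=0.95):
    """Map each metric's TSTR/TRTR ratio to a rough utility label.

    The default thresholds are the illustrative values from the text,
    not fixed standards.
    """
    summary = {}
    for metric, ratio in performance_ratio.items():
        if ratio >= high:
            summary[metric] = f"high utility (ratio {ratio:.2f})"
        elif ratio >= low:
            summary[metric] = f"moderate utility (ratio {ratio:.2f})"
        else:
            summary[metric] = f"low utility (ratio {ratio:.2f})"
    return summary

print(summarize_utility(performance_ratio))
```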
Moderate ratios (e.g., 0.8 to 0.95) might be acceptable depending on the application's tolerance and the benefits gained from using synthetic data (such as privacy).

We can visualize this comparison as a grouped bar chart of the two models' scores on each metric.

*Figure: Comparison of performance metrics (Accuracy, F1 Score, AUC) for models trained on real data (TRTR) versus synthetic data (TSTR), both evaluated on the same real test set. Lower TSTR bars indicate reduced utility.*

## Further Notes

- Model Choice: The choice of model (RandomForest in this case) can influence results. It is often beneficial to repeat the TSTR evaluation with different types of models relevant to your actual downstream application (e.g., Logistic Regression, Gradient Boosting, Neural Networks); see the sketch at the end of this section.
- Hyperparameters: We used fixed hyperparameters. As discussed previously, synthetic data might lead to different optimal hyperparameters compared to real data. Running hyperparameter optimization separately for the TRTR and TSTR scenarios can provide additional insights but adds complexity to the comparison.
- Data Size: The relative size of the synthetic dataset compared to the real training set can impact performance. Ensure you are comparing reasonably sized datasets.
- Task Complexity: Utility might vary with the complexity of the prediction task (e.g., binary classification vs. multi-class classification vs. regression).

This practical exercise demonstrates the core TSTR workflow. By systematically comparing the performance of models trained on synthetic versus real data, you gain quantitative evidence about the synthetic data's practical value for your machine learning objectives. Remember to adapt the model, metrics, and interpretation based on your specific use case.
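As mentioned under Model Choice above, the whole comparison can be repeated across several model families. The following is a minimal sketch, assuming the datasets prepared earlier, a binary target, and a few scikit-learn classifiers chosen purely as stand-ins for whatever models matter for your downstream task; `candidate_models` and `multi_model_results` are names introduced here for illustration.

```python
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

# Candidate models are placeholders; substitute the model families you actually care about.
# This sketch assumes a binary target and reports AUC only, to keep the output small.
candidate_models = {
    'RandomForest': RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42),
    'LogisticRegression': LogisticRegression(max_iter=1000, random_state=42),
    'GradientBoosting': GradientBoostingClassifier(random_state=42),
}

multi_model_results = {}
for name, model in candidate_models.items():
    scores = {}
    # The same model type is trained once on real data (TRTR) and once on synthetic data (TSTR),
    # always evaluated on the held-out real test set.
    for setting, (X_train, y_train) in {
        'TRTR': (X_train_real, y_train_real),
        'TSTR': (X_synthetic, y_synthetic),
    }.items():
        fitted = clone(model).fit(X_train, y_train)
        probabilities = fitted.predict_proba(X_test_real)[:, 1]
        scores[setting] = roc_auc_score(y_test_real, probabilities)
    scores['AUC ratio'] = scores['TSTR'] / scores['TRTR']
    multi_model_results[name] = scores

print(pd.DataFrame(multi_model_results).T)
```

If the AUC ratio is consistently high across model families, that strengthens the case that the synthetic data captures the patterns needed for this task rather than merely suiting one particular model.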