After assessing the statistical similarity between real and synthetic datasets, we turn to a critical practical question: How useful is the synthetic data for training actual machine learning models? This chapter focuses on quantifying this machine learning utility.
You will learn standard evaluation approaches, primarily the Train-Synthetic-Test-Real (TSTR) methodology. In TSTR, a model is trained using only synthetic data and then evaluated on a held-out set of real data. We will also examine the complementary Train-Real-Test-Synthetic (TRTS) approach.
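The TSTR workflow described above can be sketched in a few lines. This is a minimal illustration, not the chapter's reference implementation: the random forest model, the AUC metric, and the `make_classification` stand-ins for your real and synthetic datasets are all illustrative choices.

```python
# Sketch of a TSTR evaluation. The datasets here are synthetic stand-ins;
# in practice, substitute your real dataset and generator output.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Stand-ins for the real and synthetic datasets (illustrative only).
X_real, y_real = make_classification(n_samples=2000, random_state=0)
X_synth, y_synth = make_classification(n_samples=2000, random_state=1)

# Hold out real data for testing; in TSTR it is never used for training.
_, X_test, _, y_test = train_test_split(
    X_real, y_real, test_size=0.3, random_state=42
)

# TSTR: train on synthetic data only, evaluate on held-out real data.
model = RandomForestClassifier(random_state=0)
model.fit(X_synth, y_synth)
tstr_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"TSTR AUC: {tstr_auc:.3f}")
```

Comparing this TSTR score against the score of an identical model trained on real data gives a direct measure of how much downstream performance is lost by substituting synthetic data.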
Key techniques covered include TSTR and TRTS evaluation workflows, comparison of downstream model performance metrics, and assessment of feature importance consistency between models trained on real and synthetic data.

The goal is to provide objective measures for deciding whether synthetic data can effectively stand in for real data in your downstream machine learning applications. This involves hands-on practice implementing these evaluation workflows.
3.1 Train-Synthetic-Test-Real (TSTR) Methodology
3.2 Train-Real-Test-Synthetic (TRTS) Methodology
3.3 Comparing Downstream Model Performance Metrics
3.4 Assessing Feature Importance Consistency
3.5 Hyperparameter Optimization Effects
3.6 Hands-on practical: Running TSTR Evaluations
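As a preview of the consistency check covered in Section 3.4, one simple approach is to train the same model on real and on synthetic data, then compare the resulting feature importance rankings with a rank correlation. This is a hedged sketch, assuming tree-based models and using `make_classification` placeholders for both datasets; in practice you would substitute your own data.

```python
# Minimal feature importance consistency check: fit identical models on
# real and synthetic data, then rank-correlate their importances.
# Both datasets here are illustrative placeholders.
from scipy.stats import spearmanr
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_real, y_real = make_classification(n_samples=1000, n_features=10, random_state=0)
X_synth, y_synth = make_classification(n_samples=1000, n_features=10, random_state=1)

imp_real = RandomForestClassifier(random_state=0).fit(X_real, y_real).feature_importances_
imp_synth = RandomForestClassifier(random_state=0).fit(X_synth, y_synth).feature_importances_

# A Spearman correlation near 1.0 means both models agree on which
# features matter most; values near 0 signal inconsistent importances.
rho, _ = spearmanr(imp_real, imp_synth)
print(f"Feature importance rank correlation: {rho:.2f}")
```

High rank agreement suggests the synthetic data preserves the real data's predictive structure, not just its marginal distributions.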
© 2025 ApX Machine Learning