Creating synthetic data is only part of the process; ensuring its quality and suitability for machine learning tasks is just as important. Generating data alone isn't enough: we need methods to determine whether the synthetic data accurately reflects the characteristics of real data and whether it will be effective for training models.
This chapter focuses on the methods used to assess the generated data. You will learn why evaluation matters and examine several techniques. We will cover visual inspection using plots and basic statistical comparisons, such as checking if the mean (μ) or standard deviation (σ) of features in the synthetic dataset aligns with the original data. Additionally, we'll introduce methods for comparing data distributions, like using histograms or density plots. Finally, we will discuss the important difference between data fidelity (how closely the synthetic data resembles the real data) and utility (how effective the synthetic data is for a specific machine learning objective).
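As a preview of the basic statistical comparisons covered later in the chapter, here is a minimal sketch of checking whether per-feature means and standard deviations of a synthetic dataset align with the real data. The datasets, feature shapes, and the `compare_moments` helper are illustrative assumptions, not code from this course:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example data: rows are samples, columns are features.
# In practice, `real` would be your original dataset and `synthetic`
# the output of your generator.
real = rng.normal(loc=[0.0, 5.0], scale=[1.0, 2.0], size=(1000, 2))
synthetic = rng.normal(loc=[0.1, 4.8], scale=[1.1, 2.1], size=(1000, 2))

def compare_moments(real_data, synth_data):
    """Return per-feature absolute differences in mean (mu) and std (sigma)."""
    mean_diff = np.abs(real_data.mean(axis=0) - synth_data.mean(axis=0))
    std_diff = np.abs(real_data.std(axis=0) - synth_data.std(axis=0))
    return mean_diff, std_diff

mean_diff, std_diff = compare_moments(real, synthetic)
print("per-feature |mean difference|:", mean_diff)
print("per-feature |std difference|:", std_diff)
```

Small differences suggest the synthetic data matches the real data's first and second moments, though, as discussed later, matching summary statistics alone does not guarantee the full distributions agree.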
5.1 Importance of Evaluation
5.2 Visual Inspection Methods
5.3 Basic Statistical Comparisons
5.4 Comparing Data Distributions
5.5 Concept of Fidelity vs. Utility
© 2025 ApX Machine Learning