As highlighted in Chapter 1, evaluating synthetic data solely based on individual feature statistics like mean (μ) or standard deviation (σ) provides an incomplete picture. While matching the marginal distribution P(Xi) for each feature Xi is a necessary first step, it doesn't guarantee that the complex interplay between features is preserved. Real-world datasets are characterized by intricate dependencies and correlations forming a multivariate structure. A high-quality synthetic dataset must replicate this joint distribution, P(X1,X2,...,Xn), not just the individual components. This section introduces methods for assessing this multivariate similarity.
Simply put, we need to check if the relationships between variables in the synthetic data mirror those in the real data. Do features that tend to increase together in the real data also do so in the synthetic data? Are complex, non-linear dependencies captured? Answering these questions requires moving beyond univariate comparisons.
Visual Exploration of Multivariate Structure
Before diving into quantitative metrics, visual inspection is often a valuable starting point, especially for datasets with a relatively small number of dimensions (n).
- Scatter Plots & Pair Plots: For two variables (Xi,Xj), a simple scatter plot can reveal linear or non-linear relationships, clusters, and outliers. Extending this, a pair plot (also known as a scatter plot matrix) displays pairwise scatter plots for all combinations of selected features, along with the marginal distribution of each feature on the diagonal. Comparing the pair plot of the real data to that of the synthetic data gives a qualitative sense of how well pairwise relationships are preserved. Libraries like Seaborn in Python make generating these plots straightforward (see the first sketch after this list).
Figure: A pair plot comparing three features from a real dataset (blue) and a synthetic dataset (orange). The diagonal shows histograms of the marginal distributions, while the off-diagonal panels show pairwise scatter plots. Ideally, the shapes and trends in the synthetic plots closely resemble those in the real plots.
- Dimensionality Reduction: For datasets with many features (n>3), visualizing the full joint distribution directly is impossible. Dimensionality reduction techniques like Principal Component Analysis (PCA) or t-distributed Stochastic Neighbor Embedding (t-SNE) project the high-dimensional data onto a 2D or 3D space. By applying the same projection (learned from the real data or combined data) to both real and synthetic datasets and plotting the results, you can visually inspect whether the overall structure, clusters, and manifolds are preserved. If the synthetic points overlay well with the real points in the low-dimensional embedding, it suggests good structural similarity (see the second sketch after this list).
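To make the pair-plot comparison concrete, here is a minimal sketch using pandas and Seaborn. The `real` and `synthetic` DataFrames are hypothetical stand-ins generated from toy distributions; substitute your own data with matching column names.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical stand-ins for your real and synthetic DataFrames
# (both must share the same column names).
rng = np.random.default_rng(0)
cov = [[1.0, 0.8, 0.2], [0.8, 1.0, 0.3], [0.2, 0.3, 1.0]]
real = pd.DataFrame(rng.multivariate_normal([0, 0, 0], cov, size=500),
                    columns=["x1", "x2", "x3"])
synthetic = pd.DataFrame(rng.standard_normal((500, 3)),
                         columns=["x1", "x2", "x3"])

# Tag each row with its source so seaborn can color the two datasets.
combined = pd.concat([real.assign(source="real"),
                      synthetic.assign(source="synthetic")],
                     ignore_index=True)

# Off-diagonal panels: pairwise scatter plots; diagonal: marginal densities.
sns.pairplot(combined, hue="source", plot_kws={"alpha": 0.4, "s": 15})
plt.show()
```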
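And a minimal sketch of the dimensionality-reduction approach with scikit-learn's PCA, again on hypothetical stand-in arrays. The key point is that the projection is fitted once (here, on the real data) and then applied unchanged to both datasets, so they share the same 2D embedding.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Hypothetical stand-ins (rows = samples, columns = features).
rng = np.random.default_rng(1)
real = rng.standard_normal((1000, 10))
synthetic = 1.2 * rng.standard_normal((1000, 10))

# Fit the projection on the real data only, then apply the *same*
# projection to both datasets so they land in one shared 2D space.
pca = PCA(n_components=2).fit(real)
real_2d = pca.transform(real)
synth_2d = pca.transform(synthetic)

plt.scatter(real_2d[:, 0], real_2d[:, 1], s=8, alpha=0.4, label="real")
plt.scatter(synth_2d[:, 0], synth_2d[:, 1], s=8, alpha=0.4, label="synthetic")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.legend()
plt.show()
```

t-SNE can be substituted here, but note that scikit-learn's standard t-SNE has no transform for out-of-sample points, so in that case it is usually fitted on the pooled real-and-synthetic data instead.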
Quantitative Multivariate Comparison Approaches
Visualizations are insightful but subjective. Quantitative methods provide objective scores for multivariate similarity:
- Distance and Divergence Metrics: Several mathematical measures exist to quantify the "distance" or "divergence" between two probability distributions.
  - Mahalanobis Distance: Measures the distance between a point and a distribution, taking the covariance structure into account. It can be adapted to compare the centroids of two multivariate distributions.
  - Kullback-Leibler (KL) Divergence & Jensen-Shannon (JS) Divergence: These information-theoretic measures quantify the difference between two probability distributions. JS divergence is symmetric and always finite, so it is often preferred over KL divergence. Computing either accurately in high dimensions is challenging and may require density estimation techniques.
  - Maximum Mean Discrepancy (MMD): A kernel-based method that measures the distance between the embeddings of the two datasets in a Reproducing Kernel Hilbert Space (RKHS). It avoids direct density estimation and often works well in high dimensions (see the MMD sketch after this list).
- Hypothesis Testing: Statistical tests can formally assess the null hypothesis that the real and synthetic samples are drawn from the same underlying multivariate distribution. Examples include:
  - Hotelling's T-squared test: A multivariate generalization of the t-test for comparing the mean vectors of two samples.
  - Kernel-based tests: Non-parametric tests built on metrics like MMD.
We will examine hypothesis testing in more detail in the next section.
- Discriminator-Based Evaluation: This approach borrows ideas from Generative Adversarial Networks (GANs). Train a classifier (e.g., a logistic regression model, an SVM, or a neural network) to distinguish between samples from the real dataset and samples from the synthetic dataset. If the classifier struggles to differentiate them (i.e., its accuracy is close to 50% for balanced datasets), it implies the distributions are statistically similar. The Propensity Score method, discussed later in this chapter, is a specific instance of this approach (see the discriminator sketch after this list).
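As a concrete illustration, below is a minimal sketch of a biased estimator of squared MMD with an RBF kernel, built on NumPy, SciPy, and scikit-learn. The median heuristic used to pick the kernel bandwidth is one common convention rather than part of the MMD definition, and the function name is illustrative.

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.metrics.pairwise import rbf_kernel

def mmd_squared(X, Y, gamma=None):
    """Biased estimator of squared MMD between samples X and Y (RBF kernel)."""
    if gamma is None:
        # Median heuristic: set the bandwidth from the median pairwise
        # distance of the pooled sample (a common default, not mandatory).
        med = np.median(pdist(np.vstack([X, Y])))
        gamma = 1.0 / (2.0 * med ** 2)
    k_xx = rbf_kernel(X, X, gamma=gamma)   # kernel values within X
    k_yy = rbf_kernel(Y, Y, gamma=gamma)   # kernel values within Y
    k_xy = rbf_kernel(X, Y, gamma=gamma)   # kernel values across samples
    return k_xx.mean() + k_yy.mean() - 2.0 * k_xy.mean()

# Toy usage: identical generating processes should give an MMD^2 near zero.
rng = np.random.default_rng(0)
real = rng.standard_normal((500, 5))
synthetic = rng.standard_normal((500, 5))
print(f"MMD^2: {mmd_squared(real, synthetic):.5f}")
```

Values near zero indicate that, under the chosen kernel, the two samples are hard to distinguish; since the raw magnitude depends on the kernel, MMD values are most useful for comparing different generators against the same real data.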
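The discriminator idea can be sketched in a few lines with scikit-learn; the `real` and `synthetic` arrays below are hypothetical stand-ins for your own datasets. Cross-validated accuracy near 0.5 on balanced classes suggests the classifier cannot separate the two sources.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical stand-ins; replace with your actual datasets.
rng = np.random.default_rng(1)
real = rng.standard_normal((1000, 5))
synthetic = rng.standard_normal((1000, 5)) + 0.1   # small mean shift

# Label real rows 0 and synthetic rows 1, then pool them.
X = np.vstack([real, synthetic])
y = np.concatenate([np.zeros(len(real)), np.ones(len(synthetic))])

# 5-fold cross-validated accuracy of the discriminator. Accuracy near
# 0.5 means the classifier cannot reliably tell real from synthetic.
clf = LogisticRegression(max_iter=1000)
accuracy = cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()
print(f"Discriminator accuracy: {accuracy:.3f}")
```

A logistic regression discriminator only probes differences it can represent; swapping in a more flexible classifier (e.g., gradient-boosted trees) makes the test more sensitive to non-linear discrepancies.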
Challenges in Multivariate Comparisons
Evaluating multivariate distributions is inherently more complex than performing univariate comparisons:
- Curse of Dimensionality: As the number of features (n) increases, the volume of the data space grows exponentially. Data points become sparse, making it harder to estimate densities or accurately compare distributions. Visualizations become less effective.
- Computational Cost: Many multivariate methods (e.g., kernel methods, complex hypothesis tests) can be computationally expensive, especially with large datasets or high dimensions.
- Interpretability: A single distance or test statistic (like MMD or a p-value) summarizes the overall difference but might not reveal how the distributions differ (e.g., which specific correlations are mismatched).
Despite these challenges, assessing multivariate fidelity is indispensable for ensuring synthetic data truly captures the essence of the original data. The subsequent sections will delve into specific techniques like hypothesis testing, correlation analysis, information-theoretic measures, and propensity scores, providing you with the tools to perform these rigorous evaluations.