While comparing marginal distributions and basic statistics offers a first look at fidelity, as discussed in Chapter 1, it falls short of verifying whether the synthetic data truly captures the complex interplay between variables present in the real data. We need methods that assess the similarity of the joint probability distributions, Preal(X1,...,Xn) and Psynthetic(X1,...,Xn). Hypothesis testing provides a formal statistical framework for this assessment.
In this context, the typical setup involves:

- Null hypothesis (H0): the real and synthetic samples are drawn from the same joint distribution, Preal = Psynthetic.
- Alternative hypothesis (HA): the samples are drawn from different joint distributions, Preal ≠ Psynthetic.
The goal of the test is to determine if there is sufficient statistical evidence to reject H0 in favor of HA. A small p-value (typically below a predefined significance level, e.g., 0.05) suggests that the observed differences between the datasets are unlikely to have occurred by chance if they truly came from the same distribution, leading us to reject H0 and conclude the distributions are different.
Several statistical tests can be adapted or specifically designed for comparing two multivariate samples. Let's examine some prominent techniques suitable for synthetic data evaluation.
Classical tests like the Kolmogorov-Smirnov (KS) test and the Chi-Squared (χ2) test are well-established for univariate distribution comparison.
Kolmogorov-Smirnov (KS) Test: The standard 1D KS test compares the cumulative distribution functions (CDFs) of two samples. Its test statistic measures the maximum absolute difference between the empirical CDFs. While powerful for univariate data, extending it directly to multiple dimensions (>1) is problematic. Defining a unique multivariate CDF and calculating the maximum difference becomes complex, and the tests often lose statistical power rapidly as dimensionality increases. Practical approaches sometimes involve applying 1D KS tests to projections of the data (e.g., principal components) or marginal distributions, but this doesn't fully capture the joint structure.
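As an illustration of the per-marginal approach, the sketch below applies scipy.stats.ks_2samp to each column separately; real and synth are placeholder NumPy arrays with one row per record. The toy example deliberately uses two datasets with identical marginals but different correlation structure, to show what this check cannot detect.

```python
import numpy as np
from scipy.stats import ks_2samp

def marginal_ks_tests(real: np.ndarray, synth: np.ndarray):
    """Apply the 1D KS test to each feature (column) separately.

    This checks marginal fidelity only; it cannot detect differences
    that exist solely in the joint structure (e.g., correlations).
    """
    results = []
    for j in range(real.shape[1]):
        stat, p_value = ks_2samp(real[:, j], synth[:, j])
        results.append({"feature": j, "ks_stat": stat, "p_value": p_value})
    return results

# Toy example: identical standard-normal marginals, different correlation.
rng = np.random.default_rng(0)
real = rng.multivariate_normal([0, 0], [[1.0, 0.9], [0.9, 1.0]], size=1000)
synth = rng.multivariate_normal([0, 0], [[1.0, 0.0], [0.0, 1.0]], size=1000)
for r in marginal_ks_tests(real, synth):
    print(r)  # per-feature p-values are typically large despite differing joints
```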
Chi-Squared (χ2) Test: The Chi-Squared goodness-of-fit test or the test for homogeneity can compare distributions by partitioning the data space into bins and comparing the observed frequencies of data points falling into each bin. However, its effectiveness diminishes significantly in higher dimensions due to the "curse of dimensionality". To maintain a reasonable number of samples per bin, the number of bins required grows exponentially with the number of features, making it impractical for most multivariate datasets unless dimensionality is very low or data is inherently categorical. The choice of binning strategy also heavily influences the results.
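For a single feature (or a very low-dimensional slice of the data), the homogeneity variant can be sketched as below, assuming shared quantile bins and SciPy's chi2_contingency; binned_chi2_test and its arguments are illustrative names, not a standard API.

```python
import numpy as np
from scipy.stats import chi2_contingency

def binned_chi2_test(real_col: np.ndarray, synth_col: np.ndarray, bins: int = 10):
    """Chi-squared test of homogeneity on one binned (continuous) feature.

    Bin edges are shared quantiles of the pooled data, so both samples
    are counted against the same partition of the feature's range.
    """
    pooled = np.concatenate([real_col, synth_col])
    edges = np.unique(np.quantile(pooled, np.linspace(0, 1, bins + 1)))
    real_counts, _ = np.histogram(real_col, bins=edges)
    synth_counts, _ = np.histogram(synth_col, bins=edges)
    table = np.vstack([real_counts, synth_counts])  # 2 x (number of bins) contingency table
    chi2, p_value, dof, _ = chi2_contingency(table)
    return chi2, p_value, dof
```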
Given these limitations, methods specifically designed for multivariate comparisons are often preferred for evaluating synthetic data fidelity.
Maximum Mean Discrepancy (MMD) is a non-parametric metric that measures the distance between distributions based on the mean embeddings of samples in a Reproducing Kernel Hilbert Space (RKHS).
The intuition is to map the data points from each distribution into a potentially infinite-dimensional feature space using a kernel function k(⋅,⋅) (e.g., Gaussian RBF kernel). MMD then calculates the squared distance between the means of the mapped samples in this RKHS.
Let X={x1,...,xm} be samples from Preal and Y={y1,...,yn} be samples from Psynthetic. An empirical estimate of the squared MMD is given by:
$$\mathrm{MMD}^2(X,Y) \;=\; \frac{1}{m^2}\sum_{i=1}^{m}\sum_{j=1}^{m} k(x_i,x_j) \;-\; \frac{2}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n} k(x_i,y_j) \;+\; \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} k(y_i,y_j)$$

If Preal = Psynthetic, then MMD(Preal, Psynthetic) = 0. Larger values indicate greater dissimilarity.
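A minimal NumPy sketch of this estimator, assuming a Gaussian RBF kernel k(a, b) = exp(-gamma * ||a - b||²) with a fixed bandwidth parameter gamma (in practice gamma is often set from the median pairwise distance):

```python
import numpy as np

def rbf_kernel(a: np.ndarray, b: np.ndarray, gamma: float) -> np.ndarray:
    """Gaussian RBF kernel matrix: k(a_i, b_j) = exp(-gamma * ||a_i - b_j||^2)."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2 * a @ b.T
    )
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

def mmd2(x: np.ndarray, y: np.ndarray, gamma: float = 1.0) -> float:
    """Empirical estimate of squared MMD with an RBF kernel (the biased
    estimator, matching the formula above term by term)."""
    k_xx = rbf_kernel(x, x, gamma)   # within-real kernel values
    k_xy = rbf_kernel(x, y, gamma)   # cross kernel values
    k_yy = rbf_kernel(y, y, gamma)   # within-synthetic kernel values
    return k_xx.mean() - 2 * k_xy.mean() + k_yy.mean()
```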
Hypothesis Testing with MMD: A hypothesis test can be constructed using the calculated MMD value as the test statistic. The significance (p-value) is often determined using permutation testing:

- Compute MMD²(X, Y) on the original real and synthetic samples.
- Pool the two samples, randomly reassign the points into two groups of the original sizes m and n, and recompute the statistic; repeat this many times.
- The p-value is the fraction of permuted statistics at least as large as the observed one. A small p-value indicates the observed MMD is unlikely under H0.
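A sketch of this permutation procedure, reusing the mmd2 helper above; the function is written generically, so any two-sample statistic can be plugged in:

```python
import numpy as np

def permutation_pvalue(x, y, statistic, n_permutations=500, seed=0):
    """Permutation p-value for any two-sample statistic (e.g., mmd2 above).

    Pools the samples, repeatedly reassigns points to two groups of the
    original sizes, and counts how often the permuted statistic is at
    least as large as the observed one.
    """
    rng = np.random.default_rng(seed)
    observed = statistic(x, y)
    pooled = np.vstack([x, y])
    m = len(x)
    count = 0
    for _ in range(n_permutations):
        perm = rng.permutation(len(pooled))
        x_perm, y_perm = pooled[perm[:m]], pooled[perm[m:]]
        count += statistic(x_perm, y_perm) >= observed
    return (count + 1) / (n_permutations + 1)

# Example (real and synth as before):
# p = permutation_pvalue(real, synth, lambda a, b: mmd2(a, b, gamma=0.5))
```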
MMD is advantageous because it doesn't require binning, works well in high dimensions, and can capture complex differences depending on the chosen kernel. The choice of kernel and its parameters (e.g., the bandwidth γ for an RBF kernel) is important and can affect sensitivity.
Energy distance provides another powerful non-parametric framework for testing multivariate distributional equality. It's based on the statistical energy concept, related to Newton's gravitational potential energy. The E-statistic compares the distances between points from the two different samples to the distances within each sample.
Let X={x1,...,xm} from Preal and Y={y1,...,yn} from Psynthetic. The E-statistic (related to squared energy distance) is calculated based on Euclidean distances ∣∣⋅∣∣:
$$E(X,Y) \;=\; \frac{2}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n} \lVert x_i - y_j\rVert \;-\; \frac{1}{m^2}\sum_{i=1}^{m}\sum_{j=1}^{m} \lVert x_i - x_j\rVert \;-\; \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} \lVert y_i - y_j\rVert$$

Similar to MMD, E(X,Y) ≥ 0, and E(X,Y) = 0 if and only if Preal = Psynthetic (under moment conditions). Larger values imply greater distributional differences.
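A direct NumPy/SciPy sketch of this statistic using pairwise Euclidean distances:

```python
import numpy as np
from scipy.spatial.distance import cdist

def energy_statistic(x: np.ndarray, y: np.ndarray) -> float:
    """Empirical E-statistic, following the formula above:
    2/(mn) * cross distances - within-real term - within-synthetic term."""
    d_xy = cdist(x, y)  # m x n cross-sample distances
    d_xx = cdist(x, x)  # m x m within-real distances
    d_yy = cdist(y, y)  # n x n within-synthetic distances
    return 2 * d_xy.mean() - d_xx.mean() - d_yy.mean()
```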
Hypothesis Testing with Energy Distance: Permutation testing is also commonly used here. The E-statistic E(X,Y) is calculated for the original samples, and its value is compared against the distribution of E-statistics obtained from permuted samples to compute a p-value.
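Because the procedure is the same, the generic permutation_pvalue helper sketched in the MMD discussion can be reused directly, for example:

```python
# real and synth are placeholder arrays of real and synthetic records
p_energy = permutation_pvalue(real, synth, energy_statistic)
```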
Energy distance is rotation invariant and doesn't require kernel selection, making it somewhat simpler to apply than MMD in some cases, while still being effective in high dimensions.
A very intuitive approach to testing if two samples come from the same distribution is to see if a machine learning classifier can reliably distinguish between them. This is the core idea behind Classifier Two-Sample Tests (C2ST).
Methodology:

- Label each real record with class 0 and each synthetic record with class 1.
- Combine the two labeled datasets and split the result into training and test sets.
- Train a binary classifier (e.g., logistic regression, gradient boosting) to predict the label.
- Evaluate the trained classifier on the held-out test set using a metric such as accuracy or AUC.
Hypothesis Testing with C2ST: The classifier's performance on the held-out test set serves as the test statistic. If H0 is true and the two distributions are identical, no classifier can systematically tell the samples apart, so its accuracy or AUC should be close to the chance level of 0.5.
The p-value can be estimated using permutation tests on the labels or by analyzing the distribution of the performance metric under the null hypothesis. A high accuracy or AUC significantly greater than 0.5 provides evidence to reject H0.
Diagram illustrating the workflow of a Classifier Two-Sample Test (C2ST). Real and synthetic data are labeled, combined, split, and used to train and evaluate a classifier. High classification performance suggests the distributions are different.
C2ST is flexible as various classifiers can be used. Furthermore, if the classifier performs well, feature importance analysis can sometimes provide insights into which features or interactions differ most significantly between the real and synthetic datasets.
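A compact scikit-learn sketch of the full C2ST workflow, assuming real and synth are NumPy arrays with identical columns; the gradient boosting classifier and the 70/30 split are illustrative choices, not requirements:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def c2st_auc(real: np.ndarray, synth: np.ndarray, seed: int = 0) -> float:
    """Train a classifier to separate real (label 0) from synthetic (label 1)
    records and return its held-out ROC AUC. AUC near 0.5 is consistent with H0."""
    X = np.vstack([real, synth])
    y = np.concatenate([np.zeros(len(real)), np.ones(len(synth))])
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed
    )
    clf = GradientBoostingClassifier(random_state=seed)
    clf.fit(X_train, y_train)
    scores = clf.predict_proba(X_test)[:, 1]
    return roc_auc_score(y_test, scores)

# auc = c2st_auc(real, synth)  # values well above 0.5 suggest the distributions differ
```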
When using hypothesis tests for distributional similarity, keep a few practical points in mind. With large sample sizes, even negligible differences between the distributions can yield very small p-values, so statistical significance should be weighed against effect size (for example, the magnitude of the MMD or E-statistic, or how far the C2ST accuracy exceeds 0.5). Conversely, failing to reject H0 does not prove the distributions are identical; it only means the test found insufficient evidence of a difference at the given sample size.
Hypothesis tests offer a rigorous way to move beyond simple statistics and assess the fidelity of the joint distribution. They provide quantitative evidence but should be interpreted carefully, considering sample size, effect size, and the overall evaluation goals.