While comparing marginal distributions and basic statistics gives us initial clues about synthetic data fidelity, as discussed earlier, these methods often fall short in capturing the intricate relationships within high-dimensional datasets. We need techniques that assess whether the overall structure and joint distributions of the synthetic data truly mimic the real data. Propensity score evaluation offers a powerful approach to precisely measure this overall statistical similarity.
The core idea is surprisingly straightforward: Can we train a machine learning model to reliably distinguish between real and synthetic data points based on their features? If the two datasets are statistically similar, any classifier should struggle to tell them apart. Conversely, if the synthetic data significantly differs from the real data in its patterns and distributions, a classifier should easily learn to separate them.
The Propensity Score Method
- Combine Datasets: Create a new dataset by concatenating the real dataset (Dreal) and the synthetic dataset (Dsynth).
- Add a Source Label: Introduce a binary target variable, let's call it source, where source = 0 for records originating from Dreal and source = 1 for records from Dsynth.
- Train a Classifier: Train a standard classification model (e.g., Logistic Regression, a Gradient Boosting Machine such as LightGBM or XGBoost, or a Random Forest) on this combined dataset. The features are the original data columns, and the target is the source label.
- Evaluate the Classifier: Assess the classifier's ability to distinguish between the real and synthetic data. Common metrics include Area Under the Receiver Operating Characteristic Curve (AUC) or accuracy.
The output probability from the classifier for a given data point x, P(source=1∣x), is the propensity score. It represents the model's estimated probability that the data point x belongs to the synthetic dataset.
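As a concrete illustration, here is a minimal sketch of the procedure using scikit-learn. It assumes real_df and synth_df are pandas DataFrames with identical, fully numeric columns (categorical features would need encoding first); the function name and the choice of gradient boosting are illustrative, not prescriptive.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def propensity_score_auc(real_df: pd.DataFrame, synth_df: pd.DataFrame, seed: int = 0):
    """Train a real-vs-synthetic classifier and return its held-out AUC
    plus the propensity scores P(source=1 | x) for the test split."""
    # Combine the datasets and label the source: 0 = real, 1 = synthetic.
    combined = pd.concat([real_df, synth_df], ignore_index=True)
    source = np.concatenate([np.zeros(len(real_df)), np.ones(len(synth_df))])

    # Hold out a test split so the AUC reflects generalization, not memorization.
    X_train, X_test, y_train, y_test = train_test_split(
        combined, source, test_size=0.3, stratify=source, random_state=seed
    )

    # Any standard classifier works; gradient boosting is a reasonable default.
    clf = GradientBoostingClassifier(random_state=seed)
    clf.fit(X_train, y_train)

    # Propensity scores: the predicted probability that a record is synthetic.
    scores = clf.predict_proba(X_test)[:, 1]
    return roc_auc_score(y_test, scores), scores, y_test

# auc, scores, labels = propensity_score_auc(real_df, synth_df)
# print(f"Distinguisher AUC: {auc:.3f}")  # ~0.5 suggests high fidelity
```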
Interpreting the Results
The performance of this "distinguisher" model serves as a direct measure of statistical fidelity:
- High Performance (AUC ≈ 1.0): If the classifier achieves high accuracy or an AUC close to 1.0, it means it can easily separate the real and synthetic data points. This indicates poor statistical fidelity. The synthetic data generation process failed to capture the underlying structure of the real data accurately, leaving discernible differences that the model exploits.
- Low Performance (AUC ≈ 0.5): If the classifier performs poorly, with an accuracy near 50% or an AUC close to 0.5 (equivalent to random guessing on a balanced dataset), it implies that the real and synthetic data points are statistically indistinguishable based on their features. This indicates good statistical fidelity. The synthetic data successfully mimics the characteristics and joint distributions of the real data.
A score between 0.5 and 1.0 reflects intermediate fidelity. An AUC of 0.7, for instance, suggests some dissimilarities exist but they might be less pronounced than in a case with an AUC of 0.9. The acceptable threshold often depends on the specific application requirements.
Visualizing Propensity Scores
Beyond the single AUC metric, visualizing the distribution of the predicted propensity scores for both the real and synthetic samples provides further insight. Ideally, the two distributions should heavily overlap. Significant separation between the distributions highlights systematic differences learned by the classifier.
Overlapping histograms of propensity scores for real and synthetic data. Significant overlap, centered around 0.5, suggests good statistical fidelity.
If the distributions were separated, with real data scores clustering near 0 and synthetic data scores clustering near 1, it would visually confirm poor fidelity.
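A few lines of matplotlib are enough to produce this comparison. The sketch below assumes the scores and labels arrays returned by the hypothetical propensity_score_auc function shown earlier.

```python
import matplotlib.pyplot as plt

# Plot the propensity score distributions for real vs. synthetic records.
plt.hist(scores[labels == 0], bins=30, alpha=0.5, density=True, label="real (source=0)")
plt.hist(scores[labels == 1], bins=30, alpha=0.5, density=True, label="synthetic (source=1)")
plt.axvline(0.5, linestyle="--", color="grey")  # ideal center for indistinguishable data
plt.xlabel("Propensity score P(source=1 | x)")
plt.ylabel("Density")
plt.legend()
plt.show()
```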
Advantages and Considerations
Advantages:
- Holistic Assessment: Provides a single metric (AUC) summarizing the overall statistical similarity across all features and their interactions.
- Multivariate Sensitivity: Implicitly captures differences in complex multivariate relationships, as the classifier uses all features simultaneously.
- Model Agnostic: Can leverage various classification algorithms.
Considerations:
- Classifier Choice: The choice of classifier matters. A simple model like Logistic Regression might miss subtle non-linear differences, while a highly complex model (e.g., a deep neural network) might slightly overfit or find minor, perhaps unimportant, differences, leading to an overly pessimistic assessment. Using well-regularized, standard models like Gradient Boosting or Random Forest is often a good balance.
- Data Preprocessing: Standard preprocessing steps like feature scaling (e.g., standardization) are generally recommended for optimal classifier performance, particularly for linear models; see the sketch after this list.
- Interpretation: While AUC ≈ 0.5 is the ideal target, practical results need context. Compare propensity scores across different synthetic datasets generated using different methods or parameters.
- Diagnostic Limitations: A high AUC tells you the datasets are different, but not how or where they differ. It doesn't pinpoint specific features or correlations that are poorly replicated. Therefore, propensity score evaluation complements, rather than replaces, other methods like direct distributional comparisons or correlation analysis.
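To make the classifier-choice and preprocessing points concrete, the sketch below compares a scaled Logistic Regression pipeline against a Random Forest using cross-validated AUC. It assumes the combined feature matrix and source labels built in the earlier sketch; the specific models and hyperparameters are illustrative only.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# `combined` and `source` as constructed in the earlier propensity score sketch.
candidates = {
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

for name, model in candidates.items():
    # Cross-validated AUC guards against a single lucky or unlucky split.
    aucs = cross_val_score(model, combined, source, cv=5, scoring="roc_auc")
    print(f"{name}: AUC = {aucs.mean():.3f} +/- {aucs.std():.3f}")
```

If the two models report very different AUCs, the gap itself is informative: a linear distinguisher scoring near 0.5 while a tree ensemble scores well above it suggests the remaining real-versus-synthetic differences are non-linear or interaction-driven.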
Propensity score evaluation is a valuable tool in the advanced assessment toolkit. By framing the fidelity question as a classification problem, it offers a practical and interpretable way to quantify how well a synthetic dataset captures the overall statistical essence of its real counterpart. It's particularly useful for comparing different synthetic generation models or hyperparameter settings.