Distance-based metrics offer a distinct perspective on privacy assessment. Whereas Membership Inference Attacks (MIAs) attempt to directly determine if a record was part of the training set, distance-based metrics quantify the proximity between synthetic records and original, real records. The underlying assumption is straightforward: if a synthetic data point is extremely close to an actual data point from the training set, it poses a higher privacy risk. This proximity might allow an adversary to infer information about the real individual corresponding to that nearby record, or even facilitate re-identification.
These metrics don't necessarily require training complex attack models like MIAs; instead, they rely on geometric or feature-space distances.
Distance to Closest Record (DCR)
One fundamental distance-based metric is the Distance to Closest Record (DCR). For each record in the synthetic dataset (S), we calculate its distance to every record in the original, real dataset (R). The DCR for a synthetic record s∈S is the minimum of these distances:
DCR(s) = min_{r∈R} distance(s, r)
Here, distance(s,r) can be any suitable distance function, such as:
Manhattan Distance: ∑_{i=1}^{d} |s_i − r_i| (often more robust to outliers than Euclidean distance).
Hamming Distance: Number of positions at which corresponding symbols are different (for categorical features).
Gower Distance: A hybrid measure that can handle mixed data types (numerical, categorical).
The choice of distance metric is important and depends heavily on the nature of your data and the features involved. Remember to appropriately scale numerical features before calculating distances to prevent features with larger ranges from dominating the result.
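As a concrete illustration, here is a minimal sketch of a DCR computation using NumPy and scikit-learn. It assumes the real and synthetic data are already encoded as purely numerical arrays; the names dcr_to_real, real_data, and syn_data are illustrative, not part of any standard API.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors

def dcr_to_real(syn, real, metric="manhattan"):
    """Distance from each synthetic record to its closest real record.

    syn, real: 2D arrays of already numerically encoded features.
    Returns one DCR value per synthetic record.
    """
    # Scale with statistics fitted on the real data only, so features
    # with large ranges do not dominate the distance.
    scaler = StandardScaler().fit(real)
    real_scaled = scaler.transform(real)
    syn_scaled = scaler.transform(syn)

    # Index the real records, then query the single nearest real
    # neighbor for every synthetic record.
    nn = NearestNeighbors(n_neighbors=1, metric=metric).fit(real_scaled)
    distances, _ = nn.kneighbors(syn_scaled)
    return distances.ravel()
```

For mixed numerical and categorical data you would swap in a suitable distance (for example Gower) and handle encoding and scaling per feature type rather than with a single standard scaler.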
Interpreting DCR Values
A single DCR value isn't very informative on its own. We are more interested in the distribution of DCR values across all synthetic records. Critically, we compare this distribution to a baseline: the distribution of distances to the closest record within the real dataset itself. That is, for each real record r∈R, calculate its distance to the nearest other real record r′∈R, r′≠r.
Let's call the DCR for synthetic records DCR_syn and the within-real baseline DCR_real; a short sketch for computing and comparing both follows the list below.
Low DCR_syn values: If many synthetic records have very small DCR values (significantly smaller than the typical DCR_real values), it suggests potential privacy leakage. These synthetic records might be near-copies or slight perturbations of real records.
Similar Distributions: If the distribution of DCR_syn closely mimics the distribution of DCR_real, it can indicate that the synthetic data maintains a similar level of "uniqueness" or separation between points as the original data. This is often a desirable outcome from a privacy perspective using this metric.
High DCR_syn values: If synthetic records are generally much farther from real records than real records are from each other, it might imply lower privacy risk in terms of direct record replication, but could also correlate with lower data fidelity or utility.
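Continuing the sketch above (same imports, and the hypothetical arrays real_data and syn_data), the within-real baseline can be computed by asking for two neighbors per real record and discarding the record itself; the two distributions can then be compared, for instance at a few quantiles.

```python
def dcr_within_real(real, metric="manhattan"):
    """Distance from each real record to its nearest *other* real record."""
    scaler = StandardScaler().fit(real)
    real_scaled = scaler.transform(real)

    # Two neighbors: the first is the record itself (distance 0),
    # the second is the nearest distinct record.
    nn = NearestNeighbors(n_neighbors=2, metric=metric).fit(real_scaled)
    distances, _ = nn.kneighbors(real_scaled)
    return distances[:, 1]

dcr_syn = dcr_to_real(syn_data, real_data)
dcr_real = dcr_within_real(real_data)

# Low quantiles are the most privacy-relevant: they show whether synthetic
# records sit much closer to real records than real records sit to each other.
for q in (0.01, 0.05, 0.50):
    print(f"q={q:.2f}  DCR_syn={np.quantile(dcr_syn, q):.3f}  "
          f"DCR_real={np.quantile(dcr_real, q):.3f}")
```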
The following plot illustrates a scenario where some synthetic points are dangerously close to real points.
Figure: Scatter plot showing real (blue circles) and synthetic (red crosses) data points; orange dotted lines highlight synthetic points with very small distances (DCR) to their nearest real neighbors, indicating potential privacy concerns.
Nearest Neighbor Distance Ratio (NNDR)
Another related metric is the Nearest Neighbor Distance Ratio (NNDR). It aims to normalize the DCR by considering the typical separation between points in the real dataset. For a synthetic record s, the NNDR is often calculated as:
NNDR(s) = distance(s, NN_R(s)) / (average distance to the k nearest neighbors within R)
Where NN_R(s) is the nearest neighbor of s in the real dataset R. The denominator represents the average distance between close points in the original data, providing context to the numerator (which is essentially the DCR).
A low NNDR suggests that a synthetic point is closer to its nearest real neighbor than typical real points are to their neighbors, again raising a potential privacy flag.
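The formula above leaves the denominator open to interpretation; one reasonable reading, sketched below, measures the local spacing around the nearest real neighbor of each synthetic record. The function name nndr and the choice of k are illustrative, not a standard implementation.

```python
def nndr(syn, real, k=5, metric="manhattan"):
    """Nearest Neighbor Distance Ratio for each synthetic record.

    Numerator: distance to the nearest real record (the DCR).
    Denominator: average distance from that nearest real record to its
    own k nearest real neighbors, i.e. the typical local spacing in R.
    """
    scaler = StandardScaler().fit(real)
    real_scaled = scaler.transform(real)
    syn_scaled = scaler.transform(syn)

    nn = NearestNeighbors(metric=metric).fit(real_scaled)

    # Nearest real neighbor of each synthetic record.
    d_syn, idx = nn.kneighbors(syn_scaled, n_neighbors=1)

    # Local spacing around that real neighbor, skipping itself (index 0).
    d_local, _ = nn.kneighbors(real_scaled[idx.ravel()], n_neighbors=k + 1)
    local_spacing = d_local[:, 1:].mean(axis=1)

    # Small epsilon guards against exact-duplicate real records.
    return d_syn.ravel() / (local_spacing + 1e-12)
```

Values well below 1 indicate synthetic points that sit unusually deep inside the local neighborhood of a real record.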
Limitations and Challenges
While distance-based metrics provide valuable insights, keep these points in mind:
Curse of Dimensionality: In high-dimensional spaces, the concept of "closeness" becomes less intuitive. Distances between all points tend to become more uniform, potentially masking true proximity risks. Feature selection or dimensionality reduction might be necessary before applying these metrics effectively.
Choice of Distance Metric: The results are sensitive to the chosen distance function (Euclidean, Manhattan, Gower, etc.). Select one appropriate for your data types and expected data geometry.
Computational Cost: Calculating all pairwise distances between large synthetic and real datasets can be computationally expensive (O(|S| × |R| × d) for basic implementations, where d is the number of dimensions). Efficient nearest neighbor search algorithms (like k-d trees or locality-sensitive hashing) can help but may introduce approximations.
Scaling: Numerical features should generally be scaled (e.g., using standardization or min-max scaling) before distance calculation to ensure features are weighted appropriately.
Interpretation Thresholds: There are no universal "safe" thresholds for DCR or NNDR. Interpretation often involves comparing the distribution for synthetic data against the baseline distribution from the real data. Significant deviations, particularly towards very small distances for synthetic points, warrant further investigation; a small flagging heuristic is sketched after this list.
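Since there are no universal thresholds, one hedged way to operationalize the comparison is to flag synthetic records whose DCR falls below a low quantile of the within-real baseline, reusing the dcr_syn and dcr_real arrays from the earlier sketch. The 5% cut-off is purely illustrative.

```python
# Illustrative heuristic, not a standard: count synthetic records that are
# closer to a real record than 95% of real records are to each other.
threshold = np.quantile(dcr_real, 0.05)
flagged = dcr_syn < threshold
print(f"{flagged.mean():.1%} of synthetic records fall below the "
      f"5th-percentile within-real DCR ({threshold:.3f})")
```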
Distance-based metrics serve as useful heuristics and sanity checks for privacy. They are often used alongside MIAs and attribute inference assessments to build a more complete picture of the privacy characteristics of a synthetic dataset. They are particularly good at catching instances of near-verbatim copying or minimal perturbation of original records.