While Membership Inference Attacks (MIAs) attempt to directly determine if a record was part of the training set, distance-based metrics offer a different, complementary perspective on privacy. They quantify the proximity between synthetic records and original, real records. The underlying assumption is straightforward: if a synthetic data point is extremely close to an actual data point from the training set, it poses a higher privacy risk. Such proximity might allow an adversary to infer information about the real individual corresponding to that nearby record, or even facilitate re-identification.
These metrics don't necessarily require training complex attack models like MIAs; instead, they rely on geometric or feature-space distances.
One fundamental distance-based metric is the Distance to Closest Record (DCR). For each record in the synthetic dataset (S), we calculate its distance to every record in the original, real dataset (R). The DCR for a synthetic record s∈S is the minimum of these distances:
DCR(s) = min_{r ∈ R} distance(s, r)

Here, distance(s, r) can be any suitable distance function, such as Euclidean or Manhattan distance for numerical features, Hamming distance for categorical features, or Gower distance for mixed data types.
The choice of distance metric is important and depends heavily on the nature of your data and the features involved. Remember to appropriately scale numerical features before calculating distances to prevent features with larger ranges from dominating the result.
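As a concrete sketch of computing DCR, the snippet below uses scikit-learn's `NearestNeighbors` with Euclidean distance on standardized numerical features. The function name `dcr` and the random example data are illustrative, not part of any particular library:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

def dcr(synthetic, real):
    """Distance to Closest Record for each synthetic row.

    Features are scaled with statistics from the real data so that
    no single feature dominates the Euclidean distance.
    """
    scaler = StandardScaler().fit(real)
    real_scaled = scaler.transform(real)
    syn_scaled = scaler.transform(synthetic)
    # For each synthetic point, find the distance to its single
    # nearest neighbor in the real dataset.
    nn = NearestNeighbors(n_neighbors=1).fit(real_scaled)
    distances, _ = nn.kneighbors(syn_scaled)
    return distances.ravel()

rng = np.random.default_rng(0)
real = rng.normal(size=(200, 3))       # stand-in for the real dataset R
synthetic = rng.normal(size=(100, 3))  # stand-in for the synthetic dataset S
dcr_values = dcr(synthetic, real)
```

A synthetic record that is an exact copy of a real record would receive a DCR of zero, which is the extreme case this metric is designed to flag.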
A single DCR value isn't very informative on its own. We are more interested in the distribution of DCR values across all synthetic records. Critically, we compare this distribution to a baseline: the distribution of distances to the closest record within the real dataset itself. That is, for each real record r ∈ R, calculate its distance to the nearest other real record r′ ∈ R, r′ ≠ r.
Let's call the DCR for synthetic records DCR_Syn and the within-real DCR DCR_Real.
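One way to sketch this comparison is to compute both distributions with scikit-learn's `NearestNeighbors` and contrast a low quantile of each. The helper name `dcr_to_real` and the percentile choice are illustrative assumptions:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def dcr_to_real(points, real, exclude_self=False):
    """Distance from each point to its closest record in `real`.

    With exclude_self=True, each point's zero-distance match to itself
    is skipped, giving the within-real baseline distribution.
    """
    k = 2 if exclude_self else 1
    nn = NearestNeighbors(n_neighbors=k).fit(real)
    distances, _ = nn.kneighbors(points)
    return distances[:, -1]  # last column: nearest (other) record

rng = np.random.default_rng(1)
real = rng.normal(size=(300, 4))
synthetic = rng.normal(size=(300, 4))

dcr_syn = dcr_to_real(synthetic, real)
dcr_real = dcr_to_real(real, real, exclude_self=True)

# If the generator preserved privacy, dcr_syn should not be
# systematically smaller than dcr_real; comparing low quantiles
# highlights the riskiest records.
q_syn = np.percentile(dcr_syn, 5)
q_real = np.percentile(dcr_real, 5)
```

If the lower tail of DCR_Syn sits well below the lower tail of DCR_Real, some synthetic records are closer to real records than real records are to each other, which warrants closer inspection.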
The following plot illustrates a scenario where some synthetic points are dangerously close to real points.
Scatter plot showing real (blue circles) and synthetic (red crosses) data points. The orange dotted lines highlight synthetic points with very small distances (DCR) to their nearest real neighbors, indicating potential privacy concerns.
Another related metric is the Nearest Neighbor Distance Ratio (NNDR). It aims to normalize the DCR by considering the typical separation between points in the real dataset. For a synthetic record s, the NNDR is often calculated as:
NNDR(s) = distance(s, NN_R(s)) / (average distance to k nearest neighbors within R)

where NN_R(s) is the nearest neighbor of s in the real dataset R. The denominator represents the average distance between close points in the original data, providing context to the numerator (which is essentially the DCR).
A low NNDR suggests that a synthetic point is closer to its nearest real neighbor than typical real points are to their neighbors, again raising a potential privacy flag.
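The denominator can be computed in more than one way; the sketch below uses one plausible reading, taking the average distance from each synthetic point's nearest real neighbor to that neighbor's own k nearest real neighbors as the local scale. The function name `nndr` and the choice k = 5 are assumptions for illustration:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def nndr(synthetic, real, k=5):
    """Nearest Neighbor Distance Ratio per synthetic record.

    Numerator: distance to the nearest real neighbor (the DCR).
    Denominator: average distance from that neighbor to its own
    k nearest real neighbors (a local density scale; one of several
    reasonable choices).
    """
    # Local scale around each real point: mean distance to its
    # k nearest *other* real points (k+1 neighbors, skipping self).
    nn_real = NearestNeighbors(n_neighbors=k + 1).fit(real)
    d_real, _ = nn_real.kneighbors(real)
    local_scale = d_real[:, 1:].mean(axis=1)

    # Nearest real neighbor of each synthetic point.
    nn1 = NearestNeighbors(n_neighbors=1).fit(real)
    d_syn, idx = nn1.kneighbors(synthetic)
    return d_syn.ravel() / local_scale[idx.ravel()]

rng = np.random.default_rng(2)
real = rng.normal(size=(300, 3))
synthetic = rng.normal(size=(150, 3))
ratios = nndr(synthetic, real)
flagged = (ratios < 0.1).sum()  # very low ratios merit manual review
```

The threshold 0.1 here is arbitrary; in practice you would examine the full distribution of ratios rather than a single cutoff.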
While distance-based metrics provide valuable insights, keep a few caveats in mind: results depend heavily on the chosen distance function and on proper feature scaling; distances lose discriminative power in high-dimensional spaces; and a comfortably large distance is a heuristic signal, not a formal privacy guarantee.
Distance-based metrics serve as useful heuristics and sanity checks for privacy. They are often used alongside MIAs and attribute inference assessments to build a more complete picture of the privacy characteristics of a synthetic dataset. They are particularly good at catching instances of near-verbatim copying or minimal perturbation of original records.
© 2025 ApX Machine Learning