When synthetic data generation incorporates Differential Privacy (DP), the approach to privacy assessment changes in character. Differential Privacy provides a formal, mathematical guarantee about the privacy protection afforded to individuals in the original dataset. Unlike the empirical attack-based methods discussed earlier (such as MIAs), DP offers a provable bound on privacy leakage before any specific attack is considered.
Understanding Differential Privacy Guarantees
At its heart, (ϵ,δ)-Differential Privacy ensures that the output distribution of the synthetic data generation algorithm is nearly the same whether or not any single individual's record was included in the input. Formally, a randomized algorithm M satisfies (ϵ,δ)-DP if for any two adjacent datasets D1 and D2 (differing by only one individual's record), and for any set of possible outputs S:
P[M(D1) ∈ S] ≤ e^ϵ · P[M(D2) ∈ S] + δ
- Epsilon (ϵ): This is the privacy budget. A smaller ϵ value implies stronger privacy protection, meaning the presence or absence of a single individual has a very limited impact on the output distribution. An ϵ of 0 would mean the output is entirely independent of the input data (maximum privacy, likely zero utility). Typical values range from less than 1 (strong privacy) to around 10 (weaker privacy).
- Delta (δ): This parameter represents the probability that the pure ϵ-DP guarantee might be broken. It should ideally be a very small number, often less than 1/n, where n is the size of the original dataset. If δ=0, the mechanism provides pure ϵ-DP.
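To make these parameters more concrete, the short sketch below shows two standard interpretations of the guarantee: the bound on how much an event's probability can change across adjacent datasets, and the worst-case shift in an adversary's belief about a record's membership under pure ϵ-DP. It is an illustrative sketch only; the function names are ours and not part of any library.

```python
import math

def adjacent_output_bound(p: float, eps: float, delta: float) -> float:
    """Max probability the same event can have on an adjacent dataset,
    given it has probability p here, under (eps, delta)-DP."""
    return min(1.0, math.exp(eps) * p + delta)

def worst_case_posterior(prior: float, eps: float) -> float:
    """Worst-case belief that a target record was in the input, starting from
    `prior`, after observing any output of a pure eps-DP mechanism.
    Pure eps-DP bounds the likelihood ratio of any output by e^eps."""
    prior_odds = prior / (1.0 - prior)
    worst_odds = math.exp(eps) * prior_odds
    return worst_odds / (1.0 + worst_odds)

if __name__ == "__main__":
    for eps in (0.1, 1.0, 5.0, 10.0):
        print(f"eps={eps:>4}: posterior from a 50% prior <= {worst_case_posterior(0.5, eps):.3f}")
```

For example, at ϵ = 1 an adversary starting from a 50% prior can become at most roughly 73% confident that a given record was present; at ϵ = 10 the bound is essentially 1, which is why large budgets offer little formal protection.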
If the synthetic data generation process claims to be (ϵ,δ)-differentially private, this fundamentally changes the privacy evaluation landscape. The primary focus shifts from discovering vulnerabilities through attacks to verifying and interpreting the claimed DP guarantee.
Verifying Claimed DP Guarantees
Verifying the claimed (ϵ,δ) parameters is not trivial and typically involves several approaches:
- Algorithm Analysis: The most rigorous method involves a careful theoretical analysis of the specific DP algorithm used for generation (e.g., DP-SGD, Laplace/Gaussian mechanism applied to statistics, PATE). This includes understanding how the privacy budget ϵ is composed and accumulated across different steps of the algorithm. The analysis is often performed by the designers of the DP system and relies on established composition theorems (e.g., basic and advanced composition) to track the total privacy cost; a sketch of this budget accounting appears after this list. As an evaluator, you might review the published methodology or technical papers describing the generation process.
- Implementation Audit: Even if the algorithm is theoretically sound, implementation errors (e.g., incorrect noise calibration, floating-point issues, insecure random number generation) can break the DP guarantees. Auditing the source code or using specialized analysis tools can help identify such flaws, although this requires significant expertise.
- Empirical Testing (Sanity Checks): While empirical tests cannot prove DP, they can serve as sanity checks or help detect gross violations:
  - Sensitivity Analysis: One could try to estimate the sensitivity of the generation process empirically. This involves running the generator on datasets differing by one record and observing the output changes. However, this is often computationally expensive and statistically difficult.
  - Targeted Attacks: Running Membership Inference Attacks specifically designed against DP mechanisms might provide some insight. If an MIA performs significantly better than the theoretical bounds implied by (ϵ,δ), it could indicate a problem with the claim or implementation; a sketch of this bound check appears after this list. However, failure of an attack doesn't prove the DP guarantee holds.
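To illustrate the budget accounting mentioned under Algorithm Analysis, the sketch below applies the basic and advanced composition theorems to k repeated uses of a mechanism. This is a simplified illustration (real accounting for mechanisms like DP-SGD typically uses tighter tools such as Rényi DP or a moments accountant), and the function names are ours.

```python
import math

def basic_composition(eps_step: float, delta_step: float, k: int):
    """Basic composition: epsilons and deltas simply add over k adaptive uses."""
    return k * eps_step, k * delta_step

def advanced_composition(eps_step: float, delta_step: float, k: int, delta_slack: float):
    """Advanced composition: tighter total epsilon for many small steps,
    at the cost of an extra failure probability delta_slack."""
    eps_total = (math.sqrt(2.0 * k * math.log(1.0 / delta_slack)) * eps_step
                 + k * eps_step * (math.exp(eps_step) - 1.0))
    return eps_total, k * delta_step + delta_slack

if __name__ == "__main__":
    print(basic_composition(0.1, 1e-6, k=100))                       # eps = 10.0
    print(advanced_composition(0.1, 1e-6, k=100, delta_slack=1e-5))  # eps ~ 5.9
```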
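For the Targeted Attacks sanity check, one widely used consequence of (ϵ,δ)-DP (its hypothesis-testing interpretation) is that any membership inference attack's true positive rate is bounded by TPR ≤ e^ϵ · FPR + δ. The sketch below checks an empirical operating point against this bound. It is a heuristic screen rather than a statistical test, since measured TPR/FPR carry sampling error, and the function name is ours.

```python
import math

def violates_dp_bound(tpr: float, fpr: float, eps: float, delta: float) -> bool:
    """Flag an attack operating point that is inconsistent with (eps, delta)-DP.
    DP implies TPR <= e^eps * FPR + delta and, symmetrically,
    (1 - FPR) <= e^eps * (1 - TPR) + delta."""
    return (tpr > math.exp(eps) * fpr + delta) or \
           ((1.0 - fpr) > math.exp(eps) * (1.0 - tpr) + delta)

# A hypothetical attack achieving TPR = 0.9 at FPR = 0.1 cannot coexist with a
# genuine (eps = 1, delta = 1e-5) guarantee, so such a result would warrant
# closer inspection of the claim or the implementation.
print(violates_dp_bound(0.9, 0.1, eps=1.0, delta=1e-5))  # True
```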
Interpreting DP Guarantees in Context
Given a claimed (ϵ,δ)-DP guarantee, its practical meaning needs interpretation:
- Worst-Case Bound: DP provides a worst-case guarantee against an adversary with arbitrary background knowledge. It limits the additional information an adversary can learn about an individual specifically because their data was included in the dataset used for generation.
- Privacy vs. Utility: There is an inherent trade-off. Achieving a very small ϵ (strong privacy) typically requires adding more noise or perturbing the data more significantly, which often reduces the statistical fidelity and machine learning utility of the synthetic data. Evaluating this trade-off is important. Does the achieved (ϵ,δ) provide meaningful privacy while maintaining sufficient utility for the intended downstream task?
Utility generally decreases as the privacy budget ϵ decreases, i.e., as the privacy protection becomes stronger.
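To make the trade-off concrete, consider the Laplace mechanism applied to a single count query with sensitivity 1: the noise scale is sensitivity/ϵ, so the noise grows as ϵ shrinks. A minimal, illustrative sketch:

```python
import math

def laplace_noise_std(sensitivity: float, eps: float) -> float:
    """Std. dev. of Laplace noise calibrated for eps-DP: scale b = sensitivity / eps."""
    return math.sqrt(2.0) * (sensitivity / eps)

for eps in (0.1, 0.5, 1.0, 5.0, 10.0):
    print(f"eps={eps:>4}: noise std on a single count = {laplace_noise_std(1.0, eps):.2f}")
```

The noise standard deviation grows from about 0.14 at ϵ = 10 to about 14 at ϵ = 0.1, which is why strong privacy budgets can visibly degrade statistical fidelity.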
DP in the Broader Privacy Assessment
When DP is used, it provides a strong foundational layer of privacy protection. However, it doesn't necessarily negate the value of other techniques discussed in this chapter:
- Complementary Information: MIAs or attribute inference attacks run on DP synthetic data can still provide insights. While DP bounds the success of these attacks in theory, empirical results can help understand the practical privacy level achieved, especially in relation to the specific models and data characteristics. They can also act as the sanity checks mentioned earlier.
- Utility Evaluation: The impact of DP noise on downstream model performance (Chapter 3) becomes a primary concern. Evaluating TSTR/TRTS performance is essential to ensure the data remains useful despite the privacy enhancements; a minimal TSTR sketch follows this list.
- Communication: Clearly stating the (ϵ,δ) values achieved is fundamental when reporting on the privacy characteristics of the synthetic dataset.
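The following TSTR sketch assumes two hypothetical pandas DataFrames, synthetic_df (used for training) and real_test_df (held-out real data), sharing a binary "label" column; the function name, model choice, and data layout are placeholders, not a prescribed evaluation protocol.

```python
# Hypothetical Train-on-Synthetic, Test-on-Real (TSTR) sketch.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

def tstr_auc(synthetic_df, real_test_df, label: str = "label") -> float:
    """Train a classifier on synthetic data, evaluate it on held-out real data."""
    model = RandomForestClassifier(n_estimators=200, random_state=0)
    model.fit(synthetic_df.drop(columns=[label]), synthetic_df[label])
    scores = model.predict_proba(real_test_df.drop(columns=[label]))[:, 1]
    return roc_auc_score(real_test_df[label], scores)

# Running this for synthetic datasets generated at several epsilon values makes
# the privacy/utility trade-off visible for the intended downstream task.
```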
In summary, if differential privacy mechanisms were employed during synthetic data generation, your evaluation should focus on verifying the claimed parameters, understanding the implications of the (ϵ,δ) guarantee, and assessing the resulting trade-off with data utility. While DP provides a formal guarantee, complementing it with empirical checks and utility assessments provides a more complete picture of the synthetic data's privacy profile.