While comparing marginal distributions and basic statistics offers a first glance at fidelity, as discussed earlier, these methods often miss the intricate web of relationships between variables within a dataset. Information theory provides a powerful mathematical framework to quantify uncertainty and dependence, offering more sophisticated tools to compare the structural properties of the real dataset (D_real) and the synthetic dataset (D_synth).
Instead of just asking "Are the means similar?", information-theoretic measures help us ask "Does the synthetic data contain the same amount of information?" or "Are the relationships between variables preserved in the synthetic data?".
Entropy is a fundamental measure of the uncertainty or randomness associated with a random variable. For a discrete random variable X with possible values x_1, ..., x_k and probability mass function P(X), the Shannon entropy H(X) is defined as:

H(X) = -\sum_{i=1}^{k} P(x_i) \log_2 P(x_i)

The logarithm base is typically 2, measuring entropy in bits. Higher entropy implies more uncertainty or less predictability.
In synthetic data evaluation, entropy can be compared feature by feature: if the entropy of a feature computed on the real data and on the synthetic data are close, the synthetic data preserves roughly the same amount of uncertainty (variability) in that feature, as sketched below.
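A minimal sketch of this comparison, assuming the same real_data and synth_data DataFrames used later in this section and a hypothetical categorical column 'Education':

import numpy as np
from scipy.stats import entropy

def shannon_entropy(series, base=2):
    """Estimate the Shannon entropy (in bits) of a discrete/categorical column."""
    # Empirical probability of each observed value
    probs = series.value_counts(normalize=True).to_numpy()
    return entropy(probs, base=base)

# 'Education' is a hypothetical column name used purely for illustration
h_real = shannon_entropy(real_data['Education'])
h_synth = shannon_entropy(synth_data['Education'])
print(f"Entropy (Real, Education):  {h_real:.4f} bits")
print(f"Entropy (Synth, Education): {h_synth:.4f} bits")
print(f"Absolute difference:        {abs(h_real - h_synth):.4f} bits")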
Mutual Information (MI) measures the amount of information obtained about one random variable by observing another. It quantifies the dependency between two variables, capturing non-linear relationships beyond simple correlation. For two discrete variables X and Y, the mutual information I(X;Y) is:
I(X;Y) = \sum_{x \in X} \sum_{y \in Y} P(x,y) \log_2 \left( \frac{P(x,y)}{P(x)P(y)} \right)

It can also be expressed using entropies:

I(X;Y) = H(X) + H(Y) - H(X,Y)

I(X;Y) is always non-negative and is zero if and only if X and Y are independent. A higher MI indicates stronger dependence.
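As a quick sanity check of these properties (a sketch on toy arrays, not on the evaluation datasets), MI estimated from label arrays is near zero for independent variables and clearly positive for a deterministic relationship:

import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)
x = rng.integers(0, 4, size=10_000)        # discrete variable with 4 levels
y_indep = rng.integers(0, 4, size=10_000)  # independent of x
y_dep = (x + 1) % 4                        # deterministic function of x

# Note: mutual_info_score returns MI in nats (natural logarithm)
print(f"MI(x, y_indep): {mutual_info_score(x, y_indep):.4f}  # close to 0")
print(f"MI(x, y_dep):   {mutual_info_score(x, y_dep):.4f}  # close to ln(4) ≈ 1.386")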
Application in Evaluation: A key aspect of statistical fidelity is preserving the relationships between variables. We can compute the MI for pairs of features (X_i, X_j) in both the real and synthetic datasets and compare I(X_i,real; X_j,real) with I(X_i,synth; X_j,synth).
Calculating MI typically requires estimating the joint and marginal probability distributions. For continuous data, this often involves discretization or specialized estimators such as k-nearest-neighbors (KNN) based methods available in libraries such as scikit-learn (sklearn.metrics.mutual_info_score for discrete labels, and sklearn.feature_selection.mutual_info_regression / mutual_info_classif, which use KNN estimation for continuous or mixed types).
import numpy as np
from sklearn.metrics import mutual_info_score
from sklearn.preprocessing import KBinsDiscretizer
# Assume real_data and synth_data are pandas DataFrames
# Example: Compare MI between 'Age' and 'Income' (assuming they are continuous)
# Discretize continuous features for MI calculation using mutual_info_score
n_bins = 10 # Choose an appropriate number of bins
kbd = KBinsDiscretizer(n_bins=n_bins, encode='ordinal', strategy='uniform', subsample=None)
# Fit on real data and transform both real and synthetic
# Important: Fit the discretizer ONLY on the real data
real_data_discrete = kbd.fit_transform(real_data[['Age', 'Income']])
synth_data_discrete = kbd.transform(synth_data[['Age', 'Income']])
# mutual_info_score accepts either discrete labels or a precomputed contingency
# table. Here we estimate the joint distribution P(x, y) over the discretized
# values via a 2D histogram (contingency table).
def calculate_mi(data_col1, data_col2):
    """Calculate MI (in nats) between two discretized columns."""
    contingency_matrix = np.histogram2d(data_col1, data_col2, bins=n_bins)[0]
    # mutual_info_score handles zero counts internally, so no epsilon is needed
    return mutual_info_score(None, None, contingency=contingency_matrix)
mi_real = calculate_mi(real_data_discrete[:, 0], real_data_discrete[:, 1])
mi_synth = calculate_mi(synth_data_discrete[:, 0], synth_data_discrete[:, 1])
print(f"Mutual Information (Real Data, Age vs Income): {mi_real:.4f}")
print(f"Mutual Information (Synth Data, Age vs Income): {mi_synth:.4f}")
print(f"Difference in MI: {abs(mi_real - mi_synth):.4f}")
# Consider doing this for multiple feature pairs
Code snippet demonstrating the comparison of Mutual Information between two features ('Age', 'Income') after discretization in real and synthetic datasets using scikit-learn.
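Alternatively, the KNN-based estimators mentioned above can be applied directly to the continuous columns, avoiding the binning step. The sketch below makes the same assumption about real_data and synth_data; mutual_info_regression returns MI estimates in nats.

from sklearn.feature_selection import mutual_info_regression

def knn_mi(df, feature_x, feature_y, seed=42):
    """Estimate MI (in nats) between two continuous columns using a KNN estimator."""
    X = df[[feature_x]].to_numpy()  # 2D feature matrix with a single column
    y = df[feature_y].to_numpy()
    return mutual_info_regression(X, y, random_state=seed)[0]

mi_real_knn = knn_mi(real_data, 'Age', 'Income')
mi_synth_knn = knn_mi(synth_data, 'Age', 'Income')
print(f"KNN-based MI (Real, Age vs Income):  {mi_real_knn:.4f}")
print(f"KNN-based MI (Synth, Age vs Income): {mi_synth_knn:.4f}")
print(f"Difference in MI: {abs(mi_real_knn - mi_synth_knn):.4f}")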
Divergence measures provide a way to quantify the difference between two probability distributions, P (representing the real data distribution) and Q (representing the synthetic data distribution).
Kullback-Leibler (KL) Divergence: Measures how much information is lost when approximating P with Q. For discrete distributions:

D_{KL}(P \| Q) = \sum_{x} P(x) \log_2 \left( \frac{P(x)}{Q(x)} \right)

D_KL(P||Q) ≥ 0, with equality if and only if P = Q. Note that KL divergence is not symmetric: D_KL(P||Q) generally differs from D_KL(Q||P).
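A small worked example with made-up numbers (not from any dataset): for P = (0.5, 0.5) and Q = (0.9, 0.1),

D_{KL}(P \| Q) = 0.5 \log_2 \frac{0.5}{0.9} + 0.5 \log_2 \frac{0.5}{0.1} \approx 0.5(-0.848) + 0.5(2.322) \approx 0.737 \text{ bits},

while D_KL(Q||P) = 0.9 log2(0.9/0.5) + 0.1 log2(0.1/0.5) ≈ 0.531 bits, illustrating the asymmetry.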
Jensen-Shannon (JS) Divergence: A symmetric measure derived from KL divergence, bounded between 0 and 1 when using base-2 logarithms:

JSD(P \| Q) = \frac{1}{2} D_{KL}(P \| M) + \frac{1}{2} D_{KL}(Q \| M)

where M = \frac{1}{2}(P + Q) is the mixture distribution.
Practical Calculation: Calculating divergences for high-dimensional joint distributions is challenging. Common approaches involve comparing one-dimensional marginal distributions feature by feature (after discretizing continuous features into histograms over shared bins), or estimating divergences on lower-dimensional projections rather than on the full joint distribution.
Libraries like scipy.stats.entropy can compute KL divergence between discrete distributions (represented as probability vectors), and JS divergence can be implemented on top of the same function.
from scipy.stats import entropy
import numpy as np
# Assume p_real and q_synth are 1D probability distributions (histograms)
# e.g., from np.histogram(data_feature, bins=..., density=True)[0]
# Ensure they sum to 1 and have the same bins/support. Add small epsilon to avoid zeros.
epsilon = 1e-10
p_real = p_real + epsilon
p_real /= np.sum(p_real)
q_synth = q_synth + epsilon
q_synth /= np.sum(q_synth)
# KL Divergence (Real || Synth)
kl_div_pq = entropy(p_real, q_synth, base=2)
# KL Divergence (Synth || Real)
kl_div_qp = entropy(q_synth, p_real, base=2)
# JS Divergence
m = 0.5 * (p_real + q_synth)
js_div = 0.5 * (entropy(p_real, m, base=2) + entropy(q_synth, m, base=2))
print(f"KL Divergence (P||Q): {kl_div_pq:.4f}")
print(f"KL Divergence (Q||P): {kl_div_qp:.4f}")
print(f"JS Divergence (P||Q): {js_div:.4f}") # Should be between 0 and 1
Basic calculation of KL and JS divergence between two 1D probability distributions using SciPy. Requires careful preprocessing (normalization, handling zeros).
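The snippet above assumes p_real and q_synth already exist. One way to construct them (a sketch, assuming the same DataFrames and the 'Income' column from the MI example) is to bin both datasets using shared bin edges derived from the real data:

import numpy as np

n_bins = 20
# Shared bin edges from the real data so both histograms have the same support
bin_edges = np.histogram_bin_edges(real_data['Income'], bins=n_bins)

p_real, _ = np.histogram(real_data['Income'], bins=bin_edges)
q_synth, _ = np.histogram(synth_data['Income'], bins=bin_edges)
# Note: synthetic values outside the real data's range fall outside these bins and are dropped

# Convert counts to probabilities (the epsilon smoothing in the snippet above then guards against empty bins)
p_real = p_real.astype(float) / p_real.sum()
q_synth = q_synth.astype(float) / q_synth.sum()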
Information-theoretic measures offer a theoretically grounded way to assess statistical fidelity beyond simple statistics:
Advantages: they capture non-linear dependencies that simple correlations miss, they rest on a well-defined theoretical foundation, and they allow quantitative comparison of how much information and dependency structure the synthetic data shares with the real data.
Challenges: they require estimating probability distributions, which is difficult for high-dimensional joint distributions; continuous features must be discretized or handled with specialized (e.g., KNN-based) estimators; and results can be sensitive to preprocessing choices such as the number of bins.
Despite the challenges, comparing mutual information between feature pairs and using JS divergence on lower-dimensional marginals (or carefully estimated joint distributions) provide valuable insights into how well the synthetic data replicates the statistical structure of the real data, significantly complementing other fidelity assessment techniques.