While the aspiration for disentangled representations is clear (a representation where distinct generative factors in the data map to distinct dimensions of the latent space), quantifying how well a model achieves this goal is a complex task. Simply observing generated samples or latent traversals can provide qualitative insights, but rigorous comparison and optimization require objective, quantitative measures. This section introduces several widely adopted metrics for evaluating disentanglement, primarily in scenarios where ground-truth factors of variation are known.
It's important to preface this discussion by acknowledging a common challenge: most disentanglement metrics rely on access to datasets with annotated, ground-truth factors of variation (e.g., in an image dataset, these could be object shape, color, position, scale). Datasets like dSprites, Shapes3D, and MPI3D were specifically created for this purpose. In many real-world applications, such ground-truth factors are unavailable, making the direct application of these metrics difficult. Nevertheless, they serve as invaluable tools for research and for understanding the properties of models that encourage disentanglement.
The diagram below illustrates the ideal scenario these metrics aim to quantify.
Figure: Idealized disentangled representation. Each ground-truth factor ($g_i$) strongly influences a specific latent dimension ($z_i$), while interactions with other latent dimensions are weak.
Let's examine some of the prominent metrics used to assess this ideal.
Mutual Information Gap (MIG)
The Mutual Information Gap (MIG) score (Chen et al., 2018) aims to quantify the degree to which individual latent dimensions zj capture individual ground-truth factors gk. The core idea is that for a well-disentangled representation, each factor gk should have high mutual information with one specific latent dimension and low mutual information with all others.
Calculation:
- Prerequisite: A dataset with $N$ samples, $D$ latent dimensions $\mathbf{z} = (z_1, \dots, z_D)$, and $K$ known, discrete ground-truth factors $\mathbf{g} = (g_1, \dots, g_K)$.
- Estimate Mutual Information: For each pair of latent dimension zj and ground-truth factor gk, estimate the mutual information I(zj;gk). If zj is continuous, it's typically discretized by binning its empirical distribution. The mutual information is then computed using the joint and marginal probability distributions over the (now discrete) zj and gk.
$$I(z_j; g_k) = \sum_{\mathrm{val}(z_j)} \sum_{\mathrm{val}(g_k)} p(z_j, g_k) \log \frac{p(z_j, g_k)}{p(z_j)\,p(g_k)}$$
- Identify Top Latents per Factor: For each ground-truth factor gk:
- Find the latent dimension $z_{j_1}$ that has the highest mutual information with $g_k$: $j_1 = \arg\max_j I(z_j; g_k)$.
- Find the latent dimension $z_{j_2}$ that has the second highest mutual information with $g_k$ among the remaining latents ($j_2 \neq j_1$).
- Calculate Normalized Gap: The gap for factor gk is the difference in mutual information, normalized by the entropy of the factor H(gk):
$$\mathrm{Gap}_k = \frac{I(z_{j_1}; g_k) - I(z_{j_2}; g_k)}{H(g_k)}$$
Normalization by H(gk) accounts for factors that are inherently easier or harder to predict.
- Average Gaps: The MIG score is the average of these normalized gaps over all K ground-truth factors:
$$\mathrm{MIG} = \frac{1}{K} \sum_{k=1}^{K} \mathrm{Gap}_k$$
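To make these steps concrete, here is a minimal sketch of the MIG computation in Python, assuming discrete, non-constant factor labels and using scikit-learn's MI estimator. The function name `mig_score` and the choice of 20 equal-width bins for discretizing continuous latents are illustrative, not part of the metric's definition.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def mig_score(z, g, n_bins=20):
    """Sketch of MIG. z: (N, D) continuous latent codes;
    g: (N, K) discrete ground-truth factor labels (non-constant)."""
    N, D = z.shape
    K = g.shape[1]

    # Step 1 (estimate MI): discretize each latent into equal-width bins.
    edges = [np.histogram_bin_edges(z[:, j], bins=n_bins)[1:-1] for j in range(D)]
    z_binned = np.stack([np.digitize(z[:, j], edges[j]) for j in range(D)], axis=1)

    gaps = []
    for k in range(K):
        # I(z_j; g_k) for every latent dimension j (in nats).
        mi = np.array([mutual_info_score(g[:, k], z_binned[:, j]) for j in range(D)])
        # Entropy H(g_k) from the factor's empirical distribution (also in nats).
        _, counts = np.unique(g[:, k], return_counts=True)
        p = counts / N
        h_gk = -np.sum(p * np.log(p))
        # Steps 2-3: normalized gap between the top two latents for this factor.
        top2 = np.sort(mi)[-2:]
        gaps.append((top2[1] - top2[0]) / h_gk)

    # Step 4: average over factors.
    return float(np.mean(gaps))
```

Note that the MI estimates and the entropy must use the same logarithm base (here, nats) for the normalization to be meaningful.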
Interpretation:
- A higher MIG score is better, indicating greater disentanglement. A score closer to 1 suggests that for each factor, one latent dimension is significantly more informative than any other.
- A score near 0 implies that factors are either not captured well, or the information is spread across multiple latent dimensions (entangled).
Strengths:
- Intuitive: Directly measures the "informativeness gap" which aligns well with the idea of one-to-one mapping.
- Normalization: Since $I(z_j; g_k) \leq H(g_k)$ for discrete factors, dividing by $H(g_k)$ bounds each gap between 0 and 1 and makes it comparable across factors with different entropies.
Weaknesses:
- Requires ground-truth factors.
- Estimation of mutual information, especially with discretization of continuous latents, can be sensitive to the number of bins and amount of data.
- Only considers the top two latents for each factor, potentially missing more complex entanglement patterns.
Separated Attribute Predictability (SAP)
The Separated Attribute Predictability (SAP) score (Kumar et al., 2017) takes a slightly different approach. Instead of directly measuring mutual information, it assesses how well each ground-truth factor can be predicted from individual latent dimensions using a simple classifier. The "separation" comes from the difference in predictability offered by the best versus the second-best latent dimension for a given factor.
Calculation:
- Prerequisite: Similar to MIG, a dataset with latent codes z and corresponding ground-truth factors g.
- Train Predictors: For every pair of a latent dimension zj and a ground-truth factor gk:
- Train a straightforward classifier (e.g., logistic regression, linear Support Vector Machine) to predict the factor gk using only the values of the single latent dimension zj.
- Evaluate the classifier's performance, typically using classification accuracy. This results in a score matrix S, where Sjk is the accuracy of predicting gk from zj.
- Identify Top Predicting Latents per Factor: For each ground-truth factor gk:
- Find the latent dimension $z_{j_1}$ that yields the highest prediction score for $g_k$: $S_{j_1 k} = \max_j S_{jk}$.
- Find the latent dimension $z_{j_2}$ that yields the second highest prediction score for $g_k$ from the remaining latents.
- Calculate Score Difference: The SAP contribution for factor gk is the difference between these top two scores:
$$\mathrm{SAP}_k = S_{j_1 k} - S_{j_2 k}$$
- Average Differences: The overall SAP score is the average of these differences across all K ground-truth factors:
$$\mathrm{SAP} = \frac{1}{K} \sum_{k=1}^{K} \mathrm{SAP}_k$$
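A minimal sketch of this procedure, assuming discrete factors and using scikit-learn's logistic regression as the simple classifier. For brevity it scores each classifier on the data it was trained on; a faithful implementation would evaluate on a held-out split. The function name `sap_score` is our own.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sap_score(z, g):
    """Sketch of SAP. z: (N, D) latent codes; g: (N, K) discrete factor labels."""
    N, D = z.shape
    K = g.shape[1]

    # Score matrix S[j, k]: accuracy of predicting factor k from latent j alone.
    S = np.zeros((D, K))
    for j in range(D):
        for k in range(K):
            clf = LogisticRegression(max_iter=1000)
            clf.fit(z[:, [j]], g[:, k])              # single-feature classifier
            S[j, k] = clf.score(z[:, [j]], g[:, k])  # simplification: train accuracy

    # Per-factor gap between the best and second-best latent, averaged over factors.
    top2 = np.sort(S, axis=0)[-2:, :]                # shape (2, K): [2nd best; best]
    return float(np.mean(top2[1] - top2[0]))
```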
Interpretation:
- A higher SAP score indicates better disentanglement. It suggests that for each factor, one latent dimension is substantially more predictive than any other single dimension.
- A low SAP score implies that either factors are not well predicted by individual latents, or multiple latents are similarly (and perhaps weakly) predictive.
Strengths:
- Uses a predictive framework, which can be more stable than MI estimation in some cases.
- The concept of "predictability difference" clearly targets the separation aspect.
Weaknesses:
- Requires ground-truth factors.
- The choice of classifier and its hyperparameters can influence the results. The metric's outcome might reflect the chosen classifier's ability as much as the representation's quality.
- It measures predictability rather than direct information content, which are related but not identical.
Disentanglement, Completeness, and Informativeness (DCI)
The DCI framework (Eastwood & Williams, 2018) offers a more multifaceted evaluation by proposing three distinct scores: Disentanglement, Completeness, and Informativeness. These metrics leverage the feature importances derived from training a predictor (often a Lasso regressor or a Random Forest) to predict each ground-truth factor using the entire set of latent dimensions.
- Disentanglement Score:
- Intuition: Measures whether each latent dimension zj focuses on encoding a small number of ground-truth factors. Ideally, a single zj should be important for only one gk.
- Calculation (Simplified):
- For each factor $g_k$, train a predictor (e.g., Lasso) using all latent dimensions $\mathbf{z} = (z_1, \dots, z_D)$ and extract feature importances $R_{jk}$ (the importance of $z_j$ for predicting $g_k$).
- For each latent $z_j$, consider its importance vector across all factors: $(R_{j1}, R_{j2}, \dots, R_{jK})$. Normalize this vector so its elements sum to 1, forming a probability distribution $P_j$.
- The disentanglement score for $z_j$ is $1 - H(P_j)$, where $H(P_j)$ is the entropy of this normalized importance distribution (computed with log base $K$ so the score lies in $[0, 1]$). If $z_j$ is important for only one factor, the distribution is peaked, the entropy is low, and the score is high.
- The overall Disentanglement score is the average of these scores over all "active" latents (those that have non-negligible importance for at least one factor), possibly weighted by how predictive each latent is overall.
- Interpretation: A high DCI-Disentanglement score (closer to 1) indicates that individual latent dimensions tend to specialize in representing single factors of variation.
- Completeness Score:
- Intuition: Measures whether each ground-truth factor gk is primarily captured by a small number of latent dimensions. Ideally, a single gk should be predictable from only one zj.
- Calculation (Simplified):
- Using the same importance matrix Rjk from above.
- For each factor $g_k$, consider its importance vector across all latents: $(R_{1k}, R_{2k}, \dots, R_{Dk})$. Normalize this vector to sum to 1, forming a distribution $Q_k$.
- The completeness score for $g_k$ is $1 - H(Q_k)$, where $H(Q_k)$ is the entropy of this normalized importance distribution (log base $D$). If $g_k$ is mainly explained by one latent, the entropy is low and the score is high.
- The overall Completeness score is the average of these scores over all factors.
- Interpretation: A high DCI-Completeness score (closer to 1) suggests that each ground-truth factor is well-represented by a small subset (ideally one) of latent dimensions, rather than its information being diffusely spread.
- Informativeness Score:
- Intuition: Measures how well the learned latent representation z as a whole can be used to predict the ground-truth factors gk. This is a basic check of whether the representation has captured useful information about the factors at all.
- Calculation: Typically, this is the average prediction performance (e.g., $R^2$ for regression, classification accuracy for classification) when predicting each $g_k$ from the full latent code $\mathbf{z}$.
- Interpretation: A high Informativeness score means the latents are useful for downstream prediction tasks related to the factors. Low informativeness suggests the VAE failed to learn a meaningful representation of these factors.
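The sketch below condenses all three DCI scores, using one random forest per factor and its feature importances as the matrix $R_{jk}$. The entropy bases ($K$ for disentanglement, $D$ for completeness) follow Eastwood & Williams (2018), but the train-set accuracy for informativeness and the function name `dci_scores` are simplifications for illustration; the full protocol uses held-out data and tuned predictors.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def dci_scores(z, g):
    """Sketch of DCI. z: (N, D) latent codes; g: (N, K) discrete factor labels.
    Assumes D > 1 and K > 1 so the entropy normalizations are well defined."""
    N, D = z.shape
    K = g.shape[1]

    R = np.zeros((D, K))   # R[j, k]: importance of latent j for predicting factor k
    acc = np.zeros(K)      # per-factor prediction accuracy (informativeness)
    for k in range(K):
        rf = RandomForestClassifier(n_estimators=100, random_state=0)
        rf.fit(z, g[:, k])
        R[:, k] = rf.feature_importances_
        acc[k] = rf.score(z, g[:, k])   # simplification: train-set accuracy

    eps = 1e-12
    # Disentanglement: 1 - entropy (base K) of each latent's importance profile,
    # weighted by the share of total importance that latent carries.
    P = R / (R.sum(axis=1, keepdims=True) + eps)            # rows sum to ~1
    d_per_latent = 1.0 - (-P * np.log(P + eps)).sum(axis=1) / np.log(K)
    weights = R.sum(axis=1) / (R.sum() + eps)
    disentanglement = float((weights * d_per_latent).sum())

    # Completeness: 1 - entropy (base D) of each factor's importance profile.
    Q = R / (R.sum(axis=0, keepdims=True) + eps)            # columns sum to ~1
    c_per_factor = 1.0 - (-Q * np.log(Q + eps)).sum(axis=0) / np.log(D)
    completeness = float(c_per_factor.mean())

    # Informativeness: average predictive performance from the full latent code.
    informativeness = float(acc.mean())
    return disentanglement, completeness, informativeness
```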
Strengths of DCI:
- Provides a more comprehensive view by separating concerns of disentanglement, completeness, and overall utility.
- Using feature importances from capable predictors like Random Forests or Lasso can capture non-linear relationships and provide robust importance estimates.
Weaknesses of DCI:
- Requires ground-truth factors.
- The choice of predictor model (e.g., Lasso, Random Forest) and its hyperparameters can affect the importance scores and thus the final DCI metrics.
- Can be more computationally intensive and complex to implement than MIG or SAP.
General Remarks on Disentanglement Metrics
While MIG, SAP, and DCI are common, other metrics exist, each with its own nuances. Most share the characteristic of relying on ground-truth factors. It's also worth noting that:
- No Single Perfect Metric: Disentanglement is a multifaceted property, and no single metric perfectly captures all its aspects. Different metrics might rank models differently.
- Sensitivity to Hyperparameters: Many metrics involve choices (number of bins for MI, type of classifier for SAP/DCI) that can influence scores. Consistent evaluation protocols are important.
- Correlation with Qualitative Assessment: Ideally, quantitative metrics should correlate with human judgment of disentanglement based on latent traversals. However, this is not always straightforward.
- Focus on Axis-Alignment: Most of these metrics implicitly favor axis-aligned disentanglement, where each factor aligns with a single coordinate axis in the latent space. More general notions of disentanglement (e.g., factors encoded in linear subspaces) are harder to quantify with these standard tools.
Understanding these metrics is crucial for anyone working on disentangled representation learning. They provide the tools to move beyond qualitative hunches and to rigorously assess progress in developing VAEs and other models that learn more interpretable and structured latent spaces. As you proceed to the hands-on practical later in this chapter, you will have the opportunity to implement some of these metrics to evaluate the VAEs you train.