While metrics like FID and IS provide valuable insights into the overall distribution similarity and sample quality, they don't explicitly probe the structure of the generator's learned latent space. A well-behaved generator should ideally map nearby points in the latent space to images that are also perceptually similar. Abrupt changes in the generated image for tiny steps in the latent code often indicate a poorly structured or "entangled" latent space, which can hinder tasks like smooth image interpolation or attribute manipulation.
The Perceptual Path Length (PPL) metric was introduced specifically to quantify this notion of latent space smoothness. It measures, on average, the perceptual change in the generated image when taking small steps along interpolation paths within the latent space. The core idea is that if the latent space is well-structured, interpolating between two latent codes should result in a correspondingly smooth perceptual transition in the generated images.
How PPL Works
Calculating PPL involves several steps:
- Latent Space Sampling: Randomly sample pairs of latent vectors, z1 and z2, from the generator's input distribution (e.g., a standard Gaussian distribution P(z)).
- Interpolation: Define an interpolation path between these two points. Linear interpolation (lerp) is commonly used:
z(t)=(1−t)z1+tz2
where t ranges from 0 to 1. For Gaussian latents, spherical linear interpolation (slerp) is sometimes preferred, since high-dimensional Gaussian samples concentrate near a hypersphere.
- Stepping and Generation: Take very small steps along this path. For a given t and a small step size ϵ (e.g., ϵ = 10⁻⁴), generate images for two nearby points on the path: G(z(t)) and G(z(t+ϵ)).
- Perceptual Distance: Calculate the perceptual distance between these two generated images. This requires a pre-trained network that captures human perceptual similarity. A common choice is the Learned Perceptual Image Patch Similarity (LPIPS) metric, often based on features extracted from networks like VGG or AlexNet. Let d(⋅,⋅) represent this perceptual distance function.
- Averaging: Compute the distance d(G(z(t)), G(z(t+ϵ))) at many randomly sampled positions t along the path and average the results. Crucially, the formulation scales each distance by 1/ϵ², so that the quantity approximates a squared perceptual derivative along the path. The final PPL is the expected value of these averaged path distances over many randomly sampled pairs (z1, z2).
Mathematically, the core calculation for a single path involves averaging the scaled perceptual distance over many points t:
PPLpath = mean_t[ (1/ϵ²) · d(G(lerp(z1, z2, t)), G(lerp(z1, z2, t+ϵ))) ]
The final PPL score is the average of PPLpath over many initial pairs (z1,z2).
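The procedure above can be sketched end to end in NumPy. Everything here is a stand-in: `toy_generator` is a fixed random linear map rather than a trained GAN, and `perceptual_distance` is plain mean squared error standing in for LPIPS. A `slerp` helper is included because spherical interpolation is often used when interpolating Gaussian latents.

```python
import numpy as np

def lerp(z1, z2, t):
    """Linear interpolation between two latent vectors."""
    return (1.0 - t) * z1 + t * z2

def slerp(z1, z2, t):
    """Spherical interpolation; often preferred for Gaussian latents."""
    a = z1 / np.linalg.norm(z1)
    b = z2 / np.linalg.norm(z2)
    omega = np.arccos(np.clip(np.dot(a, b), -1.0, 1.0))
    if np.isclose(omega, 0.0):
        return lerp(z1, z2, t)  # nearly parallel vectors
    return (np.sin((1 - t) * omega) * z1 + np.sin(t * omega) * z2) / np.sin(omega)

# Stand-in generator: a fixed random linear map from a 64-D latent space
# to a flat 256-pixel "image". A real evaluation would use a trained GAN.
W_PROJ = np.random.default_rng(1).standard_normal((256, 64)) / np.sqrt(64)

def toy_generator(z):
    return W_PROJ @ z

def perceptual_distance(img_a, img_b):
    # Stand-in for LPIPS: mean squared pixel difference.
    return float(np.mean((img_a - img_b) ** 2))

def perceptual_path_length(generator, interp=lerp, dim=64, n_paths=100,
                           n_steps=8, eps=1e-4, seed=0):
    """Monte Carlo estimate of PPL along interpolation paths in Z."""
    rng = np.random.default_rng(seed)
    dists = []
    for _ in range(n_paths):
        z1 = rng.standard_normal(dim)
        z2 = rng.standard_normal(dim)
        for t in rng.uniform(0.0, 1.0 - eps, size=n_steps):
            z_a = interp(z1, z2, t)
            z_b = interp(z1, z2, t + eps)
            d = perceptual_distance(generator(z_a), generator(z_b))
            dists.append(d / eps ** 2)  # 1/eps^2 scaling from the formula
    return float(np.mean(dists))
```

Because `toy_generator` is linear, the 1/ϵ² scaling makes the estimate essentially independent of ϵ; doubling the generator's output scale quadruples the score, since the stand-in distance is squared.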
Interpolation Spaces: Z vs W
For generators like StyleGAN that utilize a mapping network to transform the initial noise vector z into an intermediate latent code w (often residing in a space denoted as W), interpolation can be performed in either the initial Z space or the intermediate W space.
- PPL in Z (PPLz): Interpolation is performed directly between z1 and z2, and the resulting z(t) vectors are fed through the mapping network and then the synthesis network.
- PPL in W (PPLw): First, z1 and z2 are mapped to w1=Mapping(z1) and w2=Mapping(z2). Interpolation is then performed in the W space: w(t)=lerp(w1,w2,t). These interpolated w(t) vectors are then fed into the synthesis network Gsynth(w(t)).
It has been observed, particularly with StyleGAN architectures, that the W space is often more "disentangled" than the Z space. Interpolation in W tends to produce more perceptually linear transitions in the output images. Consequently, PPLw (PPL calculated by interpolating in W) often yields lower (better) scores than PPLz and is considered a more representative measure of the perceptual smoothness achieved by the synthesis network itself.
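The two variants can be illustrated with toy stand-ins, assuming a `tanh` nonlinearity as the "mapping network" and a fixed linear map as the "synthesis network" (real StyleGAN uses an 8-layer MLP and a convolutional synthesis network, and LPIPS rather than the plain MSE used here):

```python
import numpy as np

W_SYNTH = np.random.default_rng(2).standard_normal((128, 32)) / np.sqrt(32)

def mapping(z):
    return np.tanh(z)       # stand-in mapping network: Z -> W

def synthesis(w):
    return W_SYNTH @ w      # stand-in synthesis network: W -> image

def _scaled_dist(img_a, img_b, eps):
    return float(np.mean((img_a - img_b) ** 2)) / eps ** 2

def ppl_z_and_w(dim=32, n_paths=200, n_steps=16, eps=1e-4, seed=0):
    """Estimate PPL interpolating in Z vs interpolating in W."""
    rng = np.random.default_rng(seed)
    dz, dw = [], []
    for _ in range(n_paths):
        z1, z2 = rng.standard_normal(dim), rng.standard_normal(dim)
        w1, w2 = mapping(z1), mapping(z2)  # map the endpoints once
        for t in rng.uniform(0.0, 1.0 - eps, size=n_steps):
            # PPL_z: interpolate in Z, then run mapping + synthesis.
            z_a = (1 - t) * z1 + t * z2
            z_b = (1 - t - eps) * z1 + (t + eps) * z2
            dz.append(_scaled_dist(synthesis(mapping(z_a)),
                                   synthesis(mapping(z_b)), eps))
            # PPL_w: interpolate in W, then run synthesis only.
            w_a = (1 - t) * w1 + t * w2
            w_b = (1 - t - eps) * w1 + (t + eps) * w2
            dw.append(_scaled_dist(synthesis(w_a), synthesis(w_b), eps))
    return float(np.mean(dz)), float(np.mean(dw))
```

In this toy setup the W-space score comes out lower: the lerp path in W has constant velocity, whereas the image of the Z-space path is bent by the nonlinearity, inflating the average squared step.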
Flow of Perceptual Path Length calculation. Interpolation occurs in the latent space, generated images are compared using a perceptual distance metric, and results are averaged.
Interpreting PPL Scores
- Lower is Better: A lower PPL score indicates that small steps in the latent space correspond to small perceptual changes in the generated image. This suggests a smoother, potentially more disentangled latent representation, which is generally desirable.
- Higher is Worse: A high PPL score implies that small perturbations in the latent code can lead to large, abrupt visual changes. This often correlates with visual artifacts or indicates that the latent space structure does not map well to perceptual variations.
PPL is particularly useful for comparing different generator architectures or regularization techniques aimed at improving latent space properties. For instance, the StyleGAN paper used PPL extensively to demonstrate the benefits of the mapping network and of applying styles drawn from the W space.
Limitations
While powerful, PPL has some limitations:
- Computational Cost: It requires generating numerous images and performing pairwise perceptual distance calculations, making it significantly more computationally expensive than FID or IS.
- Dependence on Perceptual Metric: The results are inherently tied to the specific perceptual distance function (d) used (e.g., LPIPS with a VGG backbone). Different metrics might yield different scores or rankings.
- Focus on Smoothness: PPL primarily measures local smoothness along interpolation paths. It doesn't directly quantify overall sample fidelity (realism) or diversity in the same way as FID or IS, although extreme PPL values often correlate with quality issues.
- Sensitivity to ϵ: The choice of the step size ϵ can influence the results, although typically a very small value is used.
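The ϵ-sensitivity point can be checked empirically. The sketch below (a hypothetical `tanh` stand-in generator, MSE in place of LPIPS) evaluates the same Monte Carlo PPL estimate at several step sizes:

```python
import numpy as np

def gen(z):
    # Mildly nonlinear stand-in generator (not a real GAN).
    return np.tanh(z)

def ppl_estimate(eps, n_paths=200, dim=32, seed=0):
    rng = np.random.default_rng(seed)
    dists = []
    for _ in range(n_paths):
        z1, z2 = rng.standard_normal(dim), rng.standard_normal(dim)
        t = rng.uniform(0.0, 1.0 - eps)
        z_a = (1 - t) * z1 + t * z2
        z_b = (1 - t - eps) * z1 + (t + eps) * z2
        dists.append(np.mean((gen(z_a) - gen(z_b)) ** 2) / eps ** 2)
    return float(np.mean(dists))

# For a nonlinear generator the score drifts slightly with eps, which is
# why the step size should be held fixed when comparing models.
scores = {eps: ppl_estimate(eps) for eps in (1e-1, 1e-2, 1e-4)}
```

With a very small ϵ the estimate approaches the true local derivative term, so reported PPL numbers are only comparable when computed with the same step size.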
Despite these points, PPL provides a unique and valuable perspective on GAN performance by directly assessing the structure and perceptual consistency of the generator's learned latent space, complementing metrics that focus on the global properties of the generated distribution.