While visual inspection provides initial intuition, relying solely on subjective assessment is insufficient for evaluating generative models rigorously. We need objective, quantitative measures to compare different models, track training progress, and understand the specific strengths and weaknesses of our generators. Several metrics have been developed, primarily using features extracted by pre-trained image classification networks like Inception. Let's examine some of the most common ones: Inception Score (IS), Fréchet Inception Distance (FID), and Precision/Recall adapted for generative modeling.
Inception Score (IS)
The Inception Score (IS) was one of the first widely adopted quantitative metrics for GAN evaluation, particularly for image generation. It aims to measure two desirable properties of generated samples simultaneously:
- Sample Quality: Generated images should contain meaningful objects. For an image $x$, a good classifier should assign it to a specific class with high confidence. This translates to the conditional class distribution $p(y \mid x)$ having low entropy.
- Sample Diversity: The generator should produce varied images covering different classes. Across all generated images, the marginal distribution $p(y) = \int p(y \mid x)\, p_g(x)\, dx$ (where $p_g$ is the generator's distribution) should have high entropy, indicating that a wide range of classes is represented.
IS combines these by computing, for each generated image $x$, the Kullback-Leibler (KL) divergence between the conditional distribution $p(y \mid x)$ and the marginal distribution $p(y)$, averaging this divergence over all generated images, and exponentiating the result:
$$\mathrm{IS}(G) = \exp\Big( \mathbb{E}_{x \sim p_g}\big[ D_{\mathrm{KL}}\big(p(y \mid x) \,\|\, p(y)\big) \big] \Big)$$
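A useful equivalent form follows from expanding the expected KL divergence, which decomposes into the entropy of the marginal minus the expected entropy of the conditionals:

$$\mathrm{IS}(G) = \exp\Big( H\big(p(y)\big) - \mathbb{E}_{x \sim p_g}\big[ H\big(p(y \mid x)\big) \big] \Big)$$

This makes the two desiderata explicit: high marginal entropy (diversity) and low conditional entropy (confident, meaningful samples) both increase the score.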
To compute IS, a large number of samples are generated and passed through a pre-trained Inception network (typically Inception v3 trained on ImageNet), and the resulting class probabilities are used to estimate $p(y \mid x)$ and $p(y)$. A higher IS is generally considered better.
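To make this concrete, here is a minimal sketch of the computation in NumPy. It assumes `probs` is an (N, num_classes) array of softmax outputs from a pre-trained Inception v3, one row per generated image; the function name is illustrative, and the single-split computation is a simplification (IS is often reported as a mean and standard deviation over several splits of the sample set):

```python
import numpy as np

def inception_score(probs: np.ndarray, eps: float = 1e-12) -> float:
    """IS from an (N, num_classes) array of classifier softmax outputs.

    Each row probs[i] estimates p(y | x_i); the column-wise mean
    estimates the marginal p(y).
    """
    p_y = probs.mean(axis=0, keepdims=True)  # marginal class distribution p(y)
    # Per-sample KL(p(y|x) || p(y)), summed over classes
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)
    # Average the divergence over samples, then exponentiate
    return float(np.exp(kl.mean()))
```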
However, IS has significant limitations:
- It doesn't compare the generated distribution to the real data distribution directly. A generator could achieve a high IS by producing high-quality, diverse images from a distribution completely different from the target dataset.
- It is sensitive to the specific pre-trained classifier used; scores computed with different network weights or implementations are not directly comparable.
- It primarily measures diversity across ImageNet classes, which might not be relevant for datasets with different content.
- It has been shown that IS can be artificially inflated without genuinely improving sample quality.
Due to these drawbacks, while historically important, IS is often supplemented or replaced by metrics like FID.
Fréchet Inception Distance (FID)
The Fréchet Inception Distance (FID) has become a standard metric for evaluating the quality of generated images. Unlike IS, FID explicitly compares the statistics of generated samples to those of real samples. It measures the distance between the distributions of real and generated images in a feature space defined by an intermediate layer of a pre-trained Inception v3 network.
The calculation involves these steps:
- Embed a set of real images ($X_r$) and a set of generated images ($X_g$) into a feature space using a specific layer of the Inception v3 model (often the output of the final average pooling layer, resulting in 2048-dimensional vectors).
- Assume the feature vectors of both the real and generated sets follow multivariate Gaussian distributions, and calculate the mean vectors ($\mu_r$, $\mu_g$) and covariance matrices ($\Sigma_r$, $\Sigma_g$) of the real and generated features, respectively.
- Compute the Fréchet distance (also known as the Wasserstein-2 distance) between these two Gaussian distributions:
$$\mathrm{FID}(X_r, X_g) = \lVert \mu_r - \mu_g \rVert_2^2 + \mathrm{Tr}\Big( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \Big)$$
Here, $\lVert \cdot \rVert_2^2$ denotes the squared Euclidean norm (the sum of squared differences between the elements of the mean vectors), $\mathrm{Tr}$ is the trace of a matrix (the sum of its diagonal elements), and $(\Sigma_r \Sigma_g)^{1/2}$ is the matrix square root of the product of the covariance matrices.
Interpretation:
- A lower FID indicates that the distribution of generated image features is closer to the distribution of real image features, suggesting better quality and diversity similarity. An FID of 0 would mean the distributions are identical in this feature space.
- The first term, $\lVert \mu_r - \mu_g \rVert_2^2$, measures the distance between the means (capturing differences in the average features).
- The second term, involving the trace, measures the discrepancy between the covariance matrices (capturing differences in how features vary and correlate).
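The calculation steps above translate almost directly into NumPy and SciPy. The following is a minimal sketch, assuming `feats_real` and `feats_gen` are arrays of Inception v3 pooling features (e.g., shape (N, 2048)); production implementations add further numerical safeguards around the matrix square root:

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """FID between two sets of feature vectors of shape (N, D)."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)

    # Matrix square root of the covariance product; numerical error can
    # introduce a small imaginary component, which we discard.
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    diff = mu_r - mu_g
    # Squared mean distance plus the covariance (trace) term
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```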
FID addresses several shortcomings of IS:
- It directly compares generated data to real data.
- It is more sensitive to mode collapse (low diversity) than IS.
- It is generally more robust to noise than IS.
Still, FID is not without its considerations:
- It relies on the Inception v3 feature space, which might not be optimal for all datasets or tasks.
- The estimate is biased by the number of samples used; FID scores are only comparable when computed with the same number of real and generated samples (typically 10,000 or 50,000 per set).
- Computing FID can be computationally intensive, especially estimating the covariance matrices and taking the matrix square root for high-dimensional features. You'll get hands-on experience with this calculation later in the chapter.
Precision and Recall
While FID provides a single number summarizing the distance between distributions, it doesn't separately quantify fidelity (quality) and diversity. Borrowing terminology from classification, Precision and Recall metrics were adapted for generative models to provide a more fine-grained analysis.
Imagine the manifold (the underlying structure) of real data and the manifold of generated data within a suitable feature space (e.g., obtained via VGG or Inception embeddings).
- Precision: Measures the fraction of generated samples that fall within the support of the real data manifold. High precision means most generated samples are realistic or plausible according to the real data distribution. It answers: "Are the generated samples high fidelity?"
- Recall: Measures the fraction of the real data manifold that is covered by the generated data manifold. High recall means the generator can produce samples covering the full variety present in the real data. It answers: "Can the generator capture the diversity of the real data?"
Calculating these precisely often involves complex manifold estimation and distance computations in high-dimensional space. Common approaches use k-nearest neighbors (k-NN) in the feature space:
- Embed large sets of real ($N_r$) and generated ($N_g$) samples into the feature space.
- For each generated sample, find its k-nearest neighbors among the real samples. If a generated sample's k-th nearest real neighbor is within a certain distance threshold (or if the sample lies within the estimated real manifold), it contributes to precision.
- Similarly, for each real sample, find its k-nearest neighbors among the generated samples. If a real sample is "covered" by nearby generated samples, it contributes to recall.
The exact definitions and computation methods can vary; see Kynkäänniemi et al. (2019) for a common approach, a minimal version of which is sketched below.
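The sketch below follows that k-NN formulation, assuming `real_feats` and `gen_feats` are (N, D) arrays of embeddings: each reference point defines a hypersphere whose radius is the distance to its own k-th nearest neighbor, and a query point counts as "covered" if it falls inside at least one such hypersphere. The function names are illustrative, and the O(N²) distance matrices would be computed in batches in a real implementation:

```python
import numpy as np

def pairwise_distances(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Euclidean distance matrix between rows of a and rows of b."""
    sq = np.sum(a * a, axis=1)[:, None] + np.sum(b * b, axis=1)[None, :] - 2.0 * a @ b.T
    return np.sqrt(np.maximum(sq, 0.0))  # clamp tiny negatives from rounding

def knn_radii(feats: np.ndarray, k: int) -> np.ndarray:
    """Distance from each point to its k-th nearest neighbor within the same set."""
    d = pairwise_distances(feats, feats)
    d.sort(axis=1)
    return d[:, k]  # index 0 is the zero self-distance, so index k is the k-th neighbor

def coverage(queries: np.ndarray, refs: np.ndarray, radii: np.ndarray) -> float:
    """Fraction of queries that fall inside at least one reference hypersphere."""
    d = pairwise_distances(queries, refs)
    return float(np.mean((d <= radii[None, :]).any(axis=1)))

def precision_recall(real_feats: np.ndarray, gen_feats: np.ndarray, k: int = 3):
    # Precision: generated samples lying inside the estimated real manifold
    precision = coverage(gen_feats, real_feats, knn_radii(real_feats, k))
    # Recall: real samples lying inside the estimated generated manifold
    recall = coverage(real_feats, gen_feats, knn_radii(gen_feats, k))
    return precision, recall
```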
Figure: Illustration of Precision and Recall in a simplified 2D feature space. Real data points are shown as blue circles. Pink crosses represent high precision (realistic samples) but low recall (missing parts of the real distribution). Orange diamonds represent low precision (some unrealistic samples outside the real region) but potentially higher recall (covering more of the real distribution's span). Green stars represent the ideal case: high precision and high recall.
Precision and Recall offer valuable insights:
- A generator might achieve a decent FID but suffer from mode collapse (high precision, low recall).
- Another generator might cover the modes but produce many unrealistic samples (low precision, high recall).
- These metrics help diagnose such trade-offs, guiding model improvements.
In practice, FID remains the most commonly reported single metric, while Precision and Recall offer a more detailed diagnostic view, particularly when analyzing the fidelity-diversity trade-off. Evaluating models using a combination of these quantitative metrics, alongside qualitative visual inspection, provides a more comprehensive understanding of generator performance.