The Fréchet Inception Distance (FID) provides a quantitative measure of the similarity between the distribution of generated images and the distribution of real images, based on features extracted by a pre-trained Inception network. A lower FID score indicates that the two distributions are closer, suggesting the generated images are more similar to the real ones in terms of both visual quality and diversity. However, understanding what a specific FID number means requires context.
Comparing FID Scores
FID scores are most valuable in a comparative sense. An isolated FID score doesn't tell the whole story. Its primary uses are:
- Model Comparison: When evaluating different GAN architectures or variations trained on the same dataset, the model achieving the lower FID is generally considered better. For instance, comparing StyleGAN2 and BigGAN on the FFHQ dataset, the one with the lower FID demonstrates a generated distribution closer to the real face distribution, according to the Inception features.
- Tracking Training Progress: Monitoring FID over training epochs is standard practice. A decreasing FID trend suggests the generator is improving its ability to mimic the real data distribution, while plateaus or increases can indicate training stagnation or instability (see the tracking sketch after this list).
- Ablation Studies: When assessing the impact of specific techniques (e.g., a new loss function component, a normalization layer), comparing the FID with and without the technique provides quantitative evidence of its effectiveness.
- Benchmarking: Comparing your model's FID on a standard dataset (like CIFAR-10, CelebA, LSUN) against reported scores from published research helps gauge its performance relative to the state of the art.
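The tracking workflow is easy to wire into a training loop. Below is a minimal Python sketch; `train_one_epoch` and `compute_fid` are hypothetical placeholders for your own training step and FID routine, not a specific library API.

```python
def track_fid(generator, train_one_epoch, compute_fid, num_epochs=100):
    """Sketch: record FID after every epoch and remember the best value.

    `train_one_epoch` and `compute_fid` are hypothetical callables supplied
    by your own code; `compute_fid` should compare a large generated sample
    (e.g. 10k-50k images) against held-out real images.
    """
    fid_history = []
    best_fid = float("inf")
    for epoch in range(num_epochs):
        train_one_epoch(generator)        # one epoch of GAN updates
        fid = compute_fid(generator)      # lower = generated/real distributions are closer
        fid_history.append(fid)
        best_fid = min(best_fid, fid)
    return fid_history, best_fid
```

A flat or rising tail in `fid_history` is the cue to revisit hyperparameters or check for training instability.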
What Constitutes a "Good" Score?
There's no single threshold for a "good" FID score. It is highly dependent on:
- Dataset Complexity: Generating realistic images for complex datasets like ImageNet or high-resolution faces (FFHQ) is inherently harder and will typically result in higher minimum achievable FID scores compared to simpler datasets like MNIST or CIFAR-10.
- Image Resolution: Higher-resolution generation tasks are harder and generally yield higher FID scores; scores obtained at different resolutions are not directly comparable.
- State-of-the-Art: What's considered "good" evolves as research progresses. A score considered excellent a few years ago might be average today.
As a rough guideline on standard benchmarks (like CelebA 256x256 or FFHQ 1024x1024), scores in the single digits (e.g., 2-5) often represent very high-quality generation, closely matching the real data distribution as perceived by the Inception network. Scores in the 10-30 range might indicate good generation, while scores above 50 often suggest noticeable differences in quality or diversity compared to real images. However, always reference benchmark results specific to the dataset and resolution you are working with.
Figure: FID reduction over training epochs for two hypothetical models trained on the same dataset; Model A converges towards a lower FID than Model B.
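A plot like the one described above can be produced directly from recorded histories. The matplotlib sketch below assumes each history is a list of per-epoch FID values, such as the `fid_history` returned by the tracking sketch earlier.

```python
import matplotlib.pyplot as plt

def plot_fid_histories(histories, labels):
    """Plot per-epoch FID curves for several runs on shared axes."""
    for fid_values, label in zip(histories, labels):
        plt.plot(range(1, len(fid_values) + 1), fid_values, label=label)
    plt.xlabel("Epoch")
    plt.ylabel("FID (lower is better)")
    plt.title("FID over training")
    plt.legend()
    plt.show()
```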
Factors Influencing FID Calculation
When calculating or comparing FID scores, consistency is critical:
- Number of Samples: FID stability depends on using a sufficient number of samples; common practice is to use 10,000 or 50,000 generated and real samples. Using too few samples leads to noisy, unreliable estimates of the distribution statistics (the mean and covariance of the Inception activations). The calculation estimates the means ($\mu_r$, $\mu_g$) and covariance matrices ($\Sigma_r$, $\Sigma_g$) of the Inception activations for real and generated images, respectively. The FID is then:

$$\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \mathrm{Tr}\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right)$$

Accurate estimation of these statistics requires adequate sample sizes; a minimal NumPy sketch of this computation appears after this list.
- Inception Network: Ensure the same pre-trained Inception v3 model implementation is used for all comparisons. Minor differences in implementations or weights can affect the extracted features and thus the final score. Reference implementations often use weights pre-trained on ImageNet.
- Image Preprocessing: Both real and generated images must undergo identical preprocessing (resizing to Inception's expected input size, typically 299x299, and normalization, usually scaling pixel values to the range [-1, 1] or [0, 1]) before being fed into the Inception network. Any discrepancy here invalidates comparisons; the feature-extraction sketch below shows one consistent setup.
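To make the formula concrete, here is a minimal NumPy/SciPy sketch of the statistics estimation and the Fréchet distance itself. It assumes you already have the 2048-dimensional Inception activations for the real and generated sets, and it mirrors the structure of common reference implementations rather than reproducing any particular one.

```python
import numpy as np
from scipy import linalg

def activation_statistics(activations):
    """Mean and covariance of Inception activations with shape (N, 2048)."""
    mu = activations.mean(axis=0)
    sigma = np.cov(activations, rowvar=False)
    return mu, sigma

def frechet_distance(mu_r, sigma_r, mu_g, sigma_g):
    """FID = ||mu_r - mu_g||^2 + Tr(Sigma_r + Sigma_g - 2 (Sigma_r Sigma_g)^{1/2})."""
    diff = mu_r - mu_g
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop tiny imaginary parts from numerical error
    return float(diff @ diff
                 + np.trace(sigma_r) + np.trace(sigma_g)
                 - 2.0 * np.trace(covmean))

# Usage sketch (activations come from your own feature extraction):
# mu_r, sigma_r = activation_statistics(real_activations)       # e.g. 50k samples
# mu_g, sigma_g = activation_statistics(generated_activations)
# fid = frechet_distance(mu_r, sigma_r, mu_g, sigma_g)
```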
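For the activations themselves, the sketch below uses the torchvision Inception v3 as a stand-in for the reference network. Published FID scores typically come from a specific ported Inception checkpoint, so numbers produced with this setup are only comparable to other numbers produced the same way; the normalization shown follows the torchvision ImageNet weights rather than the [-1, 1] convention some implementations use.

```python
import torch
from torchvision import transforms
from torchvision.models import inception_v3, Inception_V3_Weights

# Stand-in feature extractor: torchvision Inception v3 with ImageNet weights.
model = inception_v3(weights=Inception_V3_Weights.IMAGENET1K_V1)
model.fc = torch.nn.Identity()   # expose the 2048-d pooled features instead of logits
model.eval()

# Identical preprocessing for real and generated images: resize to 299x299,
# convert to [0, 1] tensors, then apply the normalization these weights expect.
preprocess = transforms.Compose([
    transforms.Resize((299, 299)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def inception_activations(pil_images):
    """Extract an (N, 2048) activation matrix from a list of PIL images."""
    batch = torch.stack([preprocess(img) for img in pil_images])
    return model(batch).cpu().numpy()
```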
Limitations to Consider
While widely adopted, FID is not a perfect metric. Keep in mind:
- Perceptual Alignment: FID relies on features from Inception v3, trained for image classification. While these features capture relevant statistics, they might not perfectly align with human perception of image quality or specific types of artifacts. A model could potentially achieve a good FID score while still exhibiting subtle, repetitive patterns or other flaws noticeable to humans.
- Sensitivity: FID does penalize failure modes such as mode collapse (low diversity), and generally does so more reliably than the Inception Score, but it only compares the mean and covariance of the two feature distributions and may not fully capture outlier quality or fine-grained details. For example, if a generator produces perfect images but only covers half the modes of the real data, the FID score will increase, yet the single number may not convey the severity of the mode collapse relative to other possible issues.
- No Insight into Specific Flaws: A single FID number doesn't tell you why a model is performing poorly. It doesn't distinguish between low quality (blurry images, artifacts) and low diversity (mode collapse). Qualitative assessment remains necessary to understand the nature of the generation imperfections.
In summary, FID is a powerful tool for quantitative GAN evaluation, particularly effective for comparing models and tracking training progress. Interpret scores relatively, ensure consistent calculation methodology, and always complement quantitative metrics with qualitative visual inspection of the generated samples to get a complete picture of your GAN's performance.