While the Inception Score (IS) provides a quantitative measure, it primarily focuses on the properties of the generated images themselves (clarity and diversity based on Inception classifications) without directly comparing their distribution to the real data distribution. This can sometimes be misleading, especially if the generator produces high-quality but non-diverse samples (mode collapse) that happen to fool the Inception classifier well for a limited set of classes.
The Fréchet Inception Distance (FID) addresses this limitation by directly comparing the statistics of features extracted from real images to those extracted from generated images. It provides a more principled measure of the distance between these two distributions in a high-dimensional feature space.
The core idea behind FID is to represent both the set of real images and the set of generated images using embeddings produced by a pre-trained deep convolutional network, typically the Inception v3 model. Instead of using the final classification output, FID uses the activations from an intermediate layer (commonly the final average pooling layer before the classification head). This layer captures rich, high-level visual features.
Here's the process:
Feature Extraction: A large batch of real images (Nr) and a large batch of generated images (Ng) are passed through the pre-trained Inception v3 network. The activations from the chosen intermediate layer are collected for each image. This results in two sets of feature vectors: one set for the real images (Xr) and one for the generated images (Xg). Each feature vector resides in a high-dimensional space (e.g., 2048 dimensions for the typical Inception v3 layer).
Distribution Modeling: FID assumes that these high-dimensional feature vectors for both the real and generated images can be reasonably modeled by multivariate Gaussian distributions. This is a simplification, but it works well in practice.
Calculating Statistics: The mean vector and the covariance matrix are calculated for each set of feature vectors: μr and Σr for the real features, and μg and Σg for the generated features.
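With the feature vectors in hand, these statistics are one-liners in NumPy. The feature matrices below are random stand-ins for the extracted Inception activations (rows are images, columns are the 2048 feature dimensions):

```python
import numpy as np

# Stand-in feature matrices: in practice these would be the Inception
# activations for the real and generated image batches.
rng = np.random.default_rng(0)
feats_real = rng.normal(size=(64, 2048))
feats_gen = rng.normal(loc=0.1, size=(64, 2048))

mu_r = feats_real.mean(axis=0)              # mean vector, shape (2048,)
sigma_r = np.cov(feats_real, rowvar=False)  # covariance, shape (2048, 2048)

mu_g = feats_gen.mean(axis=0)
sigma_g = np.cov(feats_gen, rowvar=False)
```

`rowvar=False` tells NumPy that each row is an observation (an image) and each column a variable (a feature dimension).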
The Fréchet distance is a measure of distance between two multivariate Gaussian distributions. The FID score is calculated using the means and covariance matrices of the Inception features as follows:
FID(Xr, Xg) = ||μr − μg||₂² + Tr(Σr + Σg − 2(ΣrΣg)^(1/2))

Let's break down this formula:
||μr − μg||₂²: This is the squared Euclidean distance (or L2 norm squared) between the mean feature vectors of the real (μr) and generated (μg) images. It measures how much the average features differ between the two sets. A smaller distance indicates that, on average, generated images have similar high-level features to real images.
Tr(…): This term involves the trace (sum of diagonal elements) of a combination of the covariance matrices (Σr and Σg). The covariance matrix captures the spread and correlation between different feature dimensions. This term measures the distance between the covariance structures of the two distributions. (ΣrΣg)^(1/2) represents the matrix square root of the product of the covariance matrices. A smaller trace value indicates that the spread and correlations of features in the generated images are similar to those in the real images. This part is particularly important for capturing the diversity of the generated samples relative to the real data.
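Putting the two terms together, a minimal sketch of the FID formula using SciPy's matrix square root might look like this. Toy 16-dimensional features stand in for the 2048-dimensional Inception activations to keep the computation quick:

```python
import numpy as np
from scipy import linalg

def fid(mu_r, sigma_r, mu_g, sigma_g):
    """Fréchet distance between N(mu_r, sigma_r) and N(mu_g, sigma_g)."""
    diff = mu_r - mu_g
    # Matrix square root of the covariance product; it can pick up a tiny
    # imaginary component from numerical error, which we discard.
    covmean = linalg.sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    return diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean)

# Toy 16-dimensional "features" (real FID uses 2048-d Inception features).
rng = np.random.default_rng(0)
a = rng.normal(size=(5000, 16))
b = rng.normal(loc=0.5, size=(5000, 16))

stats = lambda x: (x.mean(axis=0), np.cov(x, rowvar=False))
score_same = fid(*stats(a), *stats(a))   # identical sets -> near 0
score_diff = fid(*stats(a), *stats(b))   # shifted means -> clearly larger

print(score_same, score_diff)
```

Comparing a feature set to itself yields a score near zero, while shifting the mean of one set by 0.5 in every dimension pushes the score up by roughly ||μr − μg||₂², as the formula predicts.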
The FID score is a non-negative value; lower scores indicate that the generated feature distribution is closer to the real one, and a score of 0 means the two fitted Gaussians have identical means and covariances.
Diagram illustrating FID as a measure of distance between the distributions of real and generated image features (modeled as Gaussians) in the Inception feature space. FID considers both the difference in means (μr,μg) and covariance matrices (Σr,Σg).
Compared to the Inception Score, FID is generally considered more robust. It is sensitive to both mode collapse (which affects Σg) and image artifacts (which affect both μg and Σg). It also provides a more direct comparison between the generated distribution and the target real distribution. However, remember that FID depends on the choice of the pre-trained model (Inception v3) and the specific feature layer used. Furthermore, calculating a stable FID score requires a reasonably large number of samples (typically 10,000 or more; 50,000 is often recommended) from both the real and generated sets to get reliable estimates of the mean and covariance.
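The sample-size caveat can be seen directly: computing FID between two sample sets drawn from the same distribution should give 0, yet the small-sample estimate is biased noticeably upward. The sketch below again uses toy 16-dimensional features and the FID formula from above:

```python
import numpy as np
from scipy import linalg

def fid(mu1, sigma1, mu2, sigma2):
    # Fréchet distance between two Gaussians; the real part is taken to
    # discard numerical noise from the matrix square root.
    covmean = linalg.sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu1 - mu2
    return diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean)

stats = lambda x: (x.mean(axis=0), np.cov(x, rowvar=False))

# Both comparison sets come from the SAME 16-d Gaussian as the reference,
# so the true FID is 0; any positive score is pure estimation error.
rng = np.random.default_rng(0)
ref = rng.normal(size=(20000, 16))
fid_small = fid(*stats(ref), *stats(rng.normal(size=(100, 16))))
fid_large = fid(*stats(ref), *stats(rng.normal(size=(20000, 16))))

print(fid_small, fid_large)  # the small-sample estimate is noticeably larger
```

The bias grows with the feature dimensionality, which is why the effect is far more pronounced with the real 2048-dimensional Inception features, and why FID scores are only comparable when computed with the same number of samples.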
© 2025 ApX Machine Learning