Evaluating generative models that produce outputs based on specific conditions, such as class labels, text descriptions, or other guiding inputs, requires specialized approaches. While metrics like FID and IS assess overall sample quality and diversity, they don't directly measure whether the generated output faithfully adheres to the provided condition. Evaluating conditional models therefore involves assessing two primary aspects: conditional consistency (does the output actually match the given condition?) and per-condition quality and diversity (are the samples generated for each condition realistic and varied?).
Let's examine techniques for measuring these aspects.
The goal here is to quantify how well the generated output matches the input condition. The methods vary depending on the type of condition.
For models conditioned on class labels (e.g., generating images of specific dog breeds), a common approach is to use a pre-trained classifier: generate samples for each target class, pass them through a classifier trained on real data, and measure how often the predicted label matches the conditioning label.
This is sometimes referred to as the Classification Accuracy Score (CAS) in the literature. A variation involves calculating the KL divergence between the predicted class distribution for generated samples of a target class and a one-hot vector representing that class. Lower KL divergence indicates better alignment.
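As a rough sketch of this approach, the snippet below assumes a hypothetical `pretrained_classifier` (a standard PyTorch model trained on real data) and a batch of images generated under a single target class. It computes the classification accuracy and the KL divergence between the one-hot target and the average predicted distribution; the latter reduces to the negative log-probability assigned to the target class.

```python
import torch
import torch.nn.functional as F

def class_consistency_metrics(classifier, generated_images, target_class):
    """Classification accuracy and KL(one-hot || predicted) for one target class.

    classifier: pre-trained model returning logits of shape (N, num_classes).
    generated_images: batch of images generated with `target_class` as the condition.
    """
    classifier.eval()
    with torch.no_grad():
        logits = classifier(generated_images)            # (N, num_classes)
        probs = F.softmax(logits, dim=-1)

    # Fraction of samples the classifier assigns to the conditioning class.
    accuracy = (logits.argmax(dim=-1) == target_class).float().mean().item()

    # Average predicted distribution over the batch for this condition.
    avg_probs = probs.mean(dim=0)

    # KL(one-hot || avg_probs) simplifies to -log p(target), because the
    # one-hot distribution places all of its mass on the target class.
    kl = -torch.log(avg_probs[target_class].clamp_min(1e-12)).item()

    return accuracy, kl

# Usage (assuming a hypothetical classifier and a batch conditioned on class 3):
# acc, kl = class_consistency_metrics(pretrained_classifier, fake_batch, target_class=3)
```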
However, be mindful of potential issues: the score is capped by the classifier's own accuracy and inherits its biases; a generator can exploit the classifier's weaknesses, producing samples that are labeled correctly yet look unrealistic; and high accuracy alone says nothing about diversity within a class.
For models generating images from text prompts, evaluating conditional consistency requires measuring the semantic alignment between the text and the image. The most prevalent metric for this is the CLIP Score.
CLIP (Contrastive Language–Image Pre-training) is a model trained by OpenAI on a massive dataset of image-text pairs. It learns a shared embedding space where corresponding images and text descriptions have high cosine similarity.
To calculate the CLIP Score: encode each generated image with CLIP's image encoder and its prompt with CLIP's text encoder, compute the cosine similarity between the two embeddings (commonly scaled by 100), and average this similarity over the evaluation set.
A higher average CLIP score indicates better alignment between the generated images and their corresponding text prompts. While widely used, CLIP itself has limitations and biases inherited from its training data, which can influence the score.
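A minimal sketch of this computation, assuming the Hugging Face `transformers` library and the public `openai/clip-vit-base-patch32` checkpoint, might look like the following.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(images, prompts):
    """Average cosine similarity between CLIP image and text embeddings.

    images: list of PIL images produced by the generator.
    prompts: list of the text prompts used to generate them (same order).
    """
    inputs = processor(text=prompts, images=images,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])

    # Normalize both embeddings, then take the per-pair cosine similarity.
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    similarity = (image_emb * text_emb).sum(dim=-1)

    # CLIP Score is often reported scaled by 100.
    return (100.0 * similarity).mean().item()
```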
For other types of conditions (e.g., generating an image based on a segmentation map, or style transfer based on a reference image), evaluation often relies on domain-specific metrics, such as re-segmenting the generated image and comparing the result to the conditioning map with mean Intersection over Union (mIoU).
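As one hedged example of such a domain-specific check for segmentation-conditioned generation, the sketch below computes mIoU between the conditioning map and a re-predicted segmentation of the generated image. The `segmentation_model` in the usage comment is a hypothetical, separately trained network; only the IoU arithmetic itself is shown here.

```python
import torch

def mean_iou(pred_map, cond_map, num_classes):
    """Mean IoU between a predicted segmentation of the generated image and
    the conditioning segmentation map (both tensors of class indices, same shape)."""
    ious = []
    for c in range(num_classes):
        pred_c = (pred_map == c)
        cond_c = (cond_map == c)
        union = (pred_c | cond_c).sum().item()
        if union == 0:
            continue  # class absent from both maps; skip it
        intersection = (pred_c & cond_c).sum().item()
        ious.append(intersection / union)
    return sum(ious) / len(ious) if ious else float("nan")

# Usage (assuming a hypothetical `segmentation_model` that labels the generated image):
# pred_map = segmentation_model(generated_image).argmax(dim=1).squeeze(0)
# score = mean_iou(pred_map, conditioning_map, num_classes=35)
```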
Beyond checking if the output matches the condition, we need to ensure the quality and diversity within each condition are satisfactory. A model might produce excellent images for one class but poor ones for another, or it might suffer from mode collapse only for certain conditions.
Standard evaluation metrics like FID, Precision, Recall, and KID can be adapted for conditional assessment by computing them separately for each condition rather than over the full dataset.
For instance, you would compute an FID score for each class, comparing the real images belonging to that class against the generated images conditioned on it, and then report the per-class scores or their average.
This provides a more granular view than a single global FID score. It can reveal if the model performs unevenly across different conditions or suffers from intra-class mode collapse (low diversity for a specific class).
The chart shows hypothetical FID scores calculated separately for images generated under four different class conditions. Class C exhibits a significantly higher FID, indicating lower quality or diversity for generated samples belonging to that specific class compared to the others.
Similarly, precision and recall can be calculated per condition to understand if the generator covers the diversity of real samples within that condition (recall) and if the generated samples are realistic for that condition (precision).
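The sketch below illustrates the per-class FID idea using the `FrechetInceptionDistance` metric from `torchmetrics` (assumed to be installed). The inputs `real_images_by_class` and `fake_images_by_class` are hypothetical dictionaries mapping each class label to a batch of `uint8` image tensors; keep in mind that FID estimates become unreliable when a class has only a small number of samples.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

def per_class_fid(real_images_by_class, fake_images_by_class):
    """Compute an FID score separately for each conditioning class.

    Both arguments map class label -> uint8 tensor of shape (N, 3, H, W)
    with pixel values in [0, 255], matching torchmetrics' default settings.
    """
    scores = {}
    for label, real_imgs in real_images_by_class.items():
        fid = FrechetInceptionDistance(feature=2048)   # fresh metric per class
        fid.update(real_imgs, real=True)
        fid.update(fake_images_by_class[label], real=False)
        scores[label] = fid.compute().item()
    return scores

# A high score for one class (e.g., "Class C" in the chart above) flags poor
# quality or low diversity for that condition, even if the global FID looks fine.
```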
In practice, evaluating conditional generative models often involves reporting a combination of metrics: a global score such as FID or KID for overall sample quality, a consistency measure such as classification accuracy or CLIP Score for adherence to the condition, and per-condition FID, precision, and recall to expose uneven performance across conditions.
This multi-faceted approach provides a more comprehensive understanding of the model's capabilities and weaknesses, guiding further development and optimization efforts. Remember that choosing the right evaluation metrics depends heavily on the specific task and the nature of the conditions involved.