BLEU: a Method for Automatic Evaluation of Machine Translation, Kishore Papineni, Salim Roukos, Todd Ward, Wei-Jing Zhu, 2002. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (Association for Computational Linguistics). DOI: 10.3115/1073083.1073135 - Introduces the widely used BLEU metric for evaluating machine-generated text, a core component for assessing image captioning models.
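The metric is easy to try out in practice; below is a minimal sketch using NLTK's BLEU implementation, assuming the nltk package is available. The captions are made-up placeholders, not data from the paper.

```python
# Minimal BLEU sketch with NLTK; captions are illustrative placeholders.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# One hypothesis per image; each hypothesis has a list of acceptable
# reference token lists.
references = [
    [["a", "dog", "runs", "on", "the", "beach"],
     ["a", "dog", "is", "running", "along", "the", "beach"]],
]
hypotheses = [["a", "dog", "runs", "along", "the", "beach"]]

# Smoothing avoids zero scores when a higher-order n-gram has no match,
# which happens often with short captions.
bleu4 = corpus_bleu(references, hypotheses,
                    smoothing_function=SmoothingFunction().method1)
print(f"BLEU-4: {bleu4:.3f}")
```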
CIDEr: Consensus-based Image Description Evaluation, Ramakrishna Vedantam, C. Lawrence Zitnick, Devi Parikh, 2015. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). DOI: 10.1109/CVPR.2015.7298926 - Presents CIDEr, a metric specifically designed for evaluating image captions by measuring consensus with human descriptions, directly relevant to the section.
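For intuition about what "consensus" means here, the following is a simplified, illustrative reimplementation of the CIDEr idea: TF-IDF weighted n-gram cosine similarity with the references, averaged over n-gram orders. The function name cider_like and the toy captions are assumptions for illustration; the sketch omits the stemming, length penalty, and other details of the official CIDEr-D scorer, so its numbers will not match published results.

```python
# Simplified CIDEr-style scorer: TF-IDF weighted n-gram cosine similarity
# between a candidate caption and its references, averaged over n = 1..4.
# Illustrative only; not the official CIDEr-D implementation.
from collections import Counter
from math import log, sqrt

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def cider_like(candidates, references, max_n=4):
    """candidates: one token list per image.
    references: a list of reference token lists per image."""
    num_images = len(references)
    per_n_scores = []
    for n in range(1, max_n + 1):
        # Document frequency: number of images whose reference set contains the n-gram.
        df = Counter()
        for refs in references:
            df.update({g for ref in refs for g in ngrams(ref, n)})

        def tfidf(tokens):
            counts = Counter(ngrams(tokens, n))
            # Rare n-grams get high weight; unseen ones keep the maximal IDF.
            return {g: c * (log(num_images) - log(max(df[g], 1)))
                    for g, c in counts.items()}

        def cosine(a, b):
            dot = sum(w * b.get(g, 0.0) for g, w in a.items())
            norm_a = sqrt(sum(w * w for w in a.values()))
            norm_b = sqrt(sum(w * w for w in b.values()))
            return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

        total = 0.0
        for cand, refs in zip(candidates, references):
            cand_vec = tfidf(cand)
            total += sum(cosine(cand_vec, tfidf(ref)) for ref in refs) / len(refs)
        per_n_scores.append(total / num_images)

    return 10.0 * sum(per_n_scores) / max_n  # CIDEr averages over n and scales by 10

# Toy usage with two images and two references each (made-up captions).
cands = [["a", "dog", "runs", "on", "the", "beach"],
         ["two", "people", "ride", "bikes"]]
refs = [[["a", "dog", "is", "running", "on", "the", "beach"],
         ["a", "dog", "runs", "along", "the", "sand"]],
        [["two", "people", "are", "riding", "bicycles"],
         ["a", "pair", "of", "cyclists", "ride", "down", "a", "road"]]]
print(f"CIDEr-like score: {cider_like(cands, refs):.3f}")
```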
Deep Learning, Ian Goodfellow, Yoshua Bengio, Aaron Courville, 2016 (MIT Press) - A foundational textbook covering various aspects of deep learning, including explanations of common classification evaluation metrics like accuracy, precision, recall, and F1-score.
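As a quick reference for those metrics, here is a short sketch computing them with scikit-learn, assuming the package is installed; the label arrays are made up for illustration.

```python
# Standard classification metrics on made-up labels (illustrative only).
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # ground-truth labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # model predictions

print("accuracy :", accuracy_score(y_true, y_pred))   # (TP + TN) / total
print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("recall   :", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("f1       :", f1_score(y_true, y_pred))         # harmonic mean of precision and recall
```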
VQA: Visual Question Answering, Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, Devi Parikh, 2015. Proceedings of the IEEE International Conference on Computer Vision (ICCV) (IEEE). DOI: 10.1109/ICCV.2015.279 - Introduces the benchmark VQA dataset and task, providing context for the evaluation of VQA models.
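The paper's consensus accuracy for open-ended answers is commonly summarized as min(#annotators giving the answer / 3, 1). A minimal sketch of that simplified form follows; the official evaluation additionally averages over subsets of nine annotators and normalizes answer strings, which is omitted here, and the function name and example answers are illustrative.

```python
# Simplified VQA consensus accuracy: an answer counts as fully correct if
# at least 3 of the 10 human answers agree with it.
def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
    matches = sum(1 for ans in human_answers if ans == predicted)
    return min(matches / 3.0, 1.0)

# Illustrative example (not taken from the dataset).
humans = ["yes"] * 8 + ["no"] * 2
print(vqa_accuracy("yes", humans))    # 1.0   (8 matches, capped at 1)
print(vqa_accuracy("no", humans))     # ~0.67 (2 matches -> 2/3)
print(vqa_accuracy("maybe", humans))  # 0.0
```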