Introduction to Information Retrieval, Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, 2008 (Cambridge University Press) - A foundational textbook covering core information retrieval concepts, including standard evaluation metrics like Precision, Recall, F1-Score, MRR, and NDCG.
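As an illustration of the set-based and rank-based metrics the book defines, here is a minimal sketch (not code from the book) of Precision, Recall, F1, and the per-query reciprocal rank whose mean over queries gives MRR:

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based Precision, Recall, and F1 for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)  # true positives: relevant docs we retrieved
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)  # harmonic mean of P and R
    return precision, recall, f1

def reciprocal_rank(ranked, relevant):
    """1/rank of the first relevant result (0 if none appears).
    Averaging this over a query set yields MRR."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

# Toy example: system returns d1, d2, d3; only d2 and d4 are relevant.
p, r, f = precision_recall_f1(["d1", "d2", "d3"], ["d2", "d4"])
rr = reciprocal_rank(["d1", "d2", "d3"], {"d2", "d4"})
```

NDCG additionally discounts gains by log of rank position; the same ranked-list inputs would apply.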
BLEU: a Method for Automatic Evaluation of Machine Translation, Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu, 2002, Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (Association for Computational Linguistics), DOI: 10.3115/1073083.1073135 - Introduces BLEU, a widely used metric for evaluating the quality of machine-translated text by comparing it to reference translations.
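The core idea of BLEU can be sketched as clipped n-gram precision combined with a brevity penalty. This is a simplified single-sentence, single-reference version for illustration; the paper computes the statistics at corpus level:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU sketch: clipped n-gram precisions (n=1..max_n),
    geometric mean, brevity penalty. No smoothing, as in the original paper,
    so any zero n-gram match yields a score of 0."""
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        ref_counts = Counter(ngrams(reference, n))
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if total == 0 or clipped == 0:
            return 0.0
        precisions.append(clipped / total)
    # Brevity penalty: penalize candidates shorter than the reference.
    if len(candidate) > len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

A perfect match scores 1.0; any sentence missing a 4-gram overlap scores 0 in this unsmoothed form, which is one reason BLEU is normally reported over whole corpora.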
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks, Nils Reimers and Iryna Gurevych, 2019, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) (Association for Computational Linguistics), DOI: 10.18653/v1/D19-1410 - Presents Sentence-BERT, a modification of BERT that yields semantically meaningful sentence embeddings useful for tasks like semantic similarity and clustering.
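Sentence-BERT embeddings are designed to be compared with cosine similarity. The sketch below uses hand-made toy vectors in place of real model output; in practice the vectors would come from an encoder such as the `sentence-transformers` package:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors (1.0 = identical direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-d "embeddings" standing in for model output; real SBERT vectors
# are typically several hundred dimensions.
emb_a = [0.2, 0.8, 0.1]
emb_b = [0.25, 0.75, 0.05]
sim = cosine_similarity(emb_a, emb_b)  # close to 1: near-parallel vectors
```

Because the two sentences are encoded independently, pairwise similarity over n sentences needs only n encoder passes, which is the efficiency gain the paper emphasizes over cross-encoder BERT.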
Holistic Evaluation of Language Models, Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, et al., 2023, Transactions on Machine Learning Research (TMLR), DOI: 10.48550/arXiv.2211.09110 - Proposes a broad framework for evaluating language models across diverse metrics, scenarios, and human preferences, addressing limitations of traditional metrics.