BLEU: a Method for Automatic Evaluation of Machine Translation, Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu, 2002. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (Association for Computational Linguistics). DOI: 10.3115/1073083.1073135 - This influential paper introduces BLEU, a widely used automatic metric for evaluating machine-generated text against reference translations. BLEU scores a candidate by its clipped n-gram precision with respect to the references, combined with a brevity penalty that discourages overly short outputs; a minimal sketch follows below.
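The sketch below is a simplified, self-contained illustration of the two ingredients the paper combines: clipped (modified) n-gram precision and a brevity penalty. It is not the reference implementation, and the example sentences and function names are illustrative only; corpus-level BLEU and smoothing variants are omitted.

```python
from collections import Counter
import math

def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram count is capped by the
    maximum count of that n-gram observed in any single reference."""
    cand_ngrams = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    if not cand_ngrams:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        for ngram, count in ref_ngrams.items():
            max_ref_counts[ngram] = max(max_ref_counts[ngram], count)
    clipped = sum(min(count, max_ref_counts[ngram]) for ngram, count in cand_ngrams.items())
    return clipped / sum(cand_ngrams.values())

def bleu(candidate, references, max_n=4):
    """Geometric mean of clipped 1..max_n-gram precisions times a brevity penalty."""
    precisions = [modified_precision(candidate, references, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty uses the reference whose length is closest to the candidate's.
    closest_ref = min(references, key=lambda r: abs(len(r) - len(candidate)))
    bp = 1.0 if len(candidate) > len(closest_ref) else math.exp(1 - len(closest_ref) / len(candidate))
    return bp * geo_mean

candidate = "the cat sat on the mat".split()
references = ["the cat is sitting on the mat".split(),
              "a cat sat on the mat".split()]
print(f"BLEU-4: {bleu(candidate, references):.3f}")
```

In practice an established implementation such as sacrebleu or nltk.translate.bleu_score is preferable, since the paper's corpus-level aggregation and tokenization details matter for comparable scores.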
ROUGE: A Package for Automatic Evaluation of Summaries, Chin-Yew Lin, 2004. Text Summarization Branches Out (Association for Computational Linguistics). DOI: 10.3115/1621300.1621317 - This paper presents ROUGE, a family of recall-oriented metrics for evaluating automatic summarization and other text generation tasks. It measures overlap with reference summaries through n-grams (ROUGE-N) and longest common subsequences (ROUGE-L); a short sketch follows below.
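As a rough illustration of the two overlap measures named in the entry, the sketch below computes ROUGE-N recall and ROUGE-L recall for pre-tokenized sentences. The example data and function names are illustrative; the official package also reports precision and F-measure and handles stemming and stopword options not shown here.

```python
from collections import Counter

def rouge_n(candidate, reference, n=1):
    """ROUGE-N recall: fraction of reference n-grams also present in the candidate,
    with counts clipped so repeated n-grams are not over-credited."""
    ref_ngrams = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    cand_ngrams = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    overlap = sum(min(count, cand_ngrams[ngram]) for ngram, count in ref_ngrams.items())
    return overlap / max(sum(ref_ngrams.values()), 1)

def lcs_length(a, b):
    """Length of the longest common subsequence, via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_recall(candidate, reference):
    """ROUGE-L recall: LCS length divided by the reference length."""
    return lcs_length(candidate, reference) / max(len(reference), 1)

candidate = "the cat was found under the bed".split()
reference = "the cat was under the bed".split()
print(f"ROUGE-1 recall: {rouge_n(candidate, reference, 1):.3f}")
print(f"ROUGE-2 recall: {rouge_n(candidate, reference, 2):.3f}")
print(f"ROUGE-L recall: {rouge_l_recall(candidate, reference):.3f}")
```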
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena, Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, Ion Stoica, 2023. NeurIPS 2023 Datasets and Benchmarks Track. DOI: 10.48550/arXiv.2306.05685 - This research studies the reliability of using strong large language models as evaluators ("LLM-as-a-Judge") of other LLMs. It introduces the MT-Bench and Chatbot Arena benchmarks, measures agreement between LLM judges and human preferences, and examines judge biases such as position and verbosity bias; a hedged sketch of pairwise judging follows below.
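The sketch below illustrates one pairwise-comparison setup in the spirit of this line of work: a judge prompt, a structured verdict, and a swapped-order second pass to reduce position bias. The prompt wording, the `call_llm` stub, and the helper names are hypothetical placeholders, not the paper's exact protocol; substitute a real chat-completion client for `call_llm`.

```python
import re

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for any chat-completion API; plug in a real client here."""
    raise NotImplementedError("Connect this to your LLM provider.")

JUDGE_PROMPT = """You are an impartial judge. Compare the two assistant answers
to the user question below, considering helpfulness, accuracy, and level of detail.
Do not let the order of the answers influence your decision.

[Question]
{question}

[Answer A]
{answer_a}

[Answer B]
{answer_b}

Output your verdict strictly as one of: "[[A]]", "[[B]]", or "[[C]]" for a tie."""

def extract_verdict(response: str) -> str:
    """Pull the [[A]]/[[B]]/[[C]] token out of the judge's reply; default to a tie."""
    match = re.search(r"\[\[([ABC])\]\]", response)
    return match.group(1) if match else "C"

def pairwise_judge(question: str, answer_a: str, answer_b: str) -> str:
    """Judge twice with the answer order swapped and only count a win when both
    orderings agree, a common mitigation for position bias."""
    first = extract_verdict(call_llm(
        JUDGE_PROMPT.format(question=question, answer_a=answer_a, answer_b=answer_b)))
    second = extract_verdict(call_llm(
        JUDGE_PROMPT.format(question=question, answer_a=answer_b, answer_b=answer_a)))
    second_unswapped = {"A": "B", "B": "A", "C": "C"}[second]
    return first if first == second_unswapped else "C"
```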