Introduction to Information Retrieval, Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze, 2008 (Cambridge University Press) - Provides a comprehensive introduction to core information retrieval metrics such as precision, recall, and mean reciprocal rank (MRR), which are essential for evaluating the retriever component of RAG systems.
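
As a quick orientation to those retriever metrics, here is a minimal Python sketch of set-based precision/recall and MRR over retrieved document IDs. The function names and toy document IDs are illustrative, not taken from the book:

```python
from typing import List, Set, Tuple

def precision_recall(retrieved: List[str], relevant: Set[str]) -> Tuple[float, float]:
    """Set-based precision and recall over a single query's retrieved document IDs."""
    hits = sum(1 for doc_id in retrieved if doc_id in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def mean_reciprocal_rank(rankings: List[List[str]],
                         relevant_per_query: List[Set[str]]) -> float:
    """MRR: average over queries of 1/rank of the first relevant document
    (a query contributes 0 if no relevant document is retrieved)."""
    total = 0.0
    for retrieved, relevant in zip(rankings, relevant_per_query):
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(rankings) if rankings else 0.0

# Example: query 1 finds its relevant doc at rank 2, query 2 at rank 1.
print(mean_reciprocal_rank(
    [["d3", "d1", "d7"], ["d2", "d9"]],
    [{"d1"}, {"d2"}],
))  # (1/2 + 1/1) / 2 = 0.75
```
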
RAGAS: Automated Evaluation of Retrieval Augmented Generation, Shahul Es, Jithin James, Luis Espinosa-Anke, Steven Schockaert, 2023 (arXiv preprint arXiv:2309.15217), DOI: 10.48550/arXiv.2309.15217 - Introduces a framework designed specifically for evaluating RAG systems, detailing automated, reference-free metrics for faithfulness, answer relevance, and context relevance, directly relevant to this section.
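
The paper defines faithfulness as the fraction of statements in the generated answer that are supported by the retrieved context. The sketch below mirrors that ratio in plain Python; in RAGAS both steps are performed by LLM prompts, so the `extract_statements` and `is_supported` helpers here are hypothetical stand-ins, not the library's API:

```python
from typing import Callable, List

def faithfulness_score(
    answer: str,
    contexts: List[str],
    extract_statements: Callable[[str], List[str]],  # hypothetical: an LLM call in RAGAS
    is_supported: Callable[[str, List[str]], bool],  # hypothetical: an LLM verification call
) -> float:
    """RAGAS-style faithfulness: supported statements / total statements in the answer."""
    statements = extract_statements(answer)
    if not statements:
        return 0.0
    supported = sum(1 for s in statements if is_supported(s, contexts))
    return supported / len(statements)
```
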
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena, Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, Ion Stoica, 2023 (NeurIPS 2023 Datasets and Benchmarks Track), DOI: 10.48550/arXiv.2306.05685 - Investigates the efficacy and limitations of using large language models as automatic evaluators (LLM-as-a-judge) for generative tasks, a method that is directly relevant to assessing RAG outputs like faithfulness and answer relevance.
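
One of the grading modes the paper studies is single-answer grading, where a judge model returns a scalar score for one response. The sketch below illustrates that pattern applied to RAG faithfulness; the prompt wording and the `call_llm` helper are illustrative assumptions, not taken from the paper:

```python
from typing import Callable

JUDGE_PROMPT = """You are an impartial judge. Given a question, retrieved context,
and an answer, rate how faithful the answer is to the context on a scale of 1-10.
Question: {question}
Context: {context}
Answer: {answer}
Reply with only the integer rating."""

def judge_faithfulness(
    question: str,
    context: str,
    answer: str,
    call_llm: Callable[[str], str],  # hypothetical: wraps whatever LLM client you use
) -> int:
    """Single-answer grading: the judge model returns a scalar score for one response."""
    reply = call_llm(JUDGE_PROMPT.format(
        question=question, context=context, answer=answer))
    return int(reply.strip())
```

In practice the paper also discusses biases of such judges (e.g., position and verbosity bias), so scores like this are best calibrated against a sample of human judgments.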