Speech and Language Processing (3rd ed. draft), Daniel Jurafsky and James H. Martin, 2025 - A foundational textbook covering language models, perplexity, and general evaluation techniques in natural language processing.
BLEU: a Method for Automatic Evaluation of Machine Translation, Kishore Papineni, Salim Roukos, Todd Ward, Wei-Jing Zhu, 2002, Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (Association for Computational Linguistics), DOI: 10.3115/1073083.1073135 - The original paper introducing BLEU, a widely adopted metric for automatic evaluation of machine translation.
METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments, Satanjeev Banerjee, Alon Lavie, 2005, Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization (Association for Computational Linguistics) - Describes METEOR, a machine translation evaluation metric designed to correlate better with human judgments than BLEU.