Quantitative Evaluation: ROUGE, BLEU, and Perplexity
BLEU: A Method for Automatic Evaluation of Machine Translation, Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu, 2002. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL). DOI: 10.3115/1073083.1073135 - The original research paper introducing the BLEU metric, outlining its calculation from modified n-gram precision and a brevity penalty for machine translation evaluation (a minimal sketch of this computation follows the list).
ROUGE: A Package for Automatic Evaluation of Summaries, Chin-Yew Lin, 2004. Text Summarization Branches Out: Proceedings of the ACL-04 Workshop (Association for Computational Linguistics). DOI: 10.3115/1621378.1621389 - The original publication introducing the ROUGE suite of metrics, which measures recall-oriented n-gram overlap for automatic summary evaluation (see the ROUGE sketch below).
Speech and Language Processing (3rd ed. draft), Daniel Jurafsky and James H. Martin, 2025 - A comprehensive textbook on natural language processing, with in-depth coverage of language models, the definition and calculation of perplexity, and evaluation metrics such as BLEU and ROUGE (see the perplexity sketch below).
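
To make the BLEU annotation concrete, here is a minimal, self-contained sketch of sentence-level BLEU against a single reference, in the spirit of Papineni et al.: modified (clipped) n-gram precisions combined by a geometric mean and scaled by a brevity penalty. The function names and toy sentences are illustrative, not from the paper, and real evaluations would use an established implementation (e.g., sacreBLEU) with smoothing rather than this bare version.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU with one reference: geometric mean of
    modified n-gram precisions (candidate counts clipped by reference
    counts), scaled by a brevity penalty for short candidates."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        ref_counts = Counter(ngrams(reference, n))
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if clipped == 0 or total == 0:
            return 0.0  # one zero precision zeroes the geometric mean
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(candidate) >= len(reference) else math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(log_precisions) / max_n)

candidate = "the cat sat on the mat".split()
reference = "the cat is on the mat".split()
# max_n=2 here because the toy pair shares no 4-grams; standard BLEU
# uses max_n=4 with smoothing for short sentences.
print(f"BLEU-2 = {bleu(candidate, reference, max_n=2):.3f}")  # ~0.707
```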
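Likewise, a minimal sketch of ROUGE-N recall as Lin's paper defines it: the clipped count of reference n-grams that the candidate summary recovers, divided by the total number of reference n-grams. Names and example strings are again illustrative; production use would rely on a maintained ROUGE package.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n_recall(candidate, reference, n=1):
    """ROUGE-N recall: the share of the reference's n-grams (counts
    clipped to what the candidate contains) found in the candidate."""
    ref_counts = Counter(ngrams(reference, n))
    cand_counts = Counter(ngrams(candidate, n))
    overlap = sum(min(c, cand_counts[g]) for g, c in ref_counts.items())
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0

candidate = "the cat sat on the mat".split()
reference = "the cat is on the mat".split()
print(f"ROUGE-1 recall = {rouge_n_recall(candidate, reference, 1):.3f}")  # ~0.833
print(f"ROUGE-2 recall = {rouge_n_recall(candidate, reference, 2):.3f}")  # 0.600
```

Note the contrast with BLEU: ROUGE divides by the reference's n-gram count (recall), while BLEU's modified precision divides by the candidate's.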
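Finally, a sketch of perplexity as defined in Jurafsky and Martin: the exponentiated average negative log-likelihood a model assigns to a held-out token sequence. The per-token probabilities below are made-up stand-ins for a model's predictions, used only to show the arithmetic.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(-(1/N) * sum_i log p(w_i | w_<i)):
    the exponentiated average negative log-likelihood per token.
    Lower is better; a uniform model over V outcomes scores exactly V."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

# Hypothetical probabilities a model might assign to a 4-token sequence:
print(f"{perplexity([0.2, 0.1, 0.4, 0.25]):.2f}")    # ~4.73
# Sanity check: uniform probability 1/4 per token -> perplexity 4.
print(f"{perplexity([0.25, 0.25, 0.25, 0.25]):.2f}")  # 4.00
```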