Measuring Alignment: Initial Metrics and Limitations
Measuring Massive Multitask Language Understanding, Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, Jacob Steinhardt, 2021, International Conference on Learning Representations (ICLR). DOI: 10.48550/arXiv.2009.03300 - Introduces MMLU, a benchmark for evaluating the broad knowledge and reasoning abilities of language models across 57 subjects.
Finetuned Language Models are Zero-Shot Learners, Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, Quoc V. Le, 2022, International Conference on Learning Representations (ICLR). DOI: 10.48550/arXiv.2109.01652 - Introduces FLAN, a dataset and framework for instruction tuning, demonstrating improved zero-shot generalization of language models across a variety of tasks.
TruthfulQA: Measuring How Models Mimic Human Falsehoods, Stephanie Lin, Jacob Hilton, Owain Evans, 2022, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL). DOI: 10.48550/arXiv.2109.07958 - Presents TruthfulQA, a dataset and benchmark designed to measure how truthful language models are when generating answers to questions.
ROUGE: A Package for Automatic Evaluation of Summaries, Chin-Yew Lin, 2004, Text Summarization Branches Out (Association for Computational Linguistics). - Introduces ROUGE, a widely used set of metrics for evaluating automatic summarization and machine translation by comparing system outputs with reference summaries.
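Of the works above, ROUGE is the one defined by a closed-form overlap formula, so a short sketch can make it concrete. Below is a minimal ROUGE-1 F1 computation using plain whitespace tokenization; it is an illustration of the unigram-overlap idea, not the official ROUGE package, which additionally supports stemming, stopword removal, and the ROUGE-2 and ROUGE-L variants.

```python
from collections import Counter

def rouge_1_f1(candidate: str, reference: str) -> float:
    """ROUGE-1 F1: unigram overlap between a candidate and a reference summary."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped matches: each word counts at most min(cand[w], ref[w]) times.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Example: five of six unigrams match, so precision = recall = 5/6 and F1 is about 0.83.
print(rouge_1_f1("the cat sat on the mat", "the cat lay on the mat"))
```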