Distilling the Knowledge in a Neural Network, Geoffrey Hinton, Oriol Vinyals, Jeff Dean, 2015. arXiv preprint arXiv:1503.02531. DOI: 10.48550/arXiv.1503.02531 - Foundational paper introducing the concept of knowledge distillation, which underpins the need for evaluating student model fidelity against a teacher.
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding, Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, Samuel R. Bowman, 2018. ICLR 2019 (also available as arXiv preprint). DOI: 10.48550/arXiv.1804.07461 - Introduces the GLUE benchmark, a standard suite for evaluating NLU models, crucial for assessing the task performance and fidelity of distilled models.
BERTScore: Evaluating Text Generation with BERT, Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, Yoav Artzi, 2019. ICLR 2020 (also available as arXiv preprint). DOI: 10.48550/arXiv.1904.09675 - Proposes BERTScore, an automatic metric for evaluating text generation that correlates better with human judgment by leveraging contextual embeddings.