Distilling the Knowledge in a Neural Network, Geoffrey Hinton, Oriol Vinyals, Jeff Dean, 2015. arXiv preprint arXiv:1503.02531. DOI: 10.48550/arXiv.1503.02531 - Introduces the fundamental concept of knowledge distillation, including the use of softened targets (temperature scaling) and Kullback-Leibler divergence for transferring knowledge from a large teacher model to a smaller student (a loss sketch appears after these entries).
TinyBERT: Distilling BERT for Natural Language Understanding, Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, Qun Liu, 2020. EMNLP 2020. DOI: 10.48550/arXiv.1909.10351 - Presents a comprehensive knowledge distillation framework for compressing BERT that combines general pre-training distillation (task-agnostic) with task-specific fine-tuning distillation, demonstrating a hybrid approach (a layer-matching sketch appears after these entries).
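To make the first entry concrete, the following is a minimal PyTorch sketch of a Hinton-style distillation objective: a KL-divergence term between temperature-softened teacher and student distributions, combined with the usual cross-entropy on hard labels. The function name `distillation_loss` and the defaults `T=4.0` and `alpha=0.5` are illustrative assumptions, not values prescribed by the paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Hinton-style knowledge distillation loss (sketch).

    Mixes a KL term on temperature-softened distributions with standard
    cross-entropy on the hard labels. T and alpha are tunable; the values
    used here are illustrative defaults.
    """
    # Soften both distributions with temperature T.
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    # Scale the KL term by T^2 so its gradient magnitude stays comparable
    # to the hard-label term as T changes (as noted in the paper).
    kd_loss = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * (T * T)
    # Hard-label cross-entropy on the student's unsoftened logits.
    ce_loss = F.cross_entropy(student_logits, labels)
    return alpha * kd_loss + (1.0 - alpha) * ce_loss
```

In practice the teacher's logits are computed under `torch.no_grad()` so that only the student receives gradients.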
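The second entry's hybrid scheme can be sketched in the same spirit. The module below, again assuming PyTorch, pairs a learned linear projection and MSE between selected student and teacher hidden states (the intermediate-layer distillation reused in both the general and task-specific stages) with a softened prediction-layer term. The class name `IntermediateDistillationLoss`, the layer pairing, and the equal loss weighting are illustrative choices, not the paper's exact configuration; a KL term stands in for the paper's soft cross-entropy (the gradients with respect to the student are the same).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IntermediateDistillationLoss(nn.Module):
    """Sketch of TinyBERT-style layer-wise plus prediction-layer distillation."""

    def __init__(self, student_dim, teacher_dim):
        super().__init__()
        # Learned projection so student hidden states match the teacher's width.
        self.proj = nn.Linear(student_dim, teacher_dim, bias=False)

    def forward(self, student_hidden, teacher_hidden,
                student_logits, teacher_logits, T=1.0):
        # student_hidden / teacher_hidden: lists of [batch, seq, dim] tensors,
        # already paired by some layer-mapping rule (e.g. uniform mapping).
        hidden_loss = sum(
            F.mse_loss(self.proj(h_s), h_t)
            for h_s, h_t in zip(student_hidden, teacher_hidden)
        )
        # Prediction-layer distillation on temperature-softened logits.
        pred_loss = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)
        # Equal weighting here is a simplification; weights are tunable.
        return hidden_loss + pred_loss
```

The same loss can be applied twice, first against a general-purpose teacher over unlabeled text and then against a fine-tuned teacher on task data, which is the two-stage structure the entry describes.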