Distilling the Knowledge in a Neural Network, Geoffrey Hinton, Oriol Vinyals, Jeff Dean, 2015. NIPS 2014 Deep Learning Workshop. DOI: 10.48550/arXiv.1503.02531 - This foundational paper introduces knowledge distillation, explaining the use of soft targets and temperature scaling for training a smaller student model to mimic a larger teacher model.
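As a concrete illustration of the soft-target objective with temperature scaling described in this entry, here is a minimal PyTorch-style sketch (not the paper's original code); the function name `distillation_loss` and the defaults `T=2.0` and `alpha=0.5` are illustrative assumptions.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Illustrative sketch, not the paper's code: blend a soft-target term with
    # the usual hard-label cross-entropy. T and alpha are hypothetical defaults.
    # Temperature-softened distributions (higher T -> softer targets).
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    # KL divergence between the softened teacher and student distributions;
    # the T**2 factor keeps gradient magnitudes comparable across temperatures,
    # as noted in the paper.
    soft_loss = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * (T ** 2)
    # Standard cross-entropy on the ground-truth hard labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```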
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, Victor Sanh, Lysandre Debut, Julien Chaumond, Thomas Wolf, 2019. 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing, NeurIPS 2019. DOI: 10.48550/arXiv.1910.01108 - Presents a practical application of knowledge distillation to pre-trained transformer-based language models, showcasing techniques such as output distribution matching and intermediate layer matching.
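To illustrate the two distillation signals mentioned in this entry, the following is a rough PyTorch-style sketch assuming access to the student's and teacher's logits and hidden states; the function name and the equal weighting of the two terms are assumptions, and DistilBERT's full objective additionally includes a masked-language-modelling loss.

```python
import torch
import torch.nn.functional as F

def distilbert_style_loss(student_logits, teacher_logits,
                          student_hidden, teacher_hidden, T=2.0):
    # Rough sketch of two DistilBERT-style terms (the actual training objective
    # also adds a masked-language-modelling loss); names and weights are illustrative.
    # Output distribution matching: KL between temperature-softened distributions.
    kd_loss = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                       F.softmax(teacher_logits / T, dim=-1),
                       reduction="batchmean") * (T ** 2)
    # Intermediate layer matching: align the direction of student and teacher
    # hidden states with a cosine embedding loss, flattening (batch, seq, dim)
    # to (batch * seq, dim) since the loss expects 2-D inputs.
    s = student_hidden.reshape(-1, student_hidden.size(-1))
    t = teacher_hidden.reshape(-1, teacher_hidden.size(-1))
    target = torch.ones(s.size(0), device=s.device)
    cos_loss = F.cosine_embedding_loss(s, t, target)
    return kd_loss + cos_loss
```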