HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units, Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed, 2021. IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 30. DOI: 10.48550/arXiv.2106.07447 - This paper presents HuBERT, an alternative self-supervised method that uses an offline clustering step to generate discrete hidden units as targets for masked prediction, offering a different approach to speech representation learning.
Robust Speech Recognition via Large-Scale Weak Supervision, Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever, 2022 (OpenAI). DOI: 10.48550/arXiv.2212.04356 - This report describes the Whisper model family, which achieves strong performance through weakly supervised training on a massive, diverse dataset of audio-text pairs, enabling multilingual and multi-task capabilities.