Speaker Recognition from Raw Waveform with SincNet, Mirco Ravanelli and Yoshua Bengio, 2018. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (IEEE). DOI: 10.1109/ICASSP.2018.8462083 - Introduces SincNet, a deep learning architecture that learns interpretable band-pass filters directly from the raw audio waveform, demonstrating an early approach to learned feature extraction.
wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations, Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli, 2020. Advances in Neural Information Processing Systems (NeurIPS) 33. DOI: 10.48550/arXiv.2006.11477 - Presents wav2vec 2.0, a prominent self-supervised learning framework that directly processes raw audio waveforms to learn powerful speech representations, bypassing traditional feature extraction.