Sequence Transduction with Recurrent Neural Networks, Alex Graves, 2012. International Conference on Machine Learning (ICML) 2012 Workshop on Representation Learning. DOI: 10.48550/arXiv.1211.3711 - This foundational paper introduces the Recurrent Neural Network Transducer (RNN-T), a key architecture for streaming ASR because it can emit outputs step by step without waiting for the full input sequence.
Robust Voice Activity Detection Using Deep Neural Networks, Xiang Zhang, S. M. Ahadi, V. Ramana Rao, T. J. Cox, 2013. Interspeech 2013, International Speech Communication Association (ISCA). DOI: 10.21437/Interspeech.2013-402 - This paper explores the use of Deep Neural Networks for Voice Activity Detection (VAD), a component that helps real-time streaming ASR systems manage computational resources and detect utterance boundaries.