ASR systems commonly operate in an offline, or batch, mode. In this mode, a complete audio file is provided, and the model processes it to return a full transcript. This method is well-suited for transcribing recorded lectures or interviews. However, for applications such as voice assistants, live captioning, or dictation, the system must process audio incrementally as it is spoken and deliver results with minimal delay. Meeting this requirement is the challenge of real-time streaming ASR.
Moving from an offline model to a streaming one is more than a minor adjustment. It requires fundamental changes to the model architecture and the entire processing pipeline to address two primary constraints: latency and causality.
Many of the high-performing architectures you have learned about, such as Bidirectional LSTMs (BiLSTMs) or the standard Transformer, are non-causal. A BiLSTM processes a sequence both forwards and backwards, so its output at time step t depends on the entire input, including frames that arrive after t. Similarly, the self-attention mechanism in a standard Transformer attends over the entire input sequence to compute its representations.
This is a luxury that a real-time system does not have. To produce a transcript with low latency, the model can only use audio it has already received. It cannot wait for the user to finish speaking to begin transcribing. This means any model used for streaming must be causal, making predictions based only on past and present information.
Diagram: data flow in a bidirectional model versus a causal, unidirectional model. The bidirectional model uses future context, which makes it unsuitable for real-time applications, while the causal model processes only past and present information.
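To make the causality constraint concrete, here is a minimal sketch (PyTorch, with toy dimensions) of scaled dot-product self-attention with and without a causal mask: when the mask is applied, scores for future frames are set to negative infinity before the softmax, so each frame's representation depends only on past and present frames.

```python
import torch

def self_attention(q, k, v, causal=False):
    """Scaled dot-product attention over a (time, dim) sequence."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5        # (T, T) pairwise scores

    if causal:
        # Disallow attention to the "future": entry (t, s) is masked when s > t,
        # so frame t can only attend to frames 0..t.
        t = scores.size(-1)
        future = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(future, float("-inf"))

    return torch.softmax(scores, dim=-1) @ v

# Toy input: 6 frames of 8-dimensional features.
x = torch.randn(6, 8)
offline = self_attention(x, x, x, causal=False)      # every frame sees all frames
streamable = self_attention(x, x, x, causal=True)    # no future context used
```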
The most common approach to implement streaming is to process the incoming audio in small, contiguous segments, often called chunks. For example, a system might buffer 200 milliseconds of audio, process it, emit a partial transcript, and then move to the next 200-millisecond chunk.
This introduces a new problem: context. An acoustic model needs context from previous chunks to make sense of the current one. If you treat each chunk as an independent input, the model will struggle at the boundaries. For instance, a phoneme might be split right between two chunks.
To solve this, streaming systems must be stateful. If you are using an RNN-based architecture, this is straightforward. The final hidden state of the RNN after processing chunk N is saved and used as the initial hidden state for processing chunk N+1. This allows information to flow across chunk boundaries, giving the model a memory of what was said previously. For Transformer-based models, a similar state can be maintained by caching the keys and values of the attention layers from previous chunks.
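As a rough illustration (PyTorch, with made-up feature and chunk sizes), the sketch below runs a unidirectional LSTM encoder over a stream of 200-millisecond feature chunks and carries its hidden state from one chunk to the next, so information flows across chunk boundaries.

```python
import torch
import torch.nn as nn

# Hypothetical sizes: 80-dim filterbank features, 20 frames per 200 ms chunk.
FEAT_DIM, HIDDEN_DIM, CHUNK_FRAMES = 80, 256, 20

encoder = nn.LSTM(input_size=FEAT_DIM, hidden_size=HIDDEN_DIM, batch_first=True)

def stream_encode(chunks):
    """Encode a stream of feature chunks, carrying LSTM state across chunks."""
    state = None  # (h, c); None lets PyTorch initialize zeros for the first chunk
    for chunk in chunks:                       # chunk: (1, CHUNK_FRAMES, FEAT_DIM)
        output, state = encoder(chunk, state)  # the saved state carries context forward
        yield output                           # per-frame encodings for this chunk

# Simulate a stream of 5 chunks (about one second of audio in this toy setup).
stream = (torch.randn(1, CHUNK_FRAMES, FEAT_DIM) for _ in range(5))
for encoded in stream_encode(stream):
    pass  # in a real pipeline these encodings would feed the decoder
```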
A streaming ASR system should not run constantly: processing silence or background noise wastes significant computational resources. The system also needs a signal that an utterance is complete so it can finalize the transcript and reset for the next one.
This is the job of a Voice Activity Detection (VAD) module. A VAD is a smaller, highly efficient model or algorithm that does one thing: it distinguishes speech from non-speech. The ASR pipeline uses the VAD as a gatekeeper: audio is only passed to the acoustic model while speech is detected, and a sustained stretch of non-speech marks the end of an utterance, at which point the system finalizes the transcript and resets its state for the next one.
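The sketch below shows this gatekeeper pattern with a deliberately naive energy-threshold VAD standing in for a real one, and a dummy recognizer object standing in for the streaming ASR model; the frame size, silence threshold, and end-of-utterance timeout are all arbitrary choices for illustration.

```python
import numpy as np

FRAME_MS = 20                    # VAD decisions are usually made on short frames
END_OF_UTTERANCE_FRAMES = 30     # ~600 ms of silence ends the utterance (tunable)

def is_speech(frame, threshold=0.01):
    """Toy energy-based VAD; real systems use a small trained model instead."""
    return np.mean(frame ** 2) > threshold

class DummyRecognizer:
    """Stand-in for a streaming recognizer, just to make the sketch runnable."""
    def __init__(self):
        self.n_frames = 0
    def accept_audio(self, frame):
        self.n_frames += 1
    def finalize(self):
        text = f"<transcript from {self.n_frames} speech frames>"
        self.n_frames = 0
        return text

def gatekeeper(frames, recognizer):
    """Feed only speech to the recognizer; finalize after sustained silence."""
    silence_run = 0
    in_utterance = False
    for frame in frames:
        if is_speech(frame):
            silence_run = 0
            in_utterance = True
            recognizer.accept_audio(frame)
        elif in_utterance:
            silence_run += 1
            if silence_run >= END_OF_UTTERANCE_FRAMES:
                yield recognizer.finalize()   # emit the transcript, reset state
                in_utterance = False
                silence_run = 0

# Simulate a stream: 1 s of low-level noise, 1 s of "speech", 1 s of noise.
rng = np.random.default_rng(0)
frame_len = 16000 * FRAME_MS // 1000
stream = ([rng.normal(0, 0.001, frame_len) for _ in range(50)]
          + [rng.normal(0, 0.3, frame_len) for _ in range(50)]
          + [rng.normal(0, 0.001, frame_len) for _ in range(50)])
for transcript in gatekeeper(stream, DummyRecognizer()):
    print(transcript)
```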
While you can adapt a unidirectional LSTM with CTC for streaming, certain architectures are inherently better suited for this task.
RNN-Transducer (RNN-T): This model is a popular choice for on-device and streaming ASR. Like CTC, it emits outputs frame by frame as the audio arrives, but unlike CTC it does not treat its predictions as independent: it combines an encoder network (such as an LSTM) with a "prediction network" that models text-only context, and a joint network merges their outputs to predict either the next output token or a blank symbol. This step-by-step emission, conditioned on what has already been emitted, is a natural fit for streaming.
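To make the emission loop concrete, here is a simplified sketch of greedy RNN-T decoding with untrained toy networks (PyTorch; the sizes, the symbols-per-frame cap, and the greedy strategy are simplifications of what a production decoder would use). For each encoder frame, the joint network combines the acoustic encoding with the prediction network's output; non-blank predictions are emitted immediately and fed back into the prediction network, while a blank moves on to the next frame.

```python
import torch
import torch.nn as nn

VOCAB, BLANK, ENC_DIM, PRED_DIM, JOINT_DIM = 30, 0, 256, 128, 256

embed = nn.Embedding(VOCAB, PRED_DIM)
prediction_net = nn.LSTMCell(PRED_DIM, PRED_DIM)   # models text-only context
joint = nn.Sequential(                             # combines the two branches
    nn.Linear(ENC_DIM + PRED_DIM, JOINT_DIM), nn.Tanh(), nn.Linear(JOINT_DIM, VOCAB)
)

@torch.no_grad()
def greedy_decode(encoder_frames, max_symbols_per_frame=5):
    """Emit tokens frame by frame; a blank means 'move on to the next frame'."""
    tokens = []
    h = c = torch.zeros(1, PRED_DIM)
    pred_out = torch.zeros(1, PRED_DIM)   # prediction-net output before any token
    for enc_t in encoder_frames:          # enc_t: (1, ENC_DIM), arrives as audio streams in
        for _ in range(max_symbols_per_frame):
            logits = joint(torch.cat([enc_t, pred_out], dim=-1))
            k = int(logits.argmax(dim=-1))
            if k == BLANK:
                break                     # nothing more to emit for this frame
            tokens.append(k)              # emit immediately, which suits streaming
            h, c = prediction_net(embed(torch.tensor([k])), (h, c))
            pred_out = h
    return tokens

# Toy stream of 10 encoder frames (in practice these come from a causal encoder).
frames = (torch.randn(1, ENC_DIM) for _ in range(10))
print(greedy_decode(frames))
```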
Chunk-wise Transformers: To make Transformers causal, the self-attention mechanism is modified. Instead of attending to the full sequence, attention is constrained to a fixed-size look-back window, or "left context." By processing audio in chunks and caching the representations of past chunks, these models can effectively simulate a sliding attention window over the audio stream, combining the power of Transformers with the requirements of low-latency processing.
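A minimal sketch of this chunk-and-cache idea (PyTorch, single attention head, invented sizes): each incoming chunk attends causally over itself plus a cache of keys and values from earlier chunks, and the cache is trimmed to a fixed left-context budget.

```python
import torch

DIM, LEFT_CONTEXT = 64, 40   # keep at most 40 past frames of keys/values (assumed budget)

def chunk_attention(q, k_cache, v_cache, k_new, v_new):
    """Attend from the new chunk's queries over cached plus current keys/values."""
    k = torch.cat([k_cache, k_new], dim=0)
    v = torch.cat([v_cache, v_new], dim=0)
    scores = q @ k.T / DIM ** 0.5                      # (chunk, cache + chunk)

    # Causal mask within the current chunk: a frame may see the whole cache
    # (it is entirely in the past) but not later frames of its own chunk.
    past = k_cache.size(0)
    t = q.size(0)
    future = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
    scores[:, past:] = scores[:, past:].masked_fill(future, float("-inf"))

    return torch.softmax(scores, dim=-1) @ v

k_cache = v_cache = torch.empty(0, DIM)                # cache starts empty
for _ in range(6):                                     # six incoming chunks
    chunk = torch.randn(10, DIM)                       # pretend q = k = v = features
    out = chunk_attention(chunk, k_cache, v_cache, chunk, chunk)
    k_cache = torch.cat([k_cache, chunk], dim=0)[-LEFT_CONTEXT:]   # sliding left context
    v_cache = torch.cat([v_cache, chunk], dim=0)[-LEFT_CONTEXT:]
```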
Finally, the decoding process also changes. Instead of producing one final transcript, a streaming decoder must continually update its hypothesis as each new chunk of audio arrives. This means users will see the transcript text appearing and sometimes correcting itself in real-time as the model gains more context. Managing this to provide a stable and readable live output is a significant user experience challenge.
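One common heuristic for stabilizing the display, sketched below with an arbitrary threshold, is to split the output into a committed prefix and a volatile tail: a token is only committed once it has survived unchanged across several consecutive hypothesis updates, so earlier text stops flickering while the most recent words remain free to change.

```python
STABLE_AFTER = 3   # a token must survive this many consecutive updates (tunable)

class PartialTranscript:
    """Track a committed prefix plus a volatile tail of recent hypothesis tokens."""

    def __init__(self):
        self.committed = []   # tokens we will no longer change
        self.tail = []        # (token, age) pairs that may still be revised

    def update(self, hypothesis):
        """Take the decoder's latest full hypothesis and refresh the display."""
        new_tail = hypothesis[len(self.committed):]    # the part not yet committed
        aged = []
        for i, tok in enumerate(new_tail):
            prev_age = self.tail[i][1] if i < len(self.tail) and self.tail[i][0] == tok else 0
            aged.append((tok, prev_age + 1))
        # Commit the leading tokens that have been stable long enough.
        while aged and aged[0][1] >= STABLE_AFTER:
            self.committed.append(aged.pop(0)[0])
        self.tail = aged
        return " ".join(self.committed) + " | " + " ".join(t for t, _ in self.tail)

# Example: the decoder later revises an uncommitted word ("recognize" -> "recognise").
display = PartialTranscript()
for hyp in [["i"], ["i", "want"], ["i", "want", "to"], ["i", "want", "to", "recognize"],
            ["i", "want", "to", "recognise", "speech"], ["i", "want", "to", "recognise", "speech"]]:
    print(display.update(hyp))
```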