Connectionist Temporal Classification (CTC) allows acoustic models to be trained without frame-by-frame alignments. However, CTC relies on a strong conditional independence assumption: given the audio, the prediction at each timestep is treated as independent of the predictions at every other timestep. This limits the model's ability to learn the linguistic structure inherent in the output sequence. Sequence-to-sequence (Seq2Seq) models offer a more powerful framework by directly modeling the probability of the entire output text sequence given the input audio sequence, P(text | audio). This approach learns both acoustic and linguistic patterns in a single, unified architecture.
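The contrast can be written out explicitly. The following is a sketch in standard notation, with illustrative symbols: x is the audio feature sequence, π a frame-level CTC alignment of length T, and y the output token sequence of length U.

```latex
% CTC: frame-level outputs are conditionally independent given the audio;
% P(y | x) is then the sum of this product over all alignments pi that collapse to y
P_{\text{CTC}}(\pi \mid x) = \prod_{t=1}^{T} P(\pi_t \mid x)

% Seq2Seq: each output token is conditioned on the audio and on every previous token
P(y \mid x) = \prod_{t=1}^{U} P(y_t \mid y_1, \dots, y_{t-1}, x)
```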
At the heart of any Seq2Seq model is the encoder-decoder architecture. This design separates the task of processing the input from generating the output, providing a flexible structure for mapping sequences of different lengths.
The encoder's role is to process the entire input sequence of audio features, such as log-mel spectrograms, and condense this information into a set of high-level representations. Typically, the encoder is a multi-layer Recurrent Neural Network (RNN), often a Bidirectional Long Short-Term Memory (BiLSTM) network. By processing the sequence in both forward and backward directions, a BiLSTM encoder creates a hidden state at each timestep that contains information about the entire audio context surrounding that point.
The output of the encoder is a sequence of hidden state vectors, one for each input timestep. For an input spectrogram with T frames, the encoder produces a sequence of hidden states h_1, h_2, ..., h_T.
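As an illustration, a minimal encoder along these lines might look as follows in PyTorch. The class name, the 80-dimensional log-mel input, and the layer sizes are assumptions made for the sketch, not a prescribed configuration.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Bidirectional LSTM encoder over a sequence of log-mel frames."""
    def __init__(self, n_mels=80, hidden_size=256, num_layers=3):
        super().__init__()
        self.lstm = nn.LSTM(
            input_size=n_mels,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,
            bidirectional=True,
        )

    def forward(self, features):
        # features: (batch, T, n_mels) log-mel spectrogram frames
        # outputs:  (batch, T, 2 * hidden_size), one hidden state per frame,
        #           concatenating the forward and backward directions
        outputs, _ = self.lstm(features)
        return outputs
```

Because the LSTM is bidirectional, each of the T output vectors concatenates a forward and a backward state, so every h_t reflects the audio context on both sides of that frame.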
The decoder is an autoregressive RNN that generates the output transcript one token at a time (where a token can be a character, word, or sub-word unit). It functions as a conditional language model: given the encoded audio and the sequence of tokens already generated, it predicts the next token in the transcript.
At each step of the generation process, the decoder takes two primary inputs:

- The token it generated at the previous step (or a special start-of-sequence token at the first step).
- A context vector summarizing the relevant encoder hidden states, produced by the attention mechanism described below.
This step-by-step generation continues until the decoder produces a special end-of-sequence (<eos>) token.
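Continuing the sketch, a single decoder step that consumes these two inputs could be written as below. The LSTMCell, the dimensions, and the names are again illustrative assumptions.

```python
import torch
import torch.nn as nn

class DecoderStep(nn.Module):
    """One autoregressive step: previous token + attention context -> next-token scores."""
    def __init__(self, vocab_size, embed_dim=256, enc_dim=512, hidden_size=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn_cell = nn.LSTMCell(embed_dim + enc_dim, hidden_size)
        self.output_proj = nn.Linear(hidden_size, vocab_size)

    def forward(self, prev_token, context, state):
        # prev_token: (batch,) token ids from the previous step (<sos> at step 0)
        # context:    (batch, enc_dim) attention context vector c_t
        # state:      (h, c) LSTM cell state carried across decoding steps
        rnn_input = torch.cat([self.embedding(prev_token), context], dim=-1)
        h, c = self.rnn_cell(rnn_input, state)
        logits = self.output_proj(h)  # unnormalized scores over the output vocabulary
        return logits, (h, c)
```

During training the previous token is the ground-truth token (see teacher forcing below); during inference it is the model's own prediction from the preceding step.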
The encoder processes audio frames (x_1, ..., x_T) into hidden states (h_1, ..., h_T). At each decoding step, the attention mechanism creates a context vector (c_t) from all encoder states. The decoder uses this context and the previously generated token (y_{t-1}) to produce the next token (y_t).
Early Seq2Seq models attempted to compress all encoder hidden states into a single, fixed-size context vector. This created an information bottleneck, as a single vector struggled to retain all the necessary details from a long audio clip. The attention mechanism, which you were introduced to in the previous section, directly solves this problem by allowing the decoder to dynamically "look back" at all encoder hidden states and focus on the most relevant ones at each step of the generation process.
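A simple dot-product attention along these lines can be sketched as follows. The projection layer and dimension names are assumptions; additive (Bahdanau-style) scoring is an equally common choice.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DotProductAttention(nn.Module):
    """Scores each encoder state against the decoder state and returns a weighted sum."""
    def __init__(self, dec_dim, enc_dim):
        super().__init__()
        self.query_proj = nn.Linear(dec_dim, enc_dim)

    def forward(self, decoder_state, encoder_states):
        # decoder_state:  (batch, dec_dim)     current decoder hidden state
        # encoder_states: (batch, T, enc_dim)  h_1 ... h_T from the encoder
        query = self.query_proj(decoder_state).unsqueeze(1)        # (batch, 1, enc_dim)
        scores = torch.bmm(query, encoder_states.transpose(1, 2))  # (batch, 1, T)
        weights = F.softmax(scores, dim=-1)                        # attention weights over frames
        context = torch.bmm(weights, encoder_states).squeeze(1)    # (batch, enc_dim) context vector c_t
        return context, weights.squeeze(1)
```

At each decoding step the weights form a soft alignment over the audio frames, so the context vector c_t emphasizes the frames most relevant to the token currently being generated.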
Seq2Seq models are trained end-to-end to maximize the probability of the ground-truth transcript given the audio. The loss function is typically categorical cross-entropy, calculated at each decoder timestep and then averaged over the entire sequence.
A common technique used to stabilize and accelerate the training of these models is teacher forcing. During training, instead of feeding the decoder's own (and possibly incorrect) prediction from the previous step as input to the current step, we feed the ground-truth token from the reference transcript. This prevents the model from compounding its own errors and helps it learn the alignment between audio and text more efficiently. However, it also creates a discrepancy between training (where inputs are always correct) and inference (where the model must rely on its own predictions), a problem known as exposure bias.
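Putting the pieces together, a teacher-forced training step using the hypothetical modules sketched above might look like this. The start-of-sequence and padding token ids are assumed to exist in the vocabulary.

```python
import torch
import torch.nn.functional as F

def teacher_forced_loss(encoder, attention, decoder_step, features, targets, sos_id, pad_id):
    """Average cross-entropy over decoder steps, feeding ground-truth tokens as inputs.

    features: (batch, T, n_mels) audio frames
    targets:  (batch, U) reference token ids, ending in <eos>, padded with pad_id
    """
    encoder_states = encoder(features)                        # (batch, T, enc_dim)
    batch, U = targets.shape
    hidden_size = decoder_step.rnn_cell.hidden_size
    device = features.device

    # Start from a zero decoder state and a <sos> token as the first input.
    h = torch.zeros(batch, hidden_size, device=device)
    c = torch.zeros(batch, hidden_size, device=device)
    prev_token = torch.full((batch,), sos_id, dtype=torch.long, device=device)

    total_loss = 0.0
    for t in range(U):
        context, _ = attention(h, encoder_states)             # context vector c_t
        logits, (h, c) = decoder_step(prev_token, context, (h, c))
        total_loss = total_loss + F.cross_entropy(logits, targets[:, t], ignore_index=pad_id)
        prev_token = targets[:, t]                            # teacher forcing: feed the ground truth
    return total_loss / U
```

Replacing `prev_token = targets[:, t]` with the model's own argmax prediction would remove teacher forcing; the mismatch between these two regimes is exactly what exposure bias refers to.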
During inference, there is no ground truth to guide the decoder. The model must generate the entire output sequence based on its own predictions. Finding the most probable sequence requires a search algorithm.
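The simplest option is greedy decoding, which keeps only the single most likely token at each step; beam search, which keeps several candidate sequences in parallel, usually produces better transcripts. A greedy sketch with the same hypothetical modules:

```python
import torch

@torch.no_grad()
def greedy_decode(encoder, attention, decoder_step, features, sos_id, eos_id, max_len=200):
    """Generate a transcript for one utterance by taking the most likely token at each step."""
    encoder_states = encoder(features)                 # features: (1, T, n_mels)
    hidden_size = decoder_step.rnn_cell.hidden_size
    h = torch.zeros(1, hidden_size)
    c = torch.zeros(1, hidden_size)
    prev_token = torch.tensor([sos_id])

    tokens = []
    for _ in range(max_len):
        context, _ = attention(h, encoder_states)
        logits, (h, c) = decoder_step(prev_token, context, (h, c))
        prev_token = logits.argmax(dim=-1)             # keep only the highest-scoring token
        if prev_token.item() == eos_id:
            break
        tokens.append(prev_token.item())
    return tokens
```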
Seq2Seq models represent a significant advance over CTC-based approaches for many ASR tasks.
Advantages:

- A single network jointly learns acoustic and linguistic patterns, with no conditional independence assumption between output tokens.
- The decoder acts as a conditional language model, so the system relies less on an external language model than CTC-based pipelines.
- The output can be generated directly as characters, sub-word units, or words.

Limitations:

- Attention is computed over the entire encoder output, so the model typically needs the full utterance before decoding, which makes low-latency streaming recognition difficult.
- Teacher forcing introduces exposure bias: the model never sees its own mistakes during training, so errors can compound at inference time.
- Inference requires a search procedure such as beam search, which adds computational cost and complexity compared with CTC's simpler frame-wise decoding.
These models laid the foundation for many modern architectures. In the next section, we will look at a specific and influential Seq2Seq model for ASR: Listen, Attend, and Spell (LAS).