Connectionist Temporal Classification (CTC) allows acoustic models to be trained without frame-by-frame alignments. However, CTC relies on a strong conditional independence assumption: given the audio, the prediction at each timestep is treated as independent of the predictions at every other timestep. This limits the model's ability to learn the linguistic structure inherent in the output sequence. Sequence-to-sequence (Seq2Seq) models offer a more powerful framework by directly modeling the probability of the entire output text sequence given the input audio sequence, P(text | audio). This approach learns both acoustic and linguistic patterns in a single, unified architecture.
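The contrast can be written out explicitly. The following is a sketch in standard notation, with illustrative symbols: x is the audio feature sequence, π a frame-level CTC alignment of length T, and y the output token sequence of length U.

```latex
% CTC: frame-level outputs are conditionally independent given the audio;
% P(y | x) is then the sum of this product over all alignments pi that collapse to y
P_{\text{CTC}}(\pi \mid x) = \prod_{t=1}^{T} P(\pi_t \mid x)

% Seq2Seq: each output token is conditioned on the audio and on every previous token
P(y \mid x) = \prod_{t=1}^{U} P(y_t \mid y_1, \dots, y_{t-1}, x)
```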
At the heart of any Seq2Seq model is the encoder-decoder architecture. This design separates the task of processing the input from generating the output, providing a flexible structure for mapping sequences of different lengths.
The encoder's role is to process the entire input sequence of audio features, such as log-mel spectrograms, and condense this information into a set of high-level representations. Typically, the encoder is a multi-layer Recurrent Neural Network (RNN), often a Bidirectional Long Short-Term Memory (BiLSTM) network. By processing the sequence in both forward and backward directions, a BiLSTM encoder creates a hidden state at each timestep that contains information about the entire audio context surrounding that point.
The output of the encoder is a sequence of hidden state vectors, one for each input timestep. For an input spectrogram with T frames, the encoder produces a sequence of hidden states h_1, h_2, ..., h_T.
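As an illustration, a minimal encoder along these lines might look as follows in PyTorch. The class name, the 80-dimensional log-mel input, and the layer sizes are assumptions made for the sketch, not a prescribed configuration.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Bidirectional LSTM encoder over a sequence of log-mel frames."""
    def __init__(self, n_mels=80, hidden_size=256, num_layers=3):
        super().__init__()
        self.lstm = nn.LSTM(
            input_size=n_mels,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,
            bidirectional=True,
        )

    def forward(self, features):
        # features: (batch, T, n_mels) log-mel spectrogram frames
        # outputs:  (batch, T, 2 * hidden_size), one hidden state per frame,
        #           concatenating the forward and backward directions
        outputs, _ = self.lstm(features)
        return outputs
```

Because the LSTM is bidirectional, each of the T output vectors concatenates a forward and a backward state, so every h_t reflects the audio context on both sides of that frame.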
The decoder is an autoregressive RNN that generates the output transcript one token at a time (where a token can be a character, word, or sub-word unit). It functions as a conditional language model: given the encoded audio and the sequence of tokens already generated, it predicts the next token in the transcript.
At each step of the generation process, the decoder takes two primary inputs:

- The token it generated at the previous step (or a special start-of-sequence token at the first step).
- A context vector summarizing the relevant encoder hidden states, produced by the attention mechanism described below.
This step-by-step generation continues until the decoder produces a special end-of-sequence (<eos>) token.
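Continuing the sketch, a single decoder step that consumes these two inputs could be written as below. The LSTMCell, the dimensions, and the names are again illustrative assumptions.

```python
import torch
import torch.nn as nn

class DecoderStep(nn.Module):
    """One autoregressive step: previous token + attention context -> next-token scores."""
    def __init__(self, vocab_size, embed_dim=256, enc_dim=512, hidden_size=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn_cell = nn.LSTMCell(embed_dim + enc_dim, hidden_size)
        self.output_proj = nn.Linear(hidden_size, vocab_size)

    def forward(self, prev_token, context, state):
        # prev_token: (batch,) token ids from the previous step (<sos> at step 0)
        # context:    (batch, enc_dim) attention context vector c_t
        # state:      (h, c) LSTM cell state carried across decoding steps
        rnn_input = torch.cat([self.embedding(prev_token), context], dim=-1)
        h, c = self.rnn_cell(rnn_input, state)
        logits = self.output_proj(h)  # unnormalized scores over the output vocabulary
        return logits, (h, c)
```

During training the previous token is the ground-truth token (see teacher forcing below); during inference it is the model's own prediction from the preceding step.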
The encoder processes audio frames (x_1, ..., x_T) into hidden states (h_1, ..., h_T). At each decoding step, the attention mechanism creates a context vector (c_t) from all encoder states. The decoder uses this context and the previously generated token (y_{t-1}) to produce the next token (y_t).
Early Seq2Seq models attempted to compress all encoder hidden states into a single, fixed-size context vector. This created an information bottleneck, as a single vector struggled to retain all the necessary details from a long audio clip. The attention mechanism, which you were introduced to in the previous section, directly solves this problem by allowing the decoder to dynamically "look back" at all encoder hidden states and focus on the most relevant ones at each step of the generation process.
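A simple dot-product attention along these lines can be sketched as follows. The projection layer and dimension names are assumptions; additive (Bahdanau-style) scoring is an equally common choice.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DotProductAttention(nn.Module):
    """Scores each encoder state against the decoder state and returns a weighted sum."""
    def __init__(self, dec_dim, enc_dim):
        super().__init__()
        self.query_proj = nn.Linear(dec_dim, enc_dim)

    def forward(self, decoder_state, encoder_states):
        # decoder_state:  (batch, dec_dim)     current decoder hidden state
        # encoder_states: (batch, T, enc_dim)  h_1 ... h_T from the encoder
        query = self.query_proj(decoder_state).unsqueeze(1)        # (batch, 1, enc_dim)
        scores = torch.bmm(query, encoder_states.transpose(1, 2))  # (batch, 1, T)
        weights = F.softmax(scores, dim=-1)                        # attention weights over frames
        context = torch.bmm(weights, encoder_states).squeeze(1)    # (batch, enc_dim) context vector c_t
        return context, weights.squeeze(1)
```

At each decoding step the weights form a soft alignment over the audio frames, so the context vector c_t emphasizes the frames most relevant to the token currently being generated.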
Seq2Seq models are trained end-to-end to maximize the probability of the ground-truth transcript given the audio. The loss function is typically categorical cross-entropy, calculated at each decoder timestep and then averaged over the entire sequence.
A common technique used to stabilize and accelerate the training of these models is teacher forcing. During training, instead of feeding the decoder's own (and possibly incorrect) prediction from the previous step as input to the current step, we feed the ground-truth token from the reference transcript. This prevents the model from compounding its own errors and helps it learn the alignment between audio and text more efficiently. However, it also creates a discrepancy between training (where inputs are always correct) and inference (where the model must rely on its own predictions), a problem known as exposure bias.
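Putting the pieces together, a teacher-forced training step using the hypothetical modules sketched above might look like this. The start-of-sequence and padding token ids are assumed to exist in the vocabulary.

```python
import torch
import torch.nn.functional as F

def teacher_forced_loss(encoder, attention, decoder_step, features, targets, sos_id, pad_id):
    """Average cross-entropy over decoder steps, feeding ground-truth tokens as inputs.

    features: (batch, T, n_mels) audio frames
    targets:  (batch, U) reference token ids, ending in <eos>, padded with pad_id
    """
    encoder_states = encoder(features)                        # (batch, T, enc_dim)
    batch, U = targets.shape
    hidden_size = decoder_step.rnn_cell.hidden_size
    device = features.device

    # Start from a zero decoder state and a <sos> token as the first input.
    h = torch.zeros(batch, hidden_size, device=device)
    c = torch.zeros(batch, hidden_size, device=device)
    prev_token = torch.full((batch,), sos_id, dtype=torch.long, device=device)

    total_loss = 0.0
    for t in range(U):
        context, _ = attention(h, encoder_states)             # context vector c_t
        logits, (h, c) = decoder_step(prev_token, context, (h, c))
        total_loss = total_loss + F.cross_entropy(logits, targets[:, t], ignore_index=pad_id)
        prev_token = targets[:, t]                            # teacher forcing: feed the ground truth
    return total_loss / U
```

Replacing `prev_token = targets[:, t]` with the model's own argmax prediction would remove teacher forcing; the mismatch between these two regimes is exactly what exposure bias refers to.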
During inference, there is no ground truth to guide the decoder. The model must generate the entire output sequence based on its own predictions. Finding the most probable sequence requires a search algorithm.
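The simplest option is greedy decoding, which keeps only the single most likely token at each step; beam search, which keeps several candidate sequences in parallel, usually produces better transcripts. A greedy sketch with the same hypothetical modules:

```python
import torch

@torch.no_grad()
def greedy_decode(encoder, attention, decoder_step, features, sos_id, eos_id, max_len=200):
    """Generate a transcript for one utterance by taking the most likely token at each step."""
    encoder_states = encoder(features)                 # features: (1, T, n_mels)
    hidden_size = decoder_step.rnn_cell.hidden_size
    h = torch.zeros(1, hidden_size)
    c = torch.zeros(1, hidden_size)
    prev_token = torch.tensor([sos_id])

    tokens = []
    for _ in range(max_len):
        context, _ = attention(h, encoder_states)
        logits, (h, c) = decoder_step(prev_token, context, (h, c))
        prev_token = logits.argmax(dim=-1)             # keep only the highest-scoring token
        if prev_token.item() == eos_id:
            break
        tokens.append(prev_token.item())
    return tokens
```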
Seq2Seq models represent a significant advance over CTC-based approaches for many ASR tasks.
Advantages:

- A single network jointly learns acoustic and linguistic patterns, with no conditional independence assumption between output tokens.
- The decoder acts as a conditional language model, so the system relies less on an external language model than CTC-based pipelines.
- The output can be generated directly as characters, sub-word units, or words.

Limitations:

- Attention is computed over the entire encoder output, so the model typically needs the full utterance before decoding, which makes low-latency streaming recognition difficult.
- Teacher forcing introduces exposure bias: the model never sees its own mistakes during training, so errors can compound at inference time.
- Inference requires a search procedure such as beam search, which adds computational cost and complexity compared with CTC's simpler frame-wise decoding.
These models laid the foundation for many modern architectures. In the next section, we will look at a specific and influential Seq2Seq model for ASR: Listen, Attend, and Spell (LAS).