While Connectionist Temporal Classification (CTC) provides an elegant way to handle variable-length audio inputs without needing explicit alignments during training, it makes a strong conditional independence assumption: the prediction at each time step is independent of other predictions, given the input audio. This can limit its ability to model the inherent dependencies within the output text sequence.
Sequence-to-sequence (Seq2Seq) models, particularly those enhanced with attention mechanisms, offer an alternative approach that directly addresses this limitation. Originally popularized in machine translation, these models have proven highly effective for ASR, directly mapping an input sequence of acoustic features $X = (x_1, \dots, x_T)$ to an output sequence of characters or phonemes $Y = (y_1, \dots, y_U)$.
At its core, an attention-based Seq2Seq model for ASR consists of two main components: an Encoder and a Decoder, linked by an Attention mechanism.
Basic architecture of an Attention-Based Encoder-Decoder model for ASR. The encoder processes the input audio features, the attention mechanism calculates a context vector based on encoder outputs and the current decoder state, and the decoder generates the output sequence token by token.
Encoder: This component processes the entire input sequence of acoustic features $(x_1, \dots, x_T)$. Typically, it's implemented using recurrent neural networks like LSTMs or GRUs (often bidirectional to capture context from both past and future frames) or stacked convolutional layers followed by RNNs. The encoder's role is to transform the input features into a sequence of higher-level representations, often called hidden states or annotations $(h_1, \dots, h_T)$. Each $h_t$ ideally summarizes the relevant information around time step $t$ in the input audio.

$$h_t = \text{Encoder}(x_1, \dots, x_T)_t$$
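As a concrete illustration, the sketch below shows a minimal bidirectional LSTM encoder in PyTorch. The layer sizes and the use of log-Mel filterbank features are assumptions made for the example, not requirements of the method.

```python
import torch.nn as nn

class AudioEncoder(nn.Module):
    """Bidirectional LSTM encoder: acoustic features -> annotations h_1..h_T."""
    def __init__(self, feat_dim=80, hidden_dim=256, num_layers=3):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden_dim, num_layers=num_layers,
                           batch_first=True, bidirectional=True)

    def forward(self, features):
        # features: (batch, T, feat_dim), e.g. log-Mel filterbank frames
        # outputs:  (batch, T, 2 * hidden_dim), one annotation h_t per input frame
        outputs, _ = self.rnn(features)
        return outputs
```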
Decoder: This component generates the output text sequence $(y_1, \dots, y_U)$ one token at a time. It's usually an autoregressive RNN (LSTM/GRU). At each output step $u$, the decoder receives the previously generated token $y_{u-1}$ and its own previous hidden state $s_{u-1}$ as input. Crucially, it also receives a context vector $c_u$, which is provided by the attention mechanism. Based on these inputs, it updates its hidden state $s_u$ and predicts the probability distribution for the next output token $y_u$ over the vocabulary (e.g., characters, subwords).

$$s_u = \text{DecoderRNN}(s_{u-1}, y_{u-1}, c_u)$$
$$P(y_u \mid y_{<u}, X) = \text{softmax}(\text{OutputLayer}(s_u, c_u))$$
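A single decoder step can be sketched as follows. The embedding size, the use of an LSTMCell, and concatenating $s_u$ with $c_u$ before the output layer are illustrative choices rather than the only possibility; the context vector $c_u$ is computed by the attention mechanism described next.

```python
import torch
import torch.nn as nn

class DecoderStep(nn.Module):
    """One autoregressive step: (s_{u-1}, y_{u-1}, c_u) -> (s_u, log P(y_u | y_<u, X))."""
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256, context_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.cell = nn.LSTMCell(embed_dim + context_dim, hidden_dim)
        self.output_layer = nn.Linear(hidden_dim + context_dim, vocab_size)

    def forward(self, prev_token, prev_state, context):
        # prev_token: (batch,) token ids y_{u-1}; context: (batch, context_dim) is c_u
        rnn_input = torch.cat([self.embed(prev_token), context], dim=-1)
        h, c = self.cell(rnn_input, prev_state)                 # s_u = (h, c)
        logits = self.output_layer(torch.cat([h, context], dim=-1))
        return (h, c), torch.log_softmax(logits, dim=-1)        # new state, log P(y_u | y_<u, X)
```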
The fixed-length context vector of early Seq2Seq models, where the encoder summarized the entire input into a single vector, becomes a serious bottleneck for long sequences like speech. The attention mechanism overcomes this by allowing the decoder to dynamically focus on different parts of the entire encoded input sequence $(h_1, \dots, h_T)$ when generating each output token $y_u$.
At each decoder step $u$, the attention mechanism calculates a context vector $c_u$ as a weighted sum of the encoder hidden states $h_t$:
$$c_u = \sum_{t=1}^{T} \alpha_{ut} h_t$$

The weights $\alpha_{ut}$ are called attention weights. They determine how much "attention" the decoder should pay to the encoder state $h_t$ when predicting the output token $y_u$. These weights are calculated based on the similarity or alignment between the current decoder state $s_{u-1}$ (acting as the "query") and each encoder hidden state $h_t$ (acting as the "keys").
Calculate Alignment Scores: An alignment model $e_{ut}$ scores how well the input around time $t$ matches the output at position $u$. A common choice is additive attention (Bahdanau-style):
$$e_{ut} = v_a^T \tanh(W_a s_{u-1} + V_a h_t + b_a)$$

Here, $v_a$, $W_a$, $V_a$, and $b_a$ are learnable parameters (weight vectors, matrices, and a bias) of the attention mechanism. Another popular option is dot-product attention (Luong-style), which is computationally simpler if the dimensions match:
$$e_{ut} = s_{u-1}^T W_a h_t$$

or simply $e_{ut} = s_{u-1}^T h_t$ if the dimensions allow a direct dot product.
Normalize Scores to Weights: The scores $e_{ut}$ are normalized using a softmax function across all input time steps to obtain the attention weights $\alpha_{ut}$. These weights sum to 1.
$$\alpha_{ut} = \frac{\exp(e_{ut})}{\sum_{k=1}^{T} \exp(e_{uk})}$$

Compute Context Vector: The context vector $c_u$ is computed as the weighted sum shown earlier. This vector provides a summary of the input audio specifically relevant for generating the output token $y_u$.
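Putting the three steps together, a minimal additive-attention module might look like the following sketch. The dimension names and the single-query formulation (one decoder state attending over all encoder states at once) are assumptions made for clarity.

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Bahdanau-style attention: (s_{u-1}, h_1..h_T) -> (c_u, alpha_u)."""
    def __init__(self, dec_dim=256, enc_dim=512, attn_dim=256):
        super().__init__()
        self.W_a = nn.Linear(dec_dim, attn_dim, bias=False)  # projects the decoder state s_{u-1}
        self.V_a = nn.Linear(enc_dim, attn_dim, bias=True)   # projects encoder states h_t (bias plays the role of b_a)
        self.v_a = nn.Linear(attn_dim, 1, bias=False)        # scoring vector v_a

    def forward(self, dec_state, enc_states):
        # dec_state:  (batch, dec_dim)     the "query" s_{u-1}
        # enc_states: (batch, T, enc_dim)  the "keys"/"values" h_1..h_T
        scores = self.v_a(torch.tanh(
            self.W_a(dec_state).unsqueeze(1) + self.V_a(enc_states)))   # e_{ut}, shape (batch, T, 1)
        alpha = torch.softmax(scores.squeeze(-1), dim=-1)               # alpha_{ut}, sums to 1 over t
        context = torch.bmm(alpha.unsqueeze(1), enc_states).squeeze(1)  # c_u = sum_t alpha_{ut} h_t
        return context, alpha
```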
This dynamic weighting allows the model to, for instance, focus on the audio segment corresponding to the phoneme /k/ when predicting the character 'c', then shift its focus to the /æ/ segment for 'a', and finally the /t/ segment for 't' when transcribing the word "cat".
Training: Attention-based models are typically trained end-to-end using maximum likelihood estimation. The objective is to maximize the probability of the correct output sequence given the input audio. The loss function is usually the sum or average of the cross-entropy losses at each decoder step:
$$L = -\sum_{u=1}^{U} \log P(y_u^* \mid y_{<u}^*, X)$$

where $y^*$ is the ground truth sequence. During training, a technique called teacher forcing is commonly used. Instead of feeding the decoder's own prediction from the previous step ($y_{u-1}$) as input for the current step, the ground truth token ($y_{u-1}^*$) is provided. This stabilizes training and speeds up convergence, though it can lead to a mismatch between training and inference conditions (known as exposure bias).
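A teacher-forced training step, using the hypothetical modules sketched above, could look like the following. Batching details, padding, and the start-of-sequence token id are simplified assumptions.

```python
import torch
import torch.nn.functional as F

def teacher_forced_loss(encoder, attention, decoder_step, features, targets, sos_id=1):
    """Average cross-entropy over a target sequence, feeding ground-truth tokens back in."""
    # features: (batch, T, feat_dim); targets: (batch, U) ground-truth token ids y*
    enc_states = encoder(features)                              # annotations h_1..h_T
    batch, U = targets.shape
    hidden_dim = decoder_step.cell.hidden_size
    state = (features.new_zeros(batch, hidden_dim),
             features.new_zeros(batch, hidden_dim))             # initial decoder state s_0
    prev_token = torch.full((batch,), sos_id, dtype=torch.long, device=features.device)
    loss = 0.0
    for u in range(U):
        context, _ = attention(state[0], enc_states)            # c_u computed from s_{u-1}
        state, log_probs = decoder_step(prev_token, state, context)
        loss = loss + F.nll_loss(log_probs, targets[:, u])      # -log P(y*_u | y*_<u, X)
        prev_token = targets[:, u]                              # teacher forcing: feed y*_u, not the prediction
    return loss / U
```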
Inference (Decoding): Generating the output sequence at test time requires producing one token at a time. Since the ground truth is unavailable, the decoder uses its own previously predicted token as input for the next step. Finding the most likely sequence $Y$ requires searching through the space of possible output sequences; because exhaustive search is intractable, greedy decoding or, more commonly, beam search is used to approximate the best hypothesis.
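For illustration, here is a simple greedy decoder built on the same hypothetical modules; beam search would instead keep several partial hypotheses at each step. The eos_id and max_len parameters are assumptions for the example.

```python
import torch

@torch.no_grad()
def greedy_decode(encoder, attention, decoder_step, features, sos_id=1, eos_id=2, max_len=200):
    """Greedy decoding: at each step, feed back the most probable token."""
    enc_states = encoder(features)                      # features: (1, T, feat_dim), one utterance
    hidden_dim = decoder_step.cell.hidden_size
    state = (features.new_zeros(1, hidden_dim), features.new_zeros(1, hidden_dim))
    prev_token = torch.tensor([sos_id], device=features.device)
    hypothesis = []
    for _ in range(max_len):
        context, _ = attention(state[0], enc_states)
        state, log_probs = decoder_step(prev_token, state, context)
        prev_token = log_probs.argmax(dim=-1)           # the model's own prediction y_u
        if prev_token.item() == eos_id:                 # stop at the end-of-sequence token
            break
        hypothesis.append(prev_token.item())
    return hypothesis
```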
Advantages: Each prediction is conditioned on the full output history and on the relevant portion of the audio, removing CTC's conditional independence assumption. The alignment between audio and text is learned implicitly through attention, so no explicit frame-level alignments are required, and the whole model is trained end-to-end with a single objective.
Disadvantages: The decoder normally needs the entire input utterance before attention can be computed, which complicates streaming recognition. Teacher forcing introduces exposure bias between training and inference, and the attention weights are not constrained to move monotonically through time, so alignments can occasionally drift on long or noisy inputs.
Attention-based encoder-decoder models represent a significant step in acoustic modeling, moving away from frame-wise classification with strong independence assumptions towards generating output sequences conditioned on the entire output history and the relevant parts of the audio input. While subsequent architectures like the RNN Transducer (discussed next) and Transformers offer improvements, particularly for streaming and parallelization, understanding the attention mechanism is fundamental to modern ASR.