Connectionist Temporal Classification (CTC) provides an effective way to train an acoustic model without a pre-aligned dataset. However, CTC rests on a strong assumption: the model's output at each time step is conditionally independent of all other outputs, given the input audio. In other words, the probability of predicting "c" at time $t$ does not depend on the model having predicted "a" at time $t-1$. This prevents the model from learning the linguistic dependencies between characters in the output sequence.
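Formally, for an input sequence $X$ and an alignment $\pi = (\pi_1, \dots, \pi_T)$, CTC factorizes the alignment probability into a product of per-frame terms, each conditioned only on the audio:

$$P(\pi \mid X) = \prod_{t=1}^{T} P(\pi_t \mid X)$$

No factor is conditioned on the labels emitted at other time steps, which is exactly the independence assumption described above.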
To address this, we can use an attention mechanism. Think of how a person transcribes audio. They don't process the entire soundwave in one go and then write the full sentence. Instead, they listen, write a few words, and might even replay a small segment to catch a difficult word. Attention allows a model to mimic this behavior by selectively focusing on different parts of the input audio sequence when generating each part of the output transcript. It enables the model to weigh the importance of different audio frames for each output character or word, effectively creating a dynamic and context-sensitive alignment.
An attention mechanism typically operates as a bridge between an encoder and a decoder.
The process of computing the context vector for each output step can be broken down into three parts. Let's assume the decoder is about to generate the $i$-th character of the transcript. It uses its previous hidden state, $s_{i-1}$, to query the encoder's output vectors, $h_1, \dots, h_T$.
First, the model needs a way to score how well each input frame $h_j$ aligns with the current output being generated, which is represented by the decoder state $s_{i-1}$. This is done using a scoring function. A common approach, known as additive attention, uses a small feed-forward neural network:
$$e_{ij} = \text{score}(s_{i-1}, h_j)$$

The score, $e_{ij}$, quantifies the relevance of the $j$-th audio frame to the $i$-th output character. A higher score means the frame is more important for the current decoding step.
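One common parameterization of this scorer is $e_{ij} = v^\top \tanh(W_s s_{i-1} + W_h h_j)$, where $W_s$, $W_h$, and $v$ are learned. Below is a minimal NumPy sketch of that idea; the dimensions and random initialization are assumptions chosen purely for illustration, not values from a real model:

```python
import numpy as np

# Toy dimensions, assumed purely for illustration.
T = 50          # number of encoder output frames
enc_dim = 512   # size of each encoder vector h_j
dec_dim = 256   # size of the decoder state s_{i-1}
att_dim = 128   # hidden size of the scoring network

rng = np.random.default_rng(0)
H = rng.standard_normal((T, enc_dim))      # encoder outputs h_1, ..., h_T
s_prev = rng.standard_normal(dec_dim)      # decoder state s_{i-1}

# Parameters of the additive scorer; randomly initialized here, learned in practice.
W_s = 0.01 * rng.standard_normal((att_dim, dec_dim))
W_h = 0.01 * rng.standard_normal((att_dim, enc_dim))
v = 0.01 * rng.standard_normal(att_dim)

def additive_score(s_prev, H):
    """Compute e_{ij} = v^T tanh(W_s s_{i-1} + W_h h_j) for every frame j at once."""
    hidden = np.tanh(W_s @ s_prev + H @ W_h.T)   # shape (T, att_dim)
    return hidden @ v                            # shape (T,), one score per frame

e = additive_score(s_prev, H)
print(e.shape)   # (50,)
```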
The raw scores $e_{ij}$ are not very useful on their own because their scale can vary wildly. To normalize them into a more interpretable form, we pass them through a softmax function. This converts the scores into a probability distribution, called attention weights, denoted by $\alpha_{ij}$.
Each weight $\alpha_{ij}$ is a value between 0 and 1, and all the weights for a given decoding step $i$ sum to 1 ($\sum_{j=1}^{T} \alpha_{ij} = 1$). You can think of these weights as the amount of "attention" the decoder should pay to each specific audio frame when generating the current output character.
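A small, self-contained sketch of this normalization step (the score values below are made up for illustration):

```python
import numpy as np

def softmax(x):
    # Subtract the maximum for numerical stability before exponentiating.
    z = np.exp(x - np.max(x))
    return z / z.sum()

# Illustrative raw scores e_{ij} for a single decoding step i.
e = np.array([0.2, 2.5, 4.1, 1.0, -0.3])

alpha = softmax(e)       # attention weights alpha_{ij}
print(alpha)             # each weight lies in (0, 1)
print(alpha.sum())       # 1.0
```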
Finally, the context vector, $c_i$, is calculated as the weighted sum of all the encoder hidden states. The weights used in this sum are the attention weights $\alpha_{ij}$ we just computed.
$$c_i = \sum_{j=1}^{T} \alpha_{ij} h_j$$

This context vector is a summary of the input audio, tailored specifically for generating the $i$-th output character. It contains the most relevant acoustic information needed for the current prediction. The decoder then uses this context vector, along with its own hidden state, to predict the next character in the sequence. This entire process repeats until the decoder generates a special end-of-sequence token.
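Putting the three parts together, one full attention step can be sketched as a single function. The dimensions and random inputs below are assumptions for illustration; in a trained model the encoder outputs, decoder state, and weights all come from learned components:

```python
import numpy as np

def attention_context(s_prev, H, W_s, W_h, v):
    """One attention step: score every frame, normalize, and weight-sum the encoder states."""
    e = np.tanh(W_s @ s_prev + H @ W_h.T) @ v   # additive scores e_{ij}, shape (T,)
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                         # attention weights alpha_{ij}
    c = alpha @ H                                # context vector c_i = sum_j alpha_{ij} h_j
    return c, alpha

# Tiny example with random inputs and parameters (dimensions are illustrative).
rng = np.random.default_rng(0)
T, enc_dim, dec_dim, att_dim = 50, 512, 256, 128
H = rng.standard_normal((T, enc_dim))
s_prev = rng.standard_normal(dec_dim)
W_s = 0.01 * rng.standard_normal((att_dim, dec_dim))
W_h = 0.01 * rng.standard_normal((att_dim, enc_dim))
v = 0.01 * rng.standard_normal(att_dim)

c_i, alpha_i = attention_context(s_prev, H, W_s, W_h, v)
print(c_i.shape)       # (512,), same size as one encoder vector
print(alpha_i.sum())   # 1.0
```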
One of the great advantages of attention is its interpretability. By plotting the attention weights $\alpha_{ij}$ in a heatmap, we can see exactly which parts of the input audio the model is focusing on for each output character.
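As a rough sketch, assuming the weights for every decoding step have been collected into a matrix `alphas` with one row per output character and one column per audio frame, such a heatmap can be drawn with matplotlib (the weights below are fabricated to produce a roughly diagonal pattern):

```python
import numpy as np
import matplotlib.pyplot as plt

# alphas[i, j]: attention weight on audio frame j when generating output character i.
chars = list("HELLO")
num_frames = 40
frames = np.arange(num_frames)
alphas = np.stack([
    np.exp(-0.5 * ((frames - (i + 0.5) * num_frames / len(chars)) / 3.0) ** 2)
    for i in range(len(chars))
])
alphas /= alphas.sum(axis=1, keepdims=True)   # each row sums to 1

plt.imshow(alphas, aspect="auto", cmap="Blues")
plt.yticks(range(len(chars)), chars)
plt.xlabel("Input audio frame")
plt.ylabel("Output character")
plt.colorbar(label="Attention weight")
plt.tight_layout()
plt.show()
```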
The alignment between input audio frames and the output characters for the word "HELLO". The darker blue indicates higher attention weights. Notice the strong diagonal pattern, showing that as the model generates the transcript from 'H' to 'O', its focus moves progressively through the audio frames.
This clear, monotonic alignment is typical for speech recognition and provides a valuable tool for debugging. If the attention pattern looks chaotic or nonsensical, it often indicates a problem with model training.
By incorporating attention, our ASR models are no longer constrained by the rigid assumptions of CTC. They can learn a soft, data-driven alignment between audio and text, leading to significant improvements in transcription accuracy, especially for longer and more complex utterances. This mechanism is a foundational element of the modern sequence-to-sequence models we will discuss next.