While Connectionist Temporal Classification (CTC) provides a way to train an acoustic model without explicit alignment, its core assumption of conditional independence between output predictions is a significant limitation. Speech is inherently structured; the sound for "p" in "apple" is influenced by the preceding "a" and the following "l". To capture these dependencies, we need an architecture that models the output sequence directly. The Listen, Attend, and Spell (LAS) model was a groundbreaking end-to-end architecture that did exactly this.
Developed by Google researchers, LAS frames speech recognition as a sequence-to-sequence (Seq2Seq) problem, much like machine translation. It directly translates a sequence of audio features into a sequence of characters or words. The name itself neatly describes its three main components: a "Listener" that processes the audio, an "Attention" mechanism that focuses on relevant parts of it, and a "Speller" that generates the transcript.
The LAS architecture is an elegant composition of an encoder, a decoder, and an attention mechanism that links them.
Let's look at each of these parts in more detail.
The Listener functions as the acoustic model's encoder. Its primary goal is to learn a rich, compact representation of the input speech. Typically, the Listener is implemented as a stack of recurrent neural networks, most commonly bidirectional LSTMs (BLSTMs).
The input to the Listener is a sequence of feature vectors, $X = (x_1, x_2, \dots, x_T)$, where $T$ is the number of time steps in the audio. The BLSTM processes this sequence and produces a set of high-level feature vectors, or encoder hidden states, $H = (h_1, h_2, \dots, h_{T'})$.
Because LSTMs process sequences, they are well suited to capturing temporal patterns in speech. Using a bidirectional LSTM is particularly effective because it processes the audio in both forward and backward directions, allowing each hidden state $h_i$ to contain information about the entire utterance, not just the past.
Often, the Listener includes a pyramidal structure (pBLSTM), in which consecutive time steps are merged in higher layers. This progressively reduces the temporal length of the sequence ($T' < T$), creating a more compact representation and reducing the computational load for the attention mechanism.
The Listener processes input features through stacked pBLSTM layers to produce high-level hidden states.
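To make the time reduction concrete, here is a minimal sketch of a single pyramidal BLSTM layer in PyTorch. The class name, the layer sizes, and the choice to merge exactly two consecutive frames are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class PyramidalBLSTMLayer(nn.Module):
    """One pBLSTM layer: concatenates pairs of consecutive frames,
    halving the time dimension before the bidirectional LSTM."""
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        # The BLSTM input is two concatenated frames, hence 2 * input_dim.
        self.blstm = nn.LSTM(2 * input_dim, hidden_dim,
                             batch_first=True, bidirectional=True)

    def forward(self, x):
        # x: (batch, T, input_dim)
        batch, T, dim = x.shape
        if T % 2 == 1:                        # drop the final frame if T is odd
            x = x[:, :-1, :]
            T -= 1
        # Merge each pair of consecutive frames: (batch, T // 2, 2 * input_dim)
        x = x.contiguous().view(batch, T // 2, 2 * dim)
        outputs, _ = self.blstm(x)            # (batch, T // 2, 2 * hidden_dim)
        return outputs
```

Stacking three such layers on top of an initial BLSTM reduces the sequence length by a factor of eight, in line with the reduction used in the original LAS model.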
The Speller is an autoregressive decoder, meaning it generates the output sequence one token at a time, and each new prediction is conditioned on the previously generated tokens. This is where LAS explicitly models the output dependencies that CTC ignores.
The Speller is usually a unidirectional LSTM or GRU. At each decoding step $i$, it performs two actions: it uses the attention mechanism to compute a context vector $c_i$ that summarizes the relevant encoder states, and it combines this context vector with its own hidden state to predict a probability distribution over the next output token.
The process starts with a special start-of-sequence token <SOS> and continues until an end-of-sequence token <EOS> is generated.
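The sketch below shows what one decoding step might look like in PyTorch. The module name `SpellerStep`, the exact way the context vector is fed into the recurrent cell, and the `attend` callable (sketched in the attention discussion below) are assumptions for illustration; it also assumes the encoder output dimension equals `context_dim`.

```python
import torch
import torch.nn as nn

class SpellerStep(nn.Module):
    """One autoregressive decoding step of the Speller."""
    def __init__(self, vocab_size, embed_dim, hidden_dim, context_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # The cell consumes the previous character embedding and the
        # previous context vector.
        self.cell = nn.LSTMCell(embed_dim + context_dim, hidden_dim)
        self.output = nn.Linear(hidden_dim + context_dim, vocab_size)

    def forward(self, prev_token, prev_context, state, encoder_states, attend):
        # Update the decoder state from the previously generated token.
        inp = torch.cat([self.embed(prev_token), prev_context], dim=-1)
        h, c = self.cell(inp, state)
        # Attend over the encoder states, then predict the next character.
        context, weights = attend(h, encoder_states)
        logits = self.output(torch.cat([h, context], dim=-1))
        return logits, context, (h, c), weights
```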
The attention mechanism is the heart of this process. For each decoder step $i$, it compares the current decoder state $s_{i-1}$ with all the encoder hidden states $H = (h_1, h_2, \dots, h_{T'})$. This comparison produces a set of attention scores, or weights, which are then used to compute a weighted average of the encoder states. This average is the context vector $c_i$.
$$
\alpha_{ij} = \frac{\exp\big(\mathrm{score}(s_{i-1}, h_j)\big)}{\sum_{k=1}^{T'} \exp\big(\mathrm{score}(s_{i-1}, h_k)\big)},
\qquad
c_i = \sum_{j=1}^{T'} \alpha_{ij} h_j
$$
This allows the Speller to "pay attention" to the specific audio segment corresponding to the phoneme it is trying to transcribe at that moment. For example, when generating the character "p" in "apple", the attention weights would be highest for the encoder hidden states that correspond to the /p/ sound in the original audio.
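A minimal implementation of these two equations might look as follows. The original paper passes the states through small MLPs before taking a dot product; this sketch assumes a plain dot-product score and that the decoder state and encoder states share the same dimensionality.

```python
import torch

def attend(decoder_state, encoder_states):
    # decoder_state:  (batch, hidden_dim)        -> s_{i-1}
    # encoder_states: (batch, T_enc, hidden_dim) -> h_1 ... h_{T'}
    scores = torch.bmm(encoder_states, decoder_state.unsqueeze(2))  # (batch, T_enc, 1)
    weights = torch.softmax(scores.squeeze(2), dim=1)               # alpha_ij over T'
    context = torch.bmm(weights.unsqueeze(1), encoder_states)       # (batch, 1, hidden_dim)
    return context.squeeze(1), weights
```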
Putting it all together, the LAS model operates in a loop during inference.
First, the Listener encodes the entire utterance into a sequence of hidden states. Starting from the <SOS> token, the Speller then attends over those states at each step to build a context vector, predicts the next character, and feeds that prediction back in as the input for the following step. The loop repeats until the <EOS> token is produced.
The complete LAS architecture. The Listener encodes the audio, and for each output step, the Speller uses the Attention mechanism to create a context vector from the encoder states to predict the next character.
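A greedy-decoding sketch of this loop, reusing the hypothetical `listener`, `SpellerStep`, and `attend` pieces from the earlier sketches; the token ids and dimensions are illustrative, and a real system would use beam search, as discussed in the training section below.

```python
import torch

def greedy_decode(listener, speller_step, attend, features,
                  sos_id=1, eos_id=2, max_len=200,
                  hidden_dim=512, context_dim=512):
    # Encode the whole utterance once.
    encoder_states = listener(features)                  # (1, T', context_dim)
    token = torch.tensor([sos_id])
    context = torch.zeros(1, context_dim)
    state = (torch.zeros(1, hidden_dim), torch.zeros(1, hidden_dim))
    hypothesis = []
    for _ in range(max_len):
        logits, context, state, _ = speller_step(
            token, context, state, encoder_states, attend)
        token = logits.argmax(dim=-1)                    # greedy: most likely character
        if token.item() == eos_id:
            break
        hypothesis.append(token.item())
    return hypothesis
```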
LAS models are trained end-to-end using a standard cross-entropy loss between the predicted token probabilities and the ground-truth transcript. A common technique called teacher forcing is used during training, where the decoder is fed the correct previous token from the ground truth transcript at each step, rather than its own (potentially incorrect) prediction. This stabilizes training and helps the model learn faster. For inference, since the ground truth is unavailable, algorithms like beam search are used to explore multiple candidate transcriptions and find the most probable one.
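As a sketch, a single teacher-forced training step could look like this. Here `model` is a hypothetical module that runs the Listener and Speller over the whole ground-truth prefix and returns per-step logits, and `pad_id` is an assumed padding token.

```python
import torch.nn.functional as F

def training_step(model, optimizer, features, transcript, pad_id=0):
    # transcript: (batch, L) ground-truth token ids beginning with <SOS>
    # Teacher forcing: the decoder sees the correct previous token at every
    # step, so inputs are transcript[:, :-1] and targets are transcript[:, 1:].
    decoder_inputs = transcript[:, :-1]
    targets = transcript[:, 1:]
    logits = model(features, decoder_inputs)              # (batch, L - 1, vocab)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1),
                           ignore_index=pad_id)           # skip padding positions
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```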
While LAS was a major step forward, it has limitations. The strictly left-to-right, autoregressive nature of the Speller can be slow for inference and can struggle with very long utterances where the attention mechanism might lose focus. These challenges helped motivate the shift towards fully attention-based models like the Transformer, which we will discuss next.