While Connectionist Temporal Classification (CTC) provides a way to train an acoustic model without explicit alignment, its core assumption of conditional independence between output predictions is a significant limitation. Speech is inherently structured; the sound for "p" in "apple" is influenced by the preceding "a" and the following "l". To capture these dependencies, we need an architecture that models the output sequence directly. The Listen, Attend, and Spell (LAS) model was a groundbreaking end-to-end architecture that did exactly this.
Developed by Google researchers, LAS frames speech recognition as a sequence-to-sequence (Seq2Seq) problem, much like machine translation. It directly translates a sequence of audio features into a sequence of characters or words. The name itself neatly describes its three main components: a "Listener" that processes the audio, an "Attention" mechanism that focuses on relevant parts of it, and a "Speller" that generates the transcript.
The LAS architecture is an elegant composition of an encoder, a decoder, and an attention mechanism that links them.
Let's look at each of these parts in more detail.
The Listener functions as the acoustic model's encoder. Its primary goal is to learn a rich, compact representation of the input speech. Typically, the Listener is implemented as a stack of recurrent neural networks, most commonly bidirectional LSTMs (BLSTMs).
The input to the Listener is a sequence of feature vectors, $X = (x_1, x_2, \dots, x_T)$, where $T$ is the number of time steps in the audio. The BLSTM processes this sequence and produces a set of high-level feature vectors, or encoder hidden states, $H = (h_1, h_2, \dots, h_{T'})$.
Because LSTMs process sequences, they are well suited to capturing temporal patterns in speech. Using a bidirectional LSTM is particularly effective because it processes the audio in both forward and backward directions, allowing each hidden state $h_i$ to contain information about the entire utterance, not just the past.
Often, the Listener includes a pyramidal structure (pBLSTM), in which consecutive time steps are merged in higher layers. This progressively reduces the temporal length of the sequence ($T' < T$), creating a more compact representation and reducing the computational load for the attention mechanism.
The Listener processes input features through stacked pBLSTM layers to produce high-level hidden states.
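To make the time reduction concrete, here is a minimal sketch of a single pyramidal BLSTM layer in PyTorch. The class name, the layer sizes, and the choice to merge exactly two consecutive frames are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class PyramidalBLSTMLayer(nn.Module):
    """One pBLSTM layer: concatenates pairs of consecutive frames,
    halving the time dimension before the bidirectional LSTM."""
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        # The BLSTM input is two concatenated frames, hence 2 * input_dim.
        self.blstm = nn.LSTM(2 * input_dim, hidden_dim,
                             batch_first=True, bidirectional=True)

    def forward(self, x):
        # x: (batch, T, input_dim)
        batch, T, dim = x.shape
        if T % 2 == 1:                        # drop the final frame if T is odd
            x = x[:, :-1, :]
            T -= 1
        # Merge each pair of consecutive frames: (batch, T // 2, 2 * input_dim)
        x = x.contiguous().view(batch, T // 2, 2 * dim)
        outputs, _ = self.blstm(x)            # (batch, T // 2, 2 * hidden_dim)
        return outputs
```

Stacking three such layers on top of an initial BLSTM reduces the sequence length by a factor of eight, in line with the reduction used in the original LAS model.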
The Speller is an autoregressive decoder, meaning it generates the output sequence one token at a time, and each new prediction is conditioned on the previously generated tokens. This is where LAS explicitly models the output dependencies that CTC ignores.
The Speller is usually a unidirectional LSTM or GRU. At each decoding step $i$, it performs two actions: it uses the attention mechanism to compute a context vector $c_i$ that summarizes the relevant encoder states, and it combines this context vector with its own hidden state to predict a probability distribution over the next output token.
The process starts with a special start-of-sequence token <SOS> and continues until an end-of-sequence token <EOS> is generated.
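The sketch below shows what one decoding step might look like in PyTorch. The module name `SpellerStep`, the exact way the context vector is fed into the recurrent cell, and the `attend` callable (sketched in the attention discussion below) are assumptions for illustration; it also assumes the encoder output dimension equals `context_dim`.

```python
import torch
import torch.nn as nn

class SpellerStep(nn.Module):
    """One autoregressive decoding step of the Speller."""
    def __init__(self, vocab_size, embed_dim, hidden_dim, context_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # The cell consumes the previous character embedding and the
        # previous context vector.
        self.cell = nn.LSTMCell(embed_dim + context_dim, hidden_dim)
        self.output = nn.Linear(hidden_dim + context_dim, vocab_size)

    def forward(self, prev_token, prev_context, state, encoder_states, attend):
        # Update the decoder state from the previously generated token.
        inp = torch.cat([self.embed(prev_token), prev_context], dim=-1)
        h, c = self.cell(inp, state)
        # Attend over the encoder states, then predict the next character.
        context, weights = attend(h, encoder_states)
        logits = self.output(torch.cat([h, context], dim=-1))
        return logits, context, (h, c), weights
```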
The attention mechanism is the heart of this process. For each decoder step $i$, it compares the current decoder state $s_{i-1}$ with all the encoder hidden states $H = (h_1, h_2, \dots, h_{T'})$. This comparison produces a set of attention scores, or weights, which are then used to compute a weighted average of the encoder states. This average is the context vector $c_i$.
$$
\alpha_{ij} = \frac{\exp\big(\mathrm{score}(s_{i-1}, h_j)\big)}{\sum_{k=1}^{T'} \exp\big(\mathrm{score}(s_{i-1}, h_k)\big)},
\qquad
c_i = \sum_{j=1}^{T'} \alpha_{ij} h_j
$$
This allows the Speller to "pay attention" to the specific audio segment corresponding to the phoneme it is trying to transcribe at that moment. For example, when generating the character "p" in "apple", the attention weights would be highest for the encoder hidden states that correspond to the /p/ sound in the original audio.
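A minimal implementation of these two equations might look as follows. The original paper passes the states through small MLPs before taking a dot product; this sketch assumes a plain dot-product score and that the decoder state and encoder states share the same dimensionality.

```python
import torch

def attend(decoder_state, encoder_states):
    # decoder_state:  (batch, hidden_dim)        -> s_{i-1}
    # encoder_states: (batch, T_enc, hidden_dim) -> h_1 ... h_{T'}
    scores = torch.bmm(encoder_states, decoder_state.unsqueeze(2))  # (batch, T_enc, 1)
    weights = torch.softmax(scores.squeeze(2), dim=1)               # alpha_ij over T'
    context = torch.bmm(weights.unsqueeze(1), encoder_states)       # (batch, 1, hidden_dim)
    return context.squeeze(1), weights
```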
Putting it all together, the LAS model operates in a loop during inference.
First, the Listener encodes the entire utterance into a sequence of hidden states. Starting from the <SOS> token, the Speller then attends over those states at each step to build a context vector, predicts the next character, and feeds that prediction back in as the input for the following step. The loop repeats until the <EOS> token is produced.
The complete LAS architecture. The Listener encodes the audio, and for each output step, the Speller uses the Attention mechanism to create a context vector from the encoder states to predict the next character.
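A greedy-decoding sketch of this loop, reusing the hypothetical `listener`, `SpellerStep`, and `attend` pieces from the earlier sketches; the token ids and dimensions are illustrative, and a real system would use beam search, as discussed in the training section below.

```python
import torch

def greedy_decode(listener, speller_step, attend, features,
                  sos_id=1, eos_id=2, max_len=200,
                  hidden_dim=512, context_dim=512):
    # Encode the whole utterance once.
    encoder_states = listener(features)                  # (1, T', context_dim)
    token = torch.tensor([sos_id])
    context = torch.zeros(1, context_dim)
    state = (torch.zeros(1, hidden_dim), torch.zeros(1, hidden_dim))
    hypothesis = []
    for _ in range(max_len):
        logits, context, state, _ = speller_step(
            token, context, state, encoder_states, attend)
        token = logits.argmax(dim=-1)                    # greedy: most likely character
        if token.item() == eos_id:
            break
        hypothesis.append(token.item())
    return hypothesis
```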
LAS models are trained end-to-end using a standard cross-entropy loss between the predicted token probabilities and the ground-truth transcript. A common technique called teacher forcing is used during training, where the decoder is fed the correct previous token from the ground truth transcript at each step, rather than its own (potentially incorrect) prediction. This stabilizes training and helps the model learn faster. For inference, since the ground truth is unavailable, algorithms like beam search are used to explore multiple candidate transcriptions and find the most probable one.
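As a sketch, a single teacher-forced training step could look like this. Here `model` is a hypothetical module that runs the Listener and Speller over the whole ground-truth prefix and returns per-step logits, and `pad_id` is an assumed padding token.

```python
import torch.nn.functional as F

def training_step(model, optimizer, features, transcript, pad_id=0):
    # transcript: (batch, L) ground-truth token ids beginning with <SOS>
    # Teacher forcing: the decoder sees the correct previous token at every
    # step, so inputs are transcript[:, :-1] and targets are transcript[:, 1:].
    decoder_inputs = transcript[:, :-1]
    targets = transcript[:, 1:]
    logits = model(features, decoder_inputs)              # (batch, L - 1, vocab)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1),
                           ignore_index=pad_id)           # skip padding positions
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```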
While LAS was a major step forward, it has limitations. The strictly left-to-right, autoregressive nature of the Speller can be slow for inference and can struggle with very long utterances where the attention mechanism might lose focus. These challenges helped motivate the shift towards fully attention-based models like the Transformer, which we will discuss next.