Following the success of attention mechanisms in sequence-to-sequence tasks, Transformer architectures, initially proposed for machine translation, have become highly influential in Automatic Speech Recognition. Unlike Recurrent Neural Networks (RNNs), which process input sequentially, Transformers rely entirely on self-attention mechanisms to compute representations of their input and output, allowing for significantly more parallel computation and more effective modeling of long-range dependencies.
At the heart of the Transformer is the self-attention mechanism. Recall the attention mechanism used in encoder-decoder models, where the decoder attended to the encoder's output. Self-attention applies this concept within a single sequence (either the input audio features or the output text sequence). It allows each position in the sequence to attend to all other positions, calculating a weighted sum of their representations based on relevance. This enables the model to directly relate different parts of the speech signal, regardless of their distance.
The most common form is Scaled Dot-Product Attention. For a given sequence element, we compute its query (Q), key (K), and value (V) vectors (typically through linear projections). The attention output is calculated as:
$$\text{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

Here, $d_k$ is the dimension of the key vectors. Scaling by $\sqrt{d_k}$ prevents the dot products from becoming too large, which could saturate the softmax function and hinder learning.
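For concreteness, here is a minimal PyTorch sketch of scaled dot-product attention. The tensor shapes and the optional mask argument are assumptions for this example rather than part of any particular ASR toolkit.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """Assumed shapes: q, k are (..., seq_len, d_k), v is (..., seq_len, d_v)."""
    d_k = q.size(-1)
    # Similarity between every query and every key, scaled by sqrt(d_k).
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # Masked (False/0) positions get -inf so softmax assigns them ~zero weight.
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    # Weighted sum of the value vectors.
    return torch.matmul(weights, v), weights
```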
Instead of performing a single attention function, Transformers employ multi-head attention. The Q, K, and V vectors are projected multiple times (once per "head") with different, learned linear projections. Attention is computed independently for each head in parallel, and the results are concatenated and projected again.
Flow of Multi-Head Self-Attention. Input representations are projected into Queries, Keys, and Values for multiple heads. Attention is computed independently within each head, and the results are concatenated and projected to form the final output.
This allows the model to jointly attend to information from different representation subspaces at different positions. Effectively, different heads can learn different types of relationships (e.g., local acoustic patterns, long-range dependencies).
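A hedged sketch of multi-head attention in PyTorch is shown below; it reuses the scaled_dot_product_attention helper from the previous example, and the d_model and num_heads values are illustrative placeholders.

```python
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=256, num_heads=4):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        # Learned projections for queries, keys, values, and the output.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_out = nn.Linear(d_model, d_model)

    def forward(self, query, key, value, mask=None):
        batch = query.size(0)

        def split_heads(x, proj):
            # (batch, seq_len, d_model) -> (batch, heads, seq_len, d_head)
            return proj(x).view(batch, -1, self.num_heads, self.d_head).transpose(1, 2)

        q = split_heads(query, self.w_q)
        k = split_heads(key, self.w_k)
        v = split_heads(value, self.w_v)
        # Attention runs independently (and in parallel) for each head.
        out, _ = scaled_dot_product_attention(q, k, v, mask)
        # Concatenate the heads and apply the final output projection.
        out = out.transpose(1, 2).contiguous().view(batch, -1, self.num_heads * self.d_head)
        return self.w_out(out)
```

For self-attention, query, key, and value are all the same sequence of frame representations.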
A typical Transformer-based ASR model consists of an encoder and, optionally, a decoder.
The encoder maps the input sequence of acoustic features (e.g., log-mel filterbanks) $X = (x_1, \dots, x_T)$ to a sequence of contextualized representations $Z = (z_1, \dots, z_T)$. It is usually a stack of identical layers. Each layer has two main sub-layers:

1. Multi-Head Self-Attention: lets every frame attend to all other frames in the sequence.
2. Position-wise Feed-Forward Network (FFN): a fully connected network applied independently to each position.
Residual connections are employed around each of the two sub-layers, followed by layer normalization. The output of each sub-layer is $\text{LayerNorm}(x + \text{Sublayer}(x))$, where $\text{Sublayer}(x)$ is the function implemented by the sub-layer itself.
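Combining the two sub-layers with the post-norm residual scheme above, a simplified encoder layer might look like the following sketch. It reuses the MultiHeadAttention class from the previous example, and the hyperparameters are illustrative.

```python
import torch.nn as nn

class TransformerEncoderLayer(nn.Module):
    def __init__(self, d_model=256, num_heads=4, d_ff=1024, dropout=0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Sub-layer 1: multi-head self-attention, residual connection, then LayerNorm.
        x = self.norm1(x + self.dropout(self.self_attn(x, x, x, mask)))
        # Sub-layer 2: position-wise feed-forward network, residual, then LayerNorm.
        x = self.norm2(x + self.dropout(self.ffn(x)))
        return x
```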
Since self-attention doesn't inherently process sequence order, positional information must be injected. This is done via Positional Encodings, which are added to the input embeddings at the bottom of the encoder stack. Common methods include fixed sinusoidal encodings or learned positional embeddings.
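A sketch of the fixed sinusoidal variant is given below; max_len and d_model are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn

class SinusoidalPositionalEncoding(nn.Module):
    def __init__(self, d_model=256, max_len=5000):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)                  # (max_len, 1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model)
        )
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
        self.register_buffer("pe", pe.unsqueeze(0))    # (1, max_len, d_model)

    def forward(self, x):
        # x: (batch, seq_len, d_model); add the encoding for each position.
        return x + self.pe[:, : x.size(1)]
```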
When used in an end-to-end sequence-to-sequence setup (like attention-based encoder-decoder models), a Transformer decoder is also used. It generates the output sequence (characters, phonemes, or words) one token at a time (autoregressively). In addition to the self-attention and FFN sub-layers found in the encoder, the decoder inserts a third sub-layer:

3. Multi-Head Cross-Attention: this layer attends to the output of the encoder ($Z$). The queries come from the previous decoder layer, while the keys and values come from the encoder output. This allows the decoder to incorporate information from the input speech signal when predicting the next output token.
The self-attention sub-layer in the decoder is masked to prevent positions from attending to subsequent positions. This ensures that the prediction for position $i$ can only depend on the known outputs at positions less than $i$, maintaining the autoregressive property.
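The mask itself is just a lower-triangular matrix over the target positions; a minimal sketch, compatible with the mask argument used in the attention examples above:

```python
import torch

def causal_mask(seq_len):
    # True = may attend, False = future position that must be hidden.
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

print(causal_mask(4))
# tensor([[ True, False, False, False],
#         [ True,  True, False, False],
#         [ True,  True,  True, False],
#         [ True,  True,  True,  True]])
```

In the MultiHeadAttention sketch above, cross-attention corresponds to calling the module with the decoder states as query and the encoder output $Z$ as key and value, without the causal mask.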
Alternatively, the Transformer encoder can be used as a powerful feature extractor whose output is fed into a final linear layer followed by a softmax, trained with the CTC loss function discussed earlier. This leverages the Transformer's strength in capturing context without requiring an autoregressive decoder, often resulting in simpler and faster inference.
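A hedged sketch of such a CTC head using PyTorch's built-in CTCLoss follows; the vocabulary size, sequence lengths, and blank index are illustrative assumptions.

```python
import torch
import torch.nn as nn

vocab_size = 32                                  # e.g. characters + CTC blank at index 0
d_model = 256
encoder_out = torch.randn(8, 200, d_model)       # (batch, frames, d_model), e.g. encoder output

ctc_head = nn.Linear(d_model, vocab_size)
log_probs = ctc_head(encoder_out).log_softmax(dim=-1)   # (batch, frames, vocab)
log_probs = log_probs.transpose(0, 1)                   # CTCLoss expects (frames, batch, vocab)

targets = torch.randint(1, vocab_size, (8, 20))         # dummy label sequences (no blanks)
input_lengths = torch.full((8,), 200, dtype=torch.long)
target_lengths = torch.full((8,), 20, dtype=torch.long)

loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)
print(loss)
```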
Advantages:

- Parallel computation: unlike RNNs, all time steps can be processed simultaneously during training, which speeds up training on long acoustic sequences.
- Long-range context: self-attention connects any two positions directly, making distant dependencies easier to capture than with recurrent models.

Challenges:

- Quadratic cost: self-attention scales quadratically with sequence length, which is significant for long acoustic feature sequences.
- Local correlations: plain self-attention has no built-in bias toward the fine-grained local patterns that are prominent in speech, which motivates the convolutional additions discussed next.
To address the challenge of modeling local correlations effectively, the Conformer architecture was introduced. It integrates Convolutional Neural Network (CNN) modules directly into the Transformer block structure. A typical Conformer block consists of a sequence of modules: a feed-forward (FFN) module, a multi-head self-attention module, a convolution module, and a second FFN module, all with appropriate normalization and residual connections.
Simplified view of a Conformer block, illustrating the combination of Feed Forward, Multi-Head Self-Attention, and Convolution modules with residual connections. Note: Actual residual connections often involve half-step Feed Forward modules sandwiching the main components.
The Conformer aims to capture both the global context modeling strengths of Transformers and the local feature extraction capabilities of CNNs, and it has become a widely adopted and highly effective architecture for acoustic modeling in state-of-the-art ASR systems.
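For concreteness, the sketch below mirrors that macaron-style structure (half-step feed-forward modules around the attention and convolution modules) in simplified PyTorch, reusing the MultiHeadAttention sketch from earlier. Module internals and hyperparameters are reduced and illustrative, not a faithful reproduction of the original Conformer.

```python
import torch.nn as nn

def feed_forward(d_model, d_ff):
    return nn.Sequential(
        nn.LayerNorm(d_model), nn.Linear(d_model, d_ff),
        nn.SiLU(), nn.Linear(d_ff, d_model),
    )

class ConvModule(nn.Module):
    """Simplified convolution module: pointwise conv + GLU, depthwise conv, pointwise conv."""
    def __init__(self, d_model, kernel_size=31):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.pointwise1 = nn.Conv1d(d_model, 2 * d_model, kernel_size=1)
        self.glu = nn.GLU(dim=1)
        self.depthwise = nn.Conv1d(d_model, d_model, kernel_size,
                                   padding=kernel_size // 2, groups=d_model)
        self.bn = nn.BatchNorm1d(d_model)
        self.act = nn.SiLU()
        self.pointwise2 = nn.Conv1d(d_model, d_model, kernel_size=1)

    def forward(self, x):                         # x: (batch, time, d_model)
        y = self.norm(x).transpose(1, 2)          # -> (batch, d_model, time) for Conv1d
        y = self.glu(self.pointwise1(y))
        y = self.act(self.bn(self.depthwise(y)))
        return self.pointwise2(y).transpose(1, 2)

class ConformerBlock(nn.Module):
    def __init__(self, d_model=256, num_heads=4, d_ff=1024):
        super().__init__()
        self.ffn1 = feed_forward(d_model, d_ff)
        self.attn_norm = nn.LayerNorm(d_model)
        self.self_attn = MultiHeadAttention(d_model, num_heads)  # sketch from earlier
        self.conv = ConvModule(d_model)
        self.ffn2 = feed_forward(d_model, d_ff)
        self.final_norm = nn.LayerNorm(d_model)

    def forward(self, x, mask=None):
        x = x + 0.5 * self.ffn1(x)                # half-step feed-forward module
        y = self.attn_norm(x)
        x = x + self.self_attn(y, y, y, mask)     # multi-head self-attention module
        x = x + self.conv(x)                      # convolution module
        x = x + 0.5 * self.ffn2(x)                # second half-step feed-forward module
        return self.final_norm(x)
```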
In summary, Transformer architectures, particularly Conformers, represent a significant step forward in acoustic modeling for ASR. Their ability to model long-range dependencies effectively through self-attention, combined with computational parallelizability (and convolutional enhancements in Conformers), makes them a cornerstone of modern speech recognition systems. The practical implementation details and training strategies associated with these models will be explored further in the hands-on sections and subsequent chapters.