While the Transformer architecture excels at capturing relationships across an entire audio sequence, it is not inherently optimized for learning the fine-grained local patterns that are significant in speech, such as phoneme transitions or co-articulation effects. A pure self-attention model treats every time step as equally distant in terms of computation, potentially overlooking the special importance of adjacent feature vectors. On the other hand, Convolutional Neural Networks (CNNs) are exceptionally good at detecting local patterns by sliding a kernel across the input, but they struggle to model long-range dependencies.
To get the benefits of both approaches, researchers at Google proposed the Conformer architecture. It effectively combines the self-attention mechanism of Transformers with the local pattern detection of CNNs into a single, powerful block. This hybrid design has become a foundation for many state-of-the-art ASR systems due to its ability to model both the local and global context of a speech utterance.
The core of the architecture is the Conformer block, which processes the sequence of input features. Unlike a standard Transformer block, which consists of a self-attention layer followed by a feed-forward network, the Conformer block inserts a convolution module in the middle and arranges the components in a specific "macaron-like" structure. A macaron has two identical cookies with a filling in the middle; similarly, the Conformer block has two half-step feed-forward layers sandwiching the attention and convolution modules.
A single block is composed of four main modules, each wrapped with a residual connection and layer normalization: a first half-step feed-forward module, a multi-head self-attention module, a convolution module, and a second half-step feed-forward module.
Diagram: the data flow through a Conformer block. The input is first processed by a half-step feed-forward module, then the self-attention module, the convolution module, and a final half-step feed-forward module, with residual connections around each step.
Let's examine the purpose of each component in this structure.
The block uses two half-step feed-forward networks, one at the beginning and one at the end. These are standard position-wise feed-forward networks, similar to those in a Transformer, except that each module's output is scaled by one half before being added back through its residual connection, which is what "half-step" refers to. The intuition behind splitting the feed-forward computation in two is that it helps with gradient flow during training. Each feed-forward module typically consists of two linear layers with a non-linear activation function in between, such as Swish or GELU.
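A minimal PyTorch sketch of such a half-step feed-forward module is shown below. The expansion factor, dropout rate, and the use of `nn.SiLU` for the Swish activation are illustrative choices, not fixed requirements.

```python
import torch
import torch.nn as nn

class FeedForwardModule(nn.Module):
    """Half-step feed-forward module (sketch).

    Layer norm -> Linear (expand) -> Swish -> Dropout -> Linear (project) -> Dropout.
    The output is scaled by 0.5 before the residual addition (the "half-step").
    """
    def __init__(self, d_model: int, expansion: int = 4, dropout: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.net = nn.Sequential(
            nn.Linear(d_model, d_model * expansion),
            nn.SiLU(),                      # Swish activation
            nn.Dropout(dropout),
            nn.Linear(d_model * expansion, d_model),
            nn.Dropout(dropout),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Half-step residual: add only half of the module's output.
        return x + 0.5 * self.net(self.norm(x))
```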
This is the standard multi-head self-attention mechanism from the Transformer architecture. Its function is to calculate an attention score for every pair of time steps in the input sequence, allowing the model to weigh the importance of different parts of the audio when processing a specific time step. This is where the model captures long-range, global dependencies like grammar and sentence context. For ASR, this module often uses relative positional encodings, which are better suited for speech than the absolute encodings used in the original Transformer.
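The sketch below wraps PyTorch's built-in `nn.MultiheadAttention` with a pre-norm, dropout, and residual connection. For brevity it uses standard attention rather than the relative positional encodings mentioned above, so treat it as a simplified stand-in for the full module.

```python
import torch
import torch.nn as nn

class SelfAttentionModule(nn.Module):
    """Multi-head self-attention with pre-norm, dropout, and a residual (sketch).

    Note: the original Conformer uses relative positional encodings; they are
    omitted here to keep the example short.
    """
    def __init__(self, d_model: int, num_heads: int = 4, dropout: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads,
                                          dropout=dropout, batch_first=True)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model)
        y = self.norm(x)
        y, _ = self.attn(y, y, y, need_weights=False)
        return x + self.dropout(y)
```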
This is the distinguishing feature of the Conformer. After the self-attention module has processed the global context, the convolution module is applied to explicitly learn localized patterns. It typically consists of a point-wise convolution with a GLU activation, followed by a 1D depthwise convolution, batch normalization, a Swish activation, and a final point-wise convolution.
By inserting this module, the Conformer directly encodes translation-invariant local correlations, meaning it can recognize a specific acoustic pattern regardless of where it appears in the audio stream.
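A PyTorch sketch of the convolution module following this layout might look as follows. The kernel size of 31 and the dropout rate are typical but illustrative values.

```python
import torch
import torch.nn as nn

class ConvolutionModule(nn.Module):
    """Conformer-style convolution module (sketch).

    Layer norm -> point-wise conv (expands to 2*d for GLU) -> GLU ->
    1D depthwise conv -> batch norm -> Swish -> point-wise conv -> dropout,
    with a residual connection around the whole module.
    """
    def __init__(self, d_model: int, kernel_size: int = 31, dropout: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.pointwise1 = nn.Conv1d(d_model, 2 * d_model, kernel_size=1)
        self.glu = nn.GLU(dim=1)
        self.depthwise = nn.Conv1d(d_model, d_model, kernel_size,
                                   padding=kernel_size // 2, groups=d_model)
        self.batch_norm = nn.BatchNorm1d(d_model)
        self.swish = nn.SiLU()
        self.pointwise2 = nn.Conv1d(d_model, d_model, kernel_size=1)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model); Conv1d expects (batch, channels, time).
        y = self.norm(x).transpose(1, 2)
        y = self.glu(self.pointwise1(y))                  # point-wise conv + GLU
        y = self.swish(self.batch_norm(self.depthwise(y)))
        y = self.dropout(self.pointwise2(y)).transpose(1, 2)
        return x + y
```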
The Conformer's effectiveness comes from its clever ordering of operations. The self-attention module first identifies global relationships in the sequence. Then, the convolution module refines these representations by focusing on local structure. This allows the model to learn features that benefit from both global context and local acoustic details simultaneously.
For example, to transcribe the word "cat," the self-attention module might use the broader sentence context to increase the probability that the word is a noun. At the same time, the convolution module can focus on the precise acoustic transitions between the /k/, /æ/, and /t/ sounds. The final feed-forward network integrates these two sources of information before passing the result to the next Conformer block.
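Putting the pieces together, a single Conformer block can be assembled from the module sketches above. This assumes the `FeedForwardModule`, `SelfAttentionModule`, and `ConvolutionModule` classes defined earlier are in scope; a final layer normalization closes the block.

```python
import torch
import torch.nn as nn

class ConformerBlock(nn.Module):
    """One Conformer block, composed from the module sketches above.

    Macaron order: half-step FFN -> self-attention -> convolution ->
    half-step FFN -> final layer norm. Each module applies its own
    residual connection internally.
    """
    def __init__(self, d_model: int = 256, num_heads: int = 4,
                 conv_kernel: int = 31, dropout: float = 0.1):
        super().__init__()
        self.ffn1 = FeedForwardModule(d_model, dropout=dropout)
        self.attn = SelfAttentionModule(d_model, num_heads, dropout)
        self.conv = ConvolutionModule(d_model, conv_kernel, dropout)
        self.ffn2 = FeedForwardModule(d_model, dropout=dropout)
        self.final_norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.ffn1(x)
        x = self.attn(x)
        x = self.conv(x)
        x = self.ffn2(x)
        return self.final_norm(x)

# Quick check: a batch of 2 utterances, 100 frames, 256-dimensional features.
block = ConformerBlock(d_model=256)
out = block(torch.randn(2, 100, 256))
print(out.shape)  # torch.Size([2, 100, 256])
```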
This architecture has consistently demonstrated superior performance on major ASR benchmarks compared to pure Transformer or LSTM-based models. Its ability to efficiently learn a richer set of features makes it a go-to choice for building high-accuracy acoustic models. Many of the most powerful pre-trained models available today, which you will learn to use next, are built upon this Conformer design.