Speech signals, whether represented as raw waveforms or extracted features like mel-spectrograms, are inherently sequential. The meaning of speech unfolds over time, and understanding or generating it requires models that can effectively capture temporal dependencies, often spanning long durations. While earlier sections reviewed foundational statistical models, deep learning provides powerful tools specifically designed for sequence modeling, forming the backbone of modern ASR and TTS systems. Feedforward networks, while useful for classification, lack the inherent structure to process variable-length sequences and maintain memory of past events. This section analyzes the architectures explicitly designed to handle sequential data, critical for advanced speech processing.
Recurrent Neural Networks (RNNs) were among the first deep learning architectures designed specifically for sequential data. The defining characteristic of an RNN is its recurrent connection: the output at a given timestep depends not only on the input at that timestep but also on the network's internal state (or "memory") from the previous timestep. This allows the network to retain information about past elements in the sequence.
Consider processing a sequence of audio feature vectors $x = (x_1, x_2, \dots, x_T)$. At each timestep $t$, the RNN updates its hidden state $h_t$ based on the current input $x_t$ and the previous hidden state $h_{t-1}$. A typical formulation is:
$$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h)$$

where $W_{hh}$ and $W_{xh}$ are weight matrices, $b_h$ is a bias vector, and $\tanh$ is a common choice of activation function. The output $y_t$ at timestep $t$ can then be computed from the hidden state:

$$y_t = W_{hy} h_t + b_y$$

A simple RNN processing a sequence, unrolled through time. The hidden state $h_t$ depends on the current input $x_t$ and the previous hidden state $h_{t-1}$.
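The recurrence above is straightforward to write out directly. The sketch below is a minimal NumPy implementation of these two equations over a sequence of feature vectors; the dimensions (80-dim inputs, 128 hidden units, 32 outputs) and the random weights are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim = 80, 128, 32   # illustrative sizes

W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_hy = rng.normal(scale=0.1, size=(output_dim, hidden_dim))
b_h = np.zeros(hidden_dim)
b_y = np.zeros(output_dim)

def rnn_forward(x_seq):
    """Run the simple RNN over a (T, input_dim) sequence of feature vectors."""
    h = np.zeros(hidden_dim)                          # initial hidden state h_0
    outputs = []
    for x_t in x_seq:                                 # one timestep at a time
        h = np.tanh(W_hh @ h + W_xh @ x_t + b_h)      # h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h)
        outputs.append(W_hy @ h + b_y)                # y_t = W_hy h_t + b_y
    return np.stack(outputs), h

x = rng.normal(size=(50, input_dim))                  # 50 frames of (fake) features
y, h_T = rnn_forward(x)
print(y.shape)                                        # (50, 32)
```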
While conceptually simple, standard RNNs struggle with learning long-range dependencies. During backpropagation through time (the process used to train RNNs), gradients can either vanish (become extremely small) or explode (become extremely large), making it difficult for the model to learn relationships between elements that are far apart in the sequence. This is a significant limitation for speech, where dependencies can span many frames (e.g., understanding context for disambiguation in ASR, or maintaining consistent prosody in TTS).
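A quick numerical sketch of why this happens: the gradient flowing back through $k$ timesteps is scaled by a product of $k$ Jacobians involving $W_{hh}$ (times $\tanh$ derivatives, which are at most 1), so its norm tends to shrink or grow geometrically with $k$. The weight scales below are arbitrary, chosen only to show the two regimes.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, steps = 128, 50

for scale in (0.05, 0.15):                     # "small" vs. "large" recurrent weights
    W_hh = rng.normal(scale=scale, size=(hidden_dim, hidden_dim))
    grad = np.ones(hidden_dim)                 # stand-in for dL/dh_T
    for _ in range(steps):                     # backpropagate 50 timesteps
        grad = W_hh.T @ grad                   # ignoring tanh' <= 1, which only shrinks it further
    print(f"scale={scale}: |grad| after {steps} steps = {np.linalg.norm(grad):.3e}")

# Typical output: one norm collapses toward zero (vanishing gradient),
# while the other grows by many orders of magnitude (exploding gradient).
```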
To address the vanishing gradient problem and improve the ability to capture long-term dependencies, gated RNN variants were developed. The most prominent are Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU).
LSTMs introduce a more complex internal structure than simple RNNs. They maintain a dedicated cell state ($c_t$) alongside the hidden state ($h_t$). Information flow into and out of the cell state, as well as updates to it, is controlled by three primary gates:

- Forget gate: decides how much of the previous cell state $c_{t-1}$ to discard.
- Input gate: decides how much new candidate information, derived from the current input and previous hidden state, to write into the cell state.
- Output gate: decides how much of the updated cell state to expose as the new hidden state $h_t$.
These gates are essentially small neural networks (typically with sigmoid or tanh activations) that learn to selectively pass, block, or modify information based on the current input and previous state. This gating mechanism allows LSTMs to maintain relevant information over much longer time scales than simple RNNs.
Diagram of an LSTM cell, highlighting the cell state ($c_t$) and the gates (Forget, Input, Output) that regulate information flow. Actual implementations involve specific matrix operations.
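In practice these gate equations come packaged in standard library layers rather than being written by hand. As a sketch, the snippet below runs PyTorch's nn.LSTM over a batch of feature sequences; the batch size, sequence length, and layer sizes are illustrative choices, not values from this text.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# 80-dim input features, 256 hidden units, 2 stacked layers (illustrative sizes)
lstm = nn.LSTM(input_size=80, hidden_size=256, num_layers=2, batch_first=True)

x = torch.randn(4, 200, 80)          # batch of 4 utterances, 200 frames each
outputs, (h_n, c_n) = lstm(x)        # outputs holds the top-layer h_t for every timestep

print(outputs.shape)                 # torch.Size([4, 200, 256])
print(h_n.shape, c_n.shape)          # torch.Size([2, 4, 256]) each: final h_T and c_T per layer
```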
GRUs are a newer, slightly simpler alternative to LSTMs. They also use gating mechanisms to control information flow but have only two gates and no separate cell state:

- Update gate: controls how much of the previous hidden state is carried forward versus replaced by the new candidate state.
- Reset gate: controls how much of the previous hidden state is used when computing that candidate state.
GRUs often perform comparably to LSTMs on many tasks, including speech processing, while being computationally slightly less expensive due to their simpler structure. Both LSTMs and GRUs became standard building blocks in ASR acoustic models, language models, and various components of TTS systems before the rise of Transformer architectures.
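As a rough comparison under the same illustrative sizes as above, the missing cell state shows up directly in the parameter count: a PyTorch GRU layer packs three blocks of weights (two gates plus the candidate state) where an LSTM packs four (three gates plus the candidate cell update).

```python
import torch.nn as nn

lstm = nn.LSTM(input_size=80, hidden_size=256, batch_first=True)
gru = nn.GRU(input_size=80, hidden_size=256, batch_first=True)

n_params = lambda m: sum(p.numel() for p in m.parameters())
print(n_params(lstm))   # 346,112 = 4 * (80*256 + 256*256 + 2*256)
print(n_params(gru))    # 259,584 = 3 * (80*256 + 256*256 + 2*256)
```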
While LSTMs and GRUs improved the handling of long sequences, sequence-to-sequence (Seq2Seq) models built with them (often used in ASR and TTS) typically relied on compressing the entire input sequence into a single fixed-size context vector. This vector, representing the "meaning" of the input, was then passed to the decoder to generate the output sequence. This fixed-size vector becomes an information bottleneck, especially for long input sequences common in speech.
Attention mechanisms provide a way to overcome this bottleneck. Instead of relying solely on a single context vector, the decoder is allowed to "attend" to different parts of the entire input sequence at each step of the output generation.
How it works (Conceptual):

- At each output step, the decoder's current state is compared against every encoder hidden state, producing an alignment score for each input position.
- The scores are normalized with a softmax, yielding attention weights that sum to one.
- A context vector is computed as the weighted sum of the encoder hidden states.
- The decoder combines this context vector with its own state to generate the next output element, and the process repeats at the following step.
This allows the decoder to dynamically focus on the most relevant parts of the input audio (for ASR) or input text (for TTS) as it generates the output sequence, significantly improving performance, especially for long utterances and complex alignments. Attention became a fundamental component in state-of-the-art encoder-decoder models for both ASR and TTS.
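A minimal sketch of this computation, using scaled dot-product scoring (one of several scoring functions used in practice); the function name and dimensions here are illustrative.

```python
import numpy as np

def attend(decoder_state, encoder_states):
    """decoder_state: (d,), encoder_states: (T, d) -> context vector (d,) and weights (T,)."""
    d = decoder_state.shape[0]
    scores = encoder_states @ decoder_state / np.sqrt(d)   # alignment score per input position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                               # softmax -> attention weights
    context = weights @ encoder_states                     # weighted sum of encoder states
    return context, weights

rng = np.random.default_rng(0)
enc = rng.normal(size=(120, 256))    # 120 encoder timesteps, 256-dim states
dec = rng.normal(size=256)           # current decoder state
context, w = attend(dec, enc)
print(context.shape, w.shape)        # (256,) (120,)
```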
The Transformer architecture, introduced initially for machine translation, revolutionized sequence modeling by demonstrating that recurrence is not strictly necessary. Transformers rely entirely on attention mechanisms, specifically self-attention, to model dependencies within the input and output sequences.
Key Components:

- Self-attention: every position in a sequence attends to every other position, letting the model relate distant elements directly.
- Multi-head attention: several attention operations run in parallel, each able to capture a different type of relationship.
- Positional encodings: because there is no recurrence, explicit position information is added to the input representations.
- Position-wise feed-forward networks: applied independently at each position after the attention sub-layer.
- Residual connections and layer normalization: wrap each sub-layer, stabilizing the training of deep stacks.
Simplified structure of a single Transformer block, showing the Multi-Head Attention and Feed-Forward Network layers, each followed by residual connections and layer normalization.
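This block maps closely onto PyTorch's nn.TransformerEncoderLayer. The sketch below stacks several such layers over projected speech features; all sizes are illustrative, and positional encodings are omitted here for brevity even though a real system would add them.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

d_model = 256
layer = nn.TransformerEncoderLayer(
    d_model=d_model, nhead=4, dim_feedforward=1024, dropout=0.1,
    batch_first=True,                   # multi-head self-attention + FFN, residuals, layer norm
)
encoder = nn.TransformerEncoder(layer, num_layers=6)

proj = nn.Linear(80, d_model)           # project 80-dim features into the model dimension
x = torch.randn(4, 200, 80)             # batch of 4 utterances, 200 frames each
out = encoder(proj(x))                  # every frame can attend to every other frame
print(out.shape)                        # torch.Size([4, 200, 256])
```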
Advantages:

- Parallelism: with no recurrence, all positions in a sequence can be processed simultaneously during training, making far better use of modern accelerators than step-by-step RNN computation.
- Long-range dependencies: self-attention connects any two positions directly rather than through a long chain of recurrent steps, so distant relationships are easier to learn.
- Scalability: the architecture scales effectively to deeper stacks and larger datasets, which underpins most current state-of-the-art speech models.
Transformers and their variants (like the Conformer, which combines Transformers with convolutions) have become the dominant architecture in state-of-the-art ASR systems (e.g., for acoustic modeling) and TTS systems (e.g., Transformer TTS for acoustic feature prediction). They form the basis for many of the advanced end-to-end models discussed in subsequent chapters.
Understanding these sequential architectures, from the foundational RNNs to the powerful Transformers, is essential. They provide the mechanisms to learn the complex temporal patterns inherent in speech signals and text sequences, enabling the development of sophisticated ASR and TTS systems capable of high performance and natural interaction. The specific ways these architectures are incorporated into end-to-end ASR models (Chapter 2) and advanced TTS models (Chapter 4) will build directly upon these concepts.