Building on the foundational components of TTS systems, we now examine a dominant class of acoustic models responsible for generating intermediate representations like mel-spectrograms from input text: autoregressive models. These models generate the output sequence one step at a time, where each output prediction is conditioned on the previously generated outputs. This sequential generation process often leads to high-fidelity results, capturing fine-grained temporal dependencies in speech.
Two prominent families of autoregressive acoustic models are Tacotron and Transformer-based TTS systems.
The Autoregressive Principle in TTS
At its core, an autoregressive acoustic model predicts the sequence of acoustic feature frames $Y = (y_1, y_2, \ldots, y_T)$ given an input text sequence $X = (x_1, x_2, \ldots, x_N)$. The "autoregressive" property means the probability of the entire sequence is factorized as a product of conditional probabilities:
$$P(Y \mid X) = \prod_{t=1}^{T} P(y_t \mid y_{<t}, C(X))$$
Here, $y_t$ is the acoustic feature vector (e.g., a mel-spectrogram frame) at time step $t$, $y_{<t} = (y_1, \ldots, y_{t-1})$ represents all previously generated frames, and $C(X)$ is a context vector derived from the entire input text sequence $X$, typically provided by an encoder and an attention mechanism. The model generates $y_1$, then uses $y_1$ to help generate $y_2$, and so on, until a special stop condition is met.
This step-by-step generation allows the model to maintain internal consistency within the generated audio signal, contributing to the naturalness of the output. However, this sequential dependency inherently limits inference speed, as generating the $t$-th frame requires the $(t-1)$-th frame to be computed first.
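In code, the generation loop can be sketched as follows. This is a minimal sketch, assuming a hypothetical `decoder_step` callable that stands in for whatever network computes one frame; the all-zero "go" frame, the frame cap, and the stop threshold are illustrative choices rather than fixed conventions.

```python
import torch

def autoregressive_inference(decoder_step, encoder_outputs,
                             n_mels=80, max_frames=1000, stop_threshold=0.5):
    """Generic autoregressive decoding loop.

    `decoder_step` is a hypothetical callable mapping (previous frame,
    decoder state, encoder outputs) to (next frame, stop logit, new state).
    """
    frame = torch.zeros(1, n_mels)   # all-zero "go" frame starts the sequence
    state = None                     # e.g. LSTM hidden state, built lazily
    frames = []
    for _ in range(max_frames):      # hard cap in case the stop token never fires
        frame, stop_logit, state = decoder_step(frame, state, encoder_outputs)
        frames.append(frame)
        if torch.sigmoid(stop_logit).item() > stop_threshold:
            break                    # model predicts the utterance has ended
    return torch.cat(frames, dim=0)  # (T, n_mels) mel-spectrogram
```

Each iteration feeds the frame just produced back in as input, which is exactly the dependency that prevents this loop from being parallelized at inference time.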
Tacotron and Tacotron 2
The Tacotron model, particularly its refinement Tacotron 2, represents a landmark end-to-end architecture for TTS using autoregressive generation. It directly maps input character or phoneme sequences to mel-spectrograms without requiring complex intermediate linguistic features or duration models found in older parametric systems.
Architecture Overview
A typical Tacotron 2 model consists of several key components:
- Encoder: Processes the input text sequence. It often starts with an embedding layer to convert input characters or phonemes into dense vectors. These embeddings are then fed into a stack of convolutional layers followed by a bidirectional LSTM (BiLSTM). The BiLSTM processes the sequence in both forward and backward directions, creating hidden states that capture contextual information from the entire input text for each input element.
- Attention Mechanism: Connects the encoder outputs to the decoder. At each decoder time step t, the attention mechanism computes alignment weights over the encoder outputs based on the previous decoder state. This allows the decoder to focus on the relevant parts of the input text when generating the current acoustic frame. Tacotron 2 uses a location-sensitive attention mechanism, which incorporates cumulative attention weights from previous steps, helping to ensure that the attention moves forward consistently through the input sequence and reducing alignment errors like skipping or repeating words.
- Decoder: An autoregressive recurrent neural network (typically an LSTM or GRU stack). At each step $t$, it receives the attention context vector (the weighted sum of encoder outputs) and the previously generated mel-spectrogram frame $y_{t-1}$ as input. It then predicts the current mel-spectrogram frame $y_t$ and updates its internal state; a minimal code sketch of this step appears after the figure description below.
- Stop Token Prediction: Alongside predicting the mel-spectrogram frame, the decoder also predicts the probability that generation should stop at the current step. This is usually done via a linear layer followed by a sigmoid activation, trained as a binary classification task. Generation halts when this probability exceeds a predefined threshold.
- Post-Net (Optional but common): A stack of convolutional layers applied to the generated mel-spectrogram sequence. This acts as a residual refinement module, improving the quality of the final output by predicting a correction term that is added to the decoder's output spectrogram.
Figure: Simplified architecture of an attention-based autoregressive acoustic model like Tacotron 2. Text is encoded, attention aligns encoder outputs with decoder steps, and the decoder generates mel-spectrogram frames sequentially, using the previous frame as input for the next.
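To make the data flow concrete, here is a heavily simplified PyTorch sketch of a single decoder step. The attention scorer below is a plain content-based stand-in for Tacotron 2's location-sensitive attention (it ignores cumulative weights), the layer sizes are merely typical values rather than the published configuration, and the class name is our own.

```python
import torch
import torch.nn as nn

class Tacotron2DecoderStep(nn.Module):
    """Illustrative single decoder step; layer sizes are typical, not canonical."""

    def __init__(self, n_mels=80, enc_dim=512, rnn_dim=1024, prenet_dim=256):
        super().__init__()
        # Pre-net: compresses the previous frame before it enters the RNN.
        self.prenet = nn.Sequential(
            nn.Linear(n_mels, prenet_dim), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(prenet_dim, prenet_dim), nn.ReLU(), nn.Dropout(0.5),
        )
        self.rnn = nn.LSTMCell(prenet_dim + enc_dim, rnn_dim)
        # Simplified content-based scorer standing in for location-sensitive attention.
        self.attention = nn.Linear(rnn_dim + enc_dim, 1)
        self.frame_proj = nn.Linear(rnn_dim + enc_dim, n_mels)  # next mel frame
        self.stop_proj = nn.Linear(rnn_dim + enc_dim, 1)        # stop-token logit

    def forward(self, prev_frame, context, hidden, encoder_outputs):
        # prev_frame: (B, n_mels)   context: (B, enc_dim)
        # hidden: LSTM (h, c) or None   encoder_outputs: (B, N, enc_dim)
        x = torch.cat([self.prenet(prev_frame), context], dim=-1)
        h, c = self.rnn(x, hidden)
        # Score every encoder position against the current decoder state.
        n = encoder_outputs.size(1)
        scores = self.attention(torch.cat(
            [h.unsqueeze(1).expand(-1, n, -1), encoder_outputs], dim=-1)).squeeze(-1)
        weights = torch.softmax(scores, dim=-1)                               # (B, N)
        new_context = torch.bmm(weights.unsqueeze(1), encoder_outputs).squeeze(1)
        out = torch.cat([h, new_context], dim=-1)
        return self.frame_proj(out), self.stop_proj(out), new_context, (h, c)
```

At inference, such a module is called once per output frame, with the returned context, hidden state, and predicted frame fed back in at the next step.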
Training and Characteristics
Tacotron models are trained end-to-end using a dataset of (text, audio) pairs. The audio is converted to mel-spectrograms, which serve as the ground truth targets for the decoder. A common loss function is the mean squared error (MSE) or L1 loss between the predicted and target mel-spectrograms, summed over all frames. An additional binary cross-entropy loss is used for the stop token prediction.
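A minimal sketch of that combined objective is shown below, assuming the model returns both the decoder output and the post-net-refined output along with per-frame stop logits; actual implementations may weight the terms differently or use L1 instead of MSE.

```python
import torch.nn.functional as F

def tacotron_loss(mel_pred, mel_post, mel_target, stop_logits, stop_targets):
    """Combined spectrogram + stop-token loss (a typical Tacotron-2-style setup).

    mel_pred:     decoder output before the post-net       (B, T, n_mels)
    mel_post:     decoder output after the post-net        (B, T, n_mels)
    mel_target:   ground-truth mel-spectrogram             (B, T, n_mels)
    stop_logits:  raw stop-token logits, one per frame     (B, T)
    stop_targets: 1.0 at and after the last frame, else 0  (B, T), float
    """
    mel_loss = F.mse_loss(mel_pred, mel_target) + F.mse_loss(mel_post, mel_target)
    stop_loss = F.binary_cross_entropy_with_logits(stop_logits, stop_targets)
    return mel_loss + stop_loss
```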
During training, a technique called "teacher forcing" is typically employed. Instead of feeding the model's own predicted frame from the previous step as input to the decoder, the ground truth frame from the previous step is provided. This stabilizes training, especially in the early stages, by preventing the accumulation of errors. However, it introduces a mismatch between training and inference (known as exposure bias), as during inference, the model must rely on its own, potentially imperfect, predictions.
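The difference is easiest to see in the unrolling code: with teacher forcing, the decoder input at step $t$ is the ground-truth frame $t-1$, i.e. the target sequence shifted right by one position. The sketch below reuses the hypothetical `decoder_step` callable from earlier.

```python
import torch

def decode_teacher_forced(decoder_step, encoder_outputs, mel_target, go_frame):
    """Unroll the decoder with teacher forcing: the ground-truth previous
    frame is fed at every step instead of the model's own prediction."""
    batch, T, _ = mel_target.shape
    # Shift targets right by one: step t sees the true frame t-1 as its input.
    prev_frames = torch.cat([go_frame.unsqueeze(1), mel_target[:, :-1]], dim=1)
    state, frames, stop_logits = None, [], []
    for t in range(T):
        frame, stop_logit, state = decoder_step(prev_frames[:, t], state, encoder_outputs)
        frames.append(frame)
        stop_logits.append(stop_logit)
    return torch.stack(frames, dim=1), torch.stack(stop_logits, dim=1)
```

At inference time, the only change is that `prev_frames[:, t]` is replaced by the frame the model produced at step $t-1$, which is where exposure bias enters.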
Tacotron 2 set a high standard for TTS quality, capable of generating speech nearly indistinguishable from human recordings for many speakers and domains. Its main drawback remains the slow, sequential inference process dictated by its autoregressive nature.
Transformer TTS
Inspired by the success of the Transformer architecture in natural language processing and subsequently in ASR, researchers adapted it for TTS. Transformer TTS models replace the recurrent layers (LSTMs/GRUs) in both the encoder and decoder with multi-headed self-attention and feed-forward layers.
Architecture Overview
- Encoder: Similar in function to Tacotron's encoder, it maps the input text sequence to a sequence of contextual representations. It consists of a stack of Transformer blocks. Each block contains a multi-headed self-attention layer (allowing each input token to attend to all other input tokens) followed by position-wise feed-forward networks. Positional encodings are added to the input embeddings to provide the model with information about the order of the input tokens, as the self-attention mechanism itself is permutation-invariant.
- Decoder: Also composed of a stack of Transformer blocks, but with a modification to ensure autoregressive behavior. Each decoder block includes (see the sketch after this list):
- A masked multi-headed self-attention layer: This attends to the previously generated acoustic frames. The masking ensures that the prediction for frame t can only depend on frames 1 to t−1, preserving causality.
- A multi-headed cross-attention layer: This attends to the encoder outputs, similar to the attention mechanism in Tacotron, allowing the decoder to incorporate information from the input text.
- Position-wise feed-forward networks.
Like the encoder, the decoder also uses positional encodings for the input acoustic frames. The output of the final decoder layer is typically passed through a linear layer to predict the mel-spectrogram frame and another linear layer for the stop token probability.
- Post-Net: Often included, similar to Tacotron 2, to refine the output spectrogram.
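A condensed PyTorch sketch of one such decoder block is given below; it uses `nn.MultiheadAttention` with a boolean causal mask, omits dropout, positional encodings, and the final mel/stop projections, and the layer sizes are just common Transformer defaults rather than any particular published configuration.

```python
import torch
import torch.nn as nn

class TTSDecoderBlock(nn.Module):
    """One Transformer-TTS style decoder block: masked self-attention over the
    acoustic frames generated so far, cross-attention over the encoder outputs,
    then a position-wise feed-forward network (dropout omitted for brevity)."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, mel_states, encoder_outputs):
        # mel_states: (B, T, d_model)   encoder_outputs: (B, N, d_model)
        T = mel_states.size(1)
        # Causal mask: True above the diagonal, i.e. position t may not attend to t+1, t+2, ...
        causal = torch.triu(torch.ones(T, T, device=mel_states.device), diagonal=1).bool()
        x, _ = self.self_attn(mel_states, mel_states, mel_states, attn_mask=causal)
        x = self.norm1(x + mel_states)
        # Cross-attention: queries come from the decoder, keys/values from the encoder.
        y, _ = self.cross_attn(x, encoder_outputs, encoder_outputs)
        y = self.norm2(y + x)
        return self.norm3(self.ff(y) + y)
```

During training, all T positions are processed in parallel under the mask; during inference, the block is still applied step by step as the frame sequence grows.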
Comparison with Tacotron
- Parallelization: While inference remains sequential, the Transformer's non-recurrent nature allows for significantly more parallel computation during training compared to RNNs, potentially leading to faster training times on suitable hardware (like GPUs/TPUs).
- Long-Range Dependencies: Self-attention mechanisms can theoretically model long-range dependencies more effectively than RNNs, as the path length between any two positions is constant (O(1)) rather than depending on the sequence length (O(N)). This can be advantageous for generating speech with complex prosodic patterns tied to overall sentence structure.
- Training Stability: Training Transformers for TTS can sometimes require more careful hyperparameter tuning (e.g., learning rate schedules like Noam decay with warm-up steps) compared to RNN-based models. Getting the cross-attention to converge to a stable, monotonic alignment, especially for long sequences, can also require extra care.
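For reference, the Noam schedule mentioned above is straightforward to write down; the model dimension, warm-up length, and scale factor used here are typical illustrative values.

```python
def noam_lr(step, d_model=512, warmup_steps=4000, scale=1.0):
    """Noam schedule: linear warm-up followed by inverse-square-root decay."""
    step = max(step, 1)  # avoid division by zero at step 0
    return scale * d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```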
Common Considerations
- Input Representation: Both Tacotron and Transformer TTS can work with characters or phonemes. Phonemes often lead to better pronunciation accuracy, especially for out-of-vocabulary words, but require an external grapheme-to-phoneme (G2P) conversion step. Character-based models learn pronunciation implicitly but may struggle with heteronyms or complex names.
- Output Target: Mel-spectrograms are the standard intermediate representation. They are perceptually relevant, smoother, and easier to predict with L1/L2 losses than raw audio waveforms. These mel-spectrograms are then passed to a separate neural vocoder (covered in Chapter 5) to synthesize the final waveform; a short extraction sketch follows this list.
- Attention Alignment: Visualizing the attention weights during inference is a common diagnostic technique. A well-behaved attention mechanism should show a roughly diagonal alignment, indicating that the model is progressing through the input text as it generates the output speech frames. Failures often manifest as jerky, non-monotonic alignments, leading to repeated or skipped words.
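As a concrete example of the output target, a log-mel-spectrogram can be computed with librosa along the following lines; the FFT size, hop length, 80 mel bands, and log floor are typical Tacotron-style settings rather than a universal standard.

```python
import librosa
import numpy as np

def extract_mel(wav_path, sr=22050, n_fft=1024, hop_length=256, n_mels=80, fmax=8000):
    """Compute a log-mel-spectrogram training target from a waveform file."""
    y, _ = librosa.load(wav_path, sr=sr)                  # resample to the target rate
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels, fmax=fmax)
    # Log compression with a small floor to avoid log(0).
    return np.log(np.clip(mel, a_min=1e-5, a_max=None))  # shape: (n_mels, T)
```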
Autoregressive models like Tacotron and Transformer TTS represent a significant advancement in generating natural-sounding speech directly from text. Their ability to capture detailed acoustic variations comes at the cost of sequential inference. This limitation motivates the development of non-autoregressive models, which we will explore in the next section.