While autoregressive models like Tacotron 2 produce high-quality speech by generating acoustic features one frame at a time, this sequential dependency creates an inherent bottleneck during inference. Generating a few seconds of audio can require hundreds or thousands of sequential decoder steps, making real-time synthesis challenging, especially on resource-constrained devices. Non-autoregressive (NAR) models tackle this limitation head-on by aiming to generate all acoustic feature frames in parallel.
The core idea behind NAR TTS is compelling: predict the entire mel-spectrogram sequence (or significant chunks) simultaneously given the input text representation. This drastically reduces the number of sequential operations required during synthesis, leading to substantial speedups. However, this parallelism introduces its own set of challenges, primarily centered around the inherent "one-to-many" mapping problem between text and speech.
The One-to-Many Mapping Problem in NAR TTS
Consider a single phoneme in the input sequence. Its acoustic realization, specifically its duration and precise spectral characteristics, can vary significantly based on context (surrounding phonemes), speaking rate, emphasis, and overall prosody. Autoregressive models handle this implicitly; the prediction of the current frame is conditioned on the previously generated frames, allowing the model to learn these complex temporal dependencies.
NAR models, lacking this sequential conditioning during parallel generation, struggle with this ambiguity. If a model simply tries to predict an "average" representation for each phoneme, the resulting speech often sounds muffled, lacks prosodic variation, and may exhibit unnatural timing or rhythm. Early attempts at NAR TTS often suffered from these quality limitations compared to their autoregressive counterparts. The central difficulty lies in determining the correct duration for each input token (phoneme or character) and generating diverse acoustic features without explicit step-by-step guidance.
Explicit Duration Modeling: The Key Enabler
A significant breakthrough in NAR TTS involved explicitly modeling the duration of each input unit. Instead of letting the model implicitly figure out how long each phoneme should last, a dedicated component predicts this duration, which is then used to align the input representation with the output acoustic feature sequence.
The typical workflow involves these steps:
- Encoder: The input text sequence (often phonemes) is processed by an encoder (commonly a Transformer or CNN-based network) to obtain hidden state representations for each input token.
- Duration Predictor: A separate module takes the encoder outputs and predicts a duration (number of mel-spectrogram frames) for each input token. This predictor is trained to estimate these durations accurately.
- Length Regulator / Upsampling: The encoder outputs are expanded based on the predicted durations. For instance, if the phoneme /ae/ is predicted to have a duration of 5 frames, its corresponding hidden state vector from the encoder is repeated 5 times. This creates an expanded hidden sequence whose length matches the target mel-spectrogram length (a minimal sketch of this operation follows the list).
- Decoder: The expanded hidden sequence is fed into a decoder (again, often Transformer or CNN-based) which generates the final mel-spectrogram frames in parallel.
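To make the length regulator concrete, here is a minimal single-utterance sketch in PyTorch; batching, padding, and masking are omitted, and the function name `length_regulate` is illustrative.

```python
import torch

def length_regulate(encoder_out: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    """Expand per-token encoder states according to predicted durations.

    encoder_out: (num_tokens, hidden_dim) hidden states, one per phoneme.
    durations:   (num_tokens,) integer frame counts per phoneme.
    Returns:     (sum(durations), hidden_dim) frame-aligned hidden sequence.
    """
    # Row i of encoder_out is repeated durations[i] times, so a phoneme
    # with a predicted duration of 5 contributes 5 identical rows.
    return torch.repeat_interleave(encoder_out, durations, dim=0)

# Toy example: 3 phonemes with durations 2, 5, 3 -> a 10-frame hidden sequence.
hidden = torch.randn(3, 256)
durations = torch.tensor([2, 5, 3])
print(length_regulate(hidden, durations).shape)  # torch.Size([10, 256])
```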
Crucially, the duration predictor needs supervision during training. How are these target durations obtained?
- Teacher-Student Training: An effective approach uses a pre-trained, high-quality autoregressive model (the "teacher," e.g., Tacotron 2 or Transformer TTS) trained on the same data. The teacher's encoder-decoder attention alignments between input phonemes and output spectrogram frames are used to count how many frames correspond to each phoneme, providing the target durations; the teacher-generated spectrograms can also serve as distillation targets.
- External Aligners: Alternatively, traditional HMM-based alignment tools (like the Montreal Forced Aligner) or purpose-built neural aligners can be used with the ground truth audio to extract durations (a sketch of converting such an alignment into duration targets follows this list).
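Either way, the end product is a frame count per input token. Assuming the alignment is available as a phoneme index per mel frame (for example, the argmax over a teacher's attention matrix or an aligner's frame labels), a small sketch of deriving duration targets might look like this; the function name is illustrative.

```python
import torch

def alignment_to_durations(frame_to_token: torch.Tensor, num_tokens: int) -> torch.Tensor:
    """Count how many mel frames are assigned to each input token.

    frame_to_token: (num_frames,) index of the phoneme each frame aligns to.
    num_tokens:     number of phonemes in the input sequence.
    Returns:        (num_tokens,) integer durations that sum to num_frames.
    """
    return torch.bincount(frame_to_token, minlength=num_tokens)

# Example: 8 frames aligned to 3 phonemes -> durations [3, 2, 3].
frame_to_token = torch.tensor([0, 0, 0, 1, 1, 2, 2, 2])
print(alignment_to_durations(frame_to_token, num_tokens=3))  # tensor([3, 2, 3])
```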
FastSpeech and FastSpeech 2: Milestones in NAR TTS
The FastSpeech family of models represents a significant advancement in practical and high-quality non-autoregressive synthesis.
FastSpeech
FastSpeech was one of the first models to successfully employ the duration predictor and length regulator mechanism described above, demonstrating substantial inference speedups (hundreds of times faster than comparable autoregressive models) while maintaining reasonable quality.
Its architecture typically consists of:
- An encoder built from FFT (Feed-Forward Transformer) blocks, which combine self-attention with 1D convolutions in place of the usual position-wise feed-forward layers.
- A Duration Predictor (often a small stack of 1D convolutions with ReLU, LayerNorm, and dropout) trained with Mean Squared Error (MSE) loss, typically in the log domain, against target durations derived from a teacher model (a sketch of such a predictor follows this list).
- A Length Regulator that performs the upsampling based on predicted durations.
- An FFT Block based decoder that generates the mel-spectrogram from the expanded sequence.
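A minimal PyTorch sketch of such a duration predictor is shown below. The class name, layer sizes, and the log-domain MSE target are illustrative assumptions rather than the exact published configuration.

```python
import torch
import torch.nn as nn

class DurationPredictor(nn.Module):
    """Sketch of a FastSpeech-style duration predictor: two 1D convolutions
    with ReLU, LayerNorm, and dropout, followed by a linear layer that
    outputs one (log-)duration per input token."""

    def __init__(self, hidden_dim=256, filter_size=256, kernel_size=3, dropout=0.1):
        super().__init__()
        self.conv1 = nn.Conv1d(hidden_dim, filter_size, kernel_size, padding=kernel_size // 2)
        self.conv2 = nn.Conv1d(filter_size, filter_size, kernel_size, padding=kernel_size // 2)
        self.norm1 = nn.LayerNorm(filter_size)
        self.norm2 = nn.LayerNorm(filter_size)
        self.dropout = nn.Dropout(dropout)
        self.proj = nn.Linear(filter_size, 1)

    def forward(self, encoder_out):
        # encoder_out: (batch, num_tokens, hidden_dim)
        x = torch.relu(self.conv1(encoder_out.transpose(1, 2))).transpose(1, 2)
        x = self.dropout(self.norm1(x))
        x = torch.relu(self.conv2(x.transpose(1, 2))).transpose(1, 2)
        x = self.dropout(self.norm2(x))
        return self.proj(x).squeeze(-1)  # (batch, num_tokens) predicted log-durations

# Training sketch: MSE against log-scaled target frame counts.
# loss = torch.nn.functional.mse_loss(predictor(encoder_out),
#                                     torch.log(target_durations.float() + 1.0))
```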
High-level structure of FastSpeech. Input phonemes are encoded, durations are predicted and used by the Length Regulator to expand the encoder states, and the decoder generates the mel-spectrogram in parallel. Training relies on alignments and spectrograms from a teacher model.
While effective, FastSpeech's reliance on a teacher model for alignments ("knowledge distillation") meant the quality was inherently limited by the teacher, and the two-stage training process could be cumbersome. Duration prediction errors could also lead to audible artifacts.
FastSpeech 2
FastSpeech 2 addressed several limitations of its predecessor:
- Elimination of Teacher Model Dependency: It trains directly using ground truth audio features. Target durations are typically extracted using an external aligner or a purpose-built alignment mechanism during training. This simplifies the training pipeline and potentially allows surpassing the teacher model's quality.
- Variance Adaptor: To improve prosody and naturalness, FastSpeech 2 introduces a "Variance Adaptor" module. In addition to the Duration Predictor, it includes predictors for other acoustic variance information, such as pitch (F0 contour) and energy (frame-level spectrogram magnitude). Depending on the implementation, these are predicted per input token (and expanded using the predicted durations) or directly per frame; the values are then quantized, embedded, and added to the hidden sequence fed to the decoder. This allows for more expressive and less monotonous speech generation.
The Variance Adaptor provides finer-grained control over the output speech characteristics and helps mitigate the "averaging" effect often seen in earlier NAR models. Training these variance predictors typically uses MSE or L1 loss against ground truth values extracted from the training audio.
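As a rough illustration of this conditioning mechanism, the sketch below shows a simplified pitch branch of a variance-adaptor-style module: a predicted (or ground-truth) pitch value per position is bucketized, embedded, and added to the hidden sequence. The class name, the linear stand-in for the usual convolutional predictor, the F0 range, and the bin count are all assumptions for illustration; an energy branch would follow the same pattern.

```python
import torch
import torch.nn as nn

class PitchAdaptorSketch(nn.Module):
    """Illustrative pitch branch of a variance adaptor: predict a scalar
    pitch per position, bucketize it, and add its embedding to the hidden
    sequence that will be fed to the decoder."""

    def __init__(self, hidden_dim=256, n_bins=256, f0_min=0.0, f0_max=800.0):
        super().__init__()
        self.pitch_predictor = nn.Sequential(      # stand-in for a small conv stack
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 1)
        )
        # Fixed bin boundaries over a plausible F0 range in Hz (illustrative).
        self.register_buffer("pitch_bins", torch.linspace(f0_min, f0_max, n_bins - 1))
        self.pitch_embedding = nn.Embedding(n_bins, hidden_dim)

    def forward(self, hidden, pitch_target=None):
        # hidden: (batch, length, hidden_dim); length is tokens or frames.
        pitch_pred = self.pitch_predictor(hidden).squeeze(-1)   # (batch, length)
        # Use ground-truth pitch during training, the prediction at inference.
        pitch = pitch_target if pitch_target is not None else pitch_pred
        emb = self.pitch_embedding(torch.bucketize(pitch, self.pitch_bins))
        return hidden + emb, pitch_pred   # pitch_pred is supervised with MSE or L1
```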
Other Non-Autoregressive Approaches
While FastSpeech is prominent, other NAR architectures exist:
- ParaNet: An early fully convolutional NAR model that distills attention alignments from an autoregressive teacher and refines the text-to-spectrogram attention iteratively across its decoder layers.
- Glow-TTS: Leverages normalizing flows together with Monotonic Alignment Search to learn alignments internally (without an external aligner or teacher) and generate mel-spectrograms in parallel, offering an alternative, likelihood-based framework.
- Parallel Tacotron: Uses a VAE-based residual encoder together with duration prediction and lightweight convolutions (and, in Parallel Tacotron 2, differentiable duration modeling with learned upsampling) for parallel decoding.
- VITS (Variational Inference with adversarial learning for end-to-end Speech Synthesis): A notable model that integrates parallel acoustic feature generation and waveform synthesis within a single VAE framework using adversarial training and stochastic duration prediction. While VITS is end-to-end (text-to-waveform), its acoustic generation part shares principles with NAR models discussed here.
Advantages and Disadvantages of NAR Models
Advantages:
- Inference Speed: The primary benefit. Parallel generation enables orders-of-magnitude faster synthesis compared to AR models, making them ideal for low-latency applications.
- Robustness: Less susceptible to the repetition, skipping, and early-stopping errors that can plague attention-based autoregressive models during long-form synthesis, because the output length and alignment are fixed by the explicit durations rather than by learned attention and a stop token.
- Controllability: Explicit duration prediction allows direct manipulation of speech tempo, as sketched below. Variance predictors (as in FastSpeech 2) offer control over pitch and energy.
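For instance, assuming the duration predictor outputs log-durations, speaking rate can be adjusted at inference time simply by scaling the durations before they reach the length regulator; the function name and the log-domain assumption are illustrative.

```python
import torch

def durations_for_speed(pred_log_durations: torch.Tensor, speed: float = 1.0) -> torch.Tensor:
    """Convert predicted log-durations to integer frame counts, scaled by a
    speaking-rate factor (speed > 1.0 -> faster speech, shorter durations)."""
    frames = torch.exp(pred_log_durations) / speed
    # Round to whole frames and keep every token at least one frame long.
    return torch.clamp(torch.round(frames), min=1).long()

# Example: slow the utterance down by 20% before the length regulator.
log_d = torch.tensor([0.7, 1.6, 1.1])
print(durations_for_speed(log_d, speed=0.8))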
Disadvantages:
- Quality Dependence on Duration/Variance Prediction: The overall naturalness heavily relies on the accuracy of the duration, pitch, and energy predictors. Errors in these predictions can lead to unnatural rhythm, pitch contours, or volume variations. This remains an active area of research.
- Data Requirements: Accurate duration/variance targets are essential for training. This necessitates either reliable alignment tools or high-quality teacher models.
- Potential for Monotony (addressed by variance predictors): Without mechanisms like variance adaptors, basic NAR models can sometimes produce flatter, less expressive speech than the best AR models due to the difficulty in modeling complex prosodic dependencies in parallel.
Implementation Considerations
When implementing or training NAR TTS models:
- Alignment Quality: Ensure the method for obtaining target durations (teacher model alignment, external aligner) is accurate and robust. Misalignments are a common source of poor quality.
- Predictor Architectures: Experiment with different architectures (e.g., CNNs, Transformers) for the duration and variance predictors. Regularization (like dropout) is often important.
- Loss Functions: L1 or L2 (MSE) losses are standard for spectrograms and variance features. Consider appropriate weighting when combining multiple losses (see the sketch after this list).
- Input Representation: Phonemes generally yield better results than characters due to a more consistent mapping to pronunciation.
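Expanding on the loss-function point above, a combined training objective might weight the individual terms roughly as in the sketch below; the weights, dictionary keys, and function name are illustrative assumptions, not published settings.

```python
import torch.nn.functional as F

def combined_tts_loss(outputs, targets, w_mel=1.0, w_dur=1.0, w_pitch=0.1, w_energy=0.1):
    """Illustrative weighted sum of common NAR TTS losses: L1 on the
    mel-spectrogram plus MSE on the duration, pitch, and energy predictors."""
    mel_loss = F.l1_loss(outputs["mel"], targets["mel"])
    dur_loss = F.mse_loss(outputs["log_duration"], targets["log_duration"])
    pitch_loss = F.mse_loss(outputs["pitch"], targets["pitch"])
    energy_loss = F.mse_loss(outputs["energy"], targets["energy"])
    return (w_mel * mel_loss + w_dur * dur_loss
            + w_pitch * pitch_loss + w_energy * energy_loss)
```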
Non-autoregressive acoustic models represent a crucial development in TTS, trading the sequential dependency of autoregressive models for massive gains in inference speed. While early models faced quality challenges, techniques like explicit duration prediction and variance adaptation (as seen in FastSpeech 2) have significantly narrowed the gap, making NAR TTS a highly viable and often preferred approach for many real-world speech synthesis applications. The next chapter will explore how the generated mel-spectrograms, whether from AR or NAR models, are converted into audible waveforms using neural vocoders.