Flow-based generative models offer a powerful framework for acoustic modeling in Text-to-Speech (TTS). These models present an alternative to autoregressive models like Tacotron and non-autoregressive approaches such as FastSpeech. Unlike approaches that generate outputs sequentially or rely solely on auxiliary losses for parallel generation, flow-based models learn an exact, invertible transformation from a simple base distribution (like a Gaussian) to the complex data distribution of mel-spectrograms, conditioned on the input text.
The Essence of Normalizing Flows
At their core, normalizing flows construct a complex distribution by applying a sequence of invertible and differentiable transformations $f$ to samples $z$ drawn from a simple base distribution $p_Z(z)$, typically a standard multivariate Gaussian. The goal is to model the target data distribution $p_X(x)$, where $x$ represents our data (e.g., mel-spectrograms).
The transformation $x = f(z)$ allows us to compute the exact likelihood of a data point $x$ using the change of variables formula from probability theory. If $z = f^{-1}(x)$, the probability density function of $x$ is given by:
$$p_X(x) = p_Z\big(f^{-1}(x)\big)\,\left|\det\!\left(\frac{\partial f^{-1}(x)}{\partial x}\right)\right|$$
Equivalently, using the inverse function theorem:
$$p_X(x) = p_Z(z)\,\left|\det\!\left(\frac{\partial z}{\partial x}\right)\right| = p_Z(z)\,\left|\det\!\left(\frac{\partial f(z)}{\partial z}\right)\right|^{-1}$$
Here, $\frac{\partial f^{-1}(x)}{\partial x}$ is the Jacobian matrix of the inverse transformation, $\frac{\partial f(z)}{\partial z}$ is the Jacobian of $f$ itself, and $\det(\cdot)$ denotes the determinant.
The critical insight is that if we design the transformation f such that:
It is easily invertible (i.e., calculating $z = f^{-1}(x)$ is efficient).
The determinant of its Jacobian is easy to compute.
Then, we can efficiently calculate the exact likelihood $p_X(x)$ for any given data point $x$. Training is typically done by maximizing the log-likelihood of the training data:
$$\log p_X(x) = \log p_Z(z) - \log\left|\det\!\left(\frac{\partial f(z)}{\partial z}\right)\right|$$
where $z = f^{-1}(x)$.
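To make this computation concrete, the following is a minimal sketch in PyTorch using a toy element-wise affine flow. The parameter names `s` and `t`, the 80-bin dimensionality, and the random initialization are illustrative assumptions, not part of any particular TTS model.

```python
import math
import torch

torch.manual_seed(0)
dim = 80                      # e.g. the number of mel bins
s = torch.randn(dim) * 0.1    # log-scale parameters of the toy flow
t = torch.randn(dim)          # shift parameters

def forward_flow(z):
    # x = f(z): maps base samples to "data" space
    return z * torch.exp(s) + t

def inverse_flow(x):
    # z = f^{-1}(x); for this element-wise map, log|det(df^{-1}/dx)| = -sum(s)
    z = (x - t) * torch.exp(-s)
    log_det_inv = -s.sum()
    return z, log_det_inv

def log_likelihood(x):
    # log p_X(x) = log p_Z(f^{-1}(x)) + log|det(df^{-1}/dx)|, standard Gaussian prior
    z, log_det_inv = inverse_flow(x)
    log_pz = -0.5 * (z ** 2 + math.log(2 * math.pi)).sum(dim=-1)
    return log_pz + log_det_inv

x = forward_flow(torch.randn(4, dim))  # a batch of samples "generated" by the flow
print(log_likelihood(x))               # exact log-densities, one per sample
```

Because both the inverse and the log-determinant are cheap for this element-wise map, the exact density is available in closed form; deep flows stack many such invertible layers while preserving these two properties.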
Applying Flows to Acoustic Modeling
In the context of TTS, we want to model the conditional distribution of mel-spectrograms $x$ given linguistic features or text embeddings $c$. The flow $f$ therefore becomes conditional: $x = f(z; c)$. The base distribution $p_Z(z)$ remains a simple prior, such as $\mathcal{N}(0, I)$.
Training: During training, we take a ground truth mel-spectrogram $x$ and its corresponding condition $c$. We compute $z = f^{-1}(x; c)$ and maximize the conditional log-likelihood $\log p_X(x \mid c)$ using the change of variables formula. This directly optimizes the model to assign high probability to realistic mappings from text to speech features.
Inference (Synthesis): To synthesize a mel-spectrogram for a new condition $c$, we first sample a latent variable $z$ from the prior $p_Z(z)$. Then, we compute the corresponding mel-spectrogram $x$ using the forward transformation $x = f(z; c)$. Since the forward pass is typically designed to be efficient and parallel (often involving operations like convolutions and affine transformations), synthesis can be very fast, similar to other non-autoregressive models.
Overview of a conditional flow-based model for TTS. Inference transforms noise into data, while training transforms data into noise and maximizes likelihood. Conditioning guides the transformation.
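A minimal sketch of these two directions is given below. It assumes a hypothetical `flow` module exposing `inverse(x, c)` (returning the latent and the log-determinant of the inverse Jacobian) and `forward(z, c)`; the tensor shapes and the temperature value are illustrative choices, not prescribed by any specific model.

```python
import math
import torch

def training_step(flow, mel, cond, optimizer):
    # Maximize log p_X(mel | cond) = log p_Z(z) + log|det(df^{-1}/dx)|
    z, logdet_inv = flow.inverse(mel, cond)     # data -> noise (hypothetical API)
    log_pz = -0.5 * (z ** 2 + math.log(2 * math.pi)).sum(dim=(1, 2))
    nll = -(log_pz + logdet_inv).mean()         # negative conditional log-likelihood
    optimizer.zero_grad()
    nll.backward()
    optimizer.step()
    return nll.item()

@torch.no_grad()
def synthesize(flow, cond, n_frames, n_mels=80, temperature=0.667):
    # Sample z from a (tempered) Gaussian prior and run the parallel forward pass
    z = torch.randn(1, n_mels, n_frames) * temperature   # noise -> data
    return flow.forward(z, cond)
```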
Prominent Flow-Based TTS Architectures
Several specific architectures have successfully applied normalizing flows to TTS acoustic modeling:
Glow-TTS: Inspired by the Glow model for image generation, Glow-TTS adapts the flow architecture (affine coupling layers, invertible 1x1 convolutions) to mel-spectrogram generation. To handle the variable-length alignment between text input and speech output, it searches for the most probable monotonic alignment during training and relies on a duration predictor at inference. This alignment mechanism is a significant part of making flows work for sequence-to-sequence tasks like TTS.
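As an illustration of one of these invertible building blocks, here is a minimal sketch of an invertible 1x1 convolution over channels. The (batch, channels, time) layout and the orthogonal initialization are assumptions for the example, not details of any specific released implementation.

```python
import torch
import torch.nn as nn

class Invertible1x1Conv(nn.Module):
    """Channel-mixing layer with a cheap inverse and log-determinant."""

    def __init__(self, channels):
        super().__init__()
        # Random orthogonal initialization keeps the matrix invertible at the start
        w = torch.linalg.qr(torch.randn(channels, channels))[0]
        self.weight = nn.Parameter(w)

    def forward(self, x):                 # x: (batch, channels, time)
        _, _, time = x.shape
        y = torch.einsum("ij,bjt->bit", self.weight, x)
        # The same channel mixing is applied at every frame, so
        # log|det J| = time * log|det W|
        logdet = time * torch.slogdet(self.weight)[1]
        return y, logdet

    def inverse(self, y):
        w_inv = torch.inverse(self.weight)
        return torch.einsum("ij,bjt->bit", w_inv, y)
```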
Flowtron: This model integrates elements from Tacotron (like location-sensitive attention for alignment) directly into a flow-based decoder architecture (specifically, using affine coupling layers similar to Glow). Flowtron explicitly maps variations in the latent code z to variations in speech prosody and style, offering control over the synthesized output. By manipulating parts of z, one can influence pitch, rhythm, and other expressive aspects of the generated speech.
These models typically consist of:
A Text Encoder (e.g., Transformer or LSTM layers) to convert input text/phonemes into hidden representations.
An Alignment Mechanism (e.g., attention-based or duration predictor) to align text representations with the mel-spectrogram frame rate.
A Conditional Flow Decoder that transforms samples $z$ from the prior distribution into mel-spectrogram frames $x$, conditioned on the aligned text representations. This decoder is built from layers designed to be invertible with tractable Jacobians; a sketch of one such layer follows this list.
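The affine coupling layer is the workhorse of such decoders. The sketch below is a simplified, hypothetical version conditioned on aligned text features; the hidden size, the small convolutional parameter network, and the (batch, channels, time) layout are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Conditional affine coupling: transforms half the channels given the other half."""

    def __init__(self, channels, cond_channels, hidden=256):
        super().__init__()
        half = channels // 2
        # Small parameter network predicting a log-scale and shift for x_b
        self.net = nn.Sequential(
            nn.Conv1d(half + cond_channels, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, 2 * half, kernel_size=3, padding=1),
        )

    def forward(self, x, cond):           # x, cond: (batch, channels, time)
        xa, xb = x.chunk(2, dim=1)                          # split channels in half
        log_s, t = self.net(torch.cat([xa, cond], dim=1)).chunk(2, dim=1)
        yb = xb * torch.exp(log_s) + t                      # affine transform of x_b
        logdet = log_s.sum(dim=(1, 2))                      # tractable Jacobian term
        return torch.cat([xa, yb], dim=1), logdet

    def inverse(self, y, cond):
        ya, yb = y.chunk(2, dim=1)
        log_s, t = self.net(torch.cat([ya, cond], dim=1)).chunk(2, dim=1)
        xb = (yb - t) * torch.exp(-log_s)                   # exact inverse
        return torch.cat([ya, xb], dim=1)
```

Because the untransformed half $x_a$ passes through unchanged, the layer is trivially invertible and its Jacobian is triangular, so the log-determinant reduces to a simple sum over the predicted log-scales.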
Advantages and Considerations
Flow-based models bring unique properties to TTS:
Advantages:
Exact Log-Likelihood Training: Training is based on maximizing the exact log-likelihood of the data, providing a stable and well-understood optimization objective, unlike the adversarial training of GANs.
Fast Parallel Synthesis: Like other non-autoregressive methods, inference is parallel and fast once the latent variable z is sampled. The entire mel-spectrogram can often be generated in a single forward pass through the flow network.
Latent Space Control: The explicit mapping from a structured latent space Z to the data space X offers potential for controllable synthesis. Manipulating z can influence attributes like speaker identity, prosody, or emotion, particularly in models like Flowtron designed for this purpose.
High-Quality Output: Flow-based models have demonstrated the ability to generate high-fidelity mel-spectrograms, leading to natural-sounding synthesized speech when combined with a good neural vocoder.
Considerations:
Model Complexity and Size: Flow networks often need to be deep to model complex data distributions effectively, which can lead to large model sizes and significant computational requirements during training.
Architectural Constraints: The need for invertibility and tractable Jacobian determinants restricts the design choices for the network layers (e.g., affine coupling, invertible convolutions). This might limit the representational power compared to less constrained architectures.
Alignment Handling: Effectively aligning the input text sequence with the output mel-spectrogram sequence within the flow framework requires careful design, often involving separate attention mechanisms or duration predictors, adding complexity.
Compared to autoregressive models, flows offer much faster synthesis but might require more complex architectures for high quality. Compared to other non-autoregressive models (like FastSpeech), flows provide the benefit of exact likelihood training but can be computationally heavier. They generally offer more stable training than GANs while achieving competitive quality.
Flow-based models represent a compelling direction in generative modeling for speech synthesis, balancing parallel inference speed with principled likelihood-based training and offering avenues for expressive control. Understanding their mechanism provides valuable insight into the diverse techniques available for building advanced TTS systems.
Density estimation using Real NVP. Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio, 2017. International Conference on Learning Representations (ICLR). DOI: 10.48550/arXiv.1605.08803 - Introduces Real NVP and the affine coupling layer, a core building block of many subsequent flow-based generative models.
Glow: Generative Flow with Invertible 1x1 Convolutions. Diederik P. Kingma and Prafulla Dhariwal, 2018. Advances in Neural Information Processing Systems (NeurIPS) 31. DOI: 10.48550/arXiv.1807.03039 - Presents the Glow architecture, a significant advancement in normalizing flows for image generation that directly inspired Glow-TTS through its invertible 1x1 convolutions and affine coupling layers.