Moving beyond autoregressive models like Tacotron and non-autoregressive approaches like FastSpeech, flow-based generative models offer another powerful framework for acoustic modeling in Text-to-Speech (TTS). Instead of generating outputs sequentially or relying solely on auxiliary losses for parallel generation, flow-based models learn an exact, invertible transformation from a simple base distribution (like a Gaussian) to the complex data distribution of mel-spectrograms, conditioned on the input text.
At their core, normalizing flows construct a complex distribution by applying a sequence of invertible and differentiable transformations, composed into a single map $f$, to samples $z$ drawn from a simple base distribution $p_Z(z)$, typically a standard multivariate Gaussian. The goal is to model the target data distribution $p_X(x)$, where $x$ represents our data (e.g., mel-spectrograms).
The transformation $x = f(z)$ allows us to compute the exact likelihood of a data point $x$ using the change of variables formula from probability theory. If $z = f^{-1}(x)$, the probability density function of $x$ is given by:

$$p_X(x) = p_Z\left(f^{-1}(x)\right)\,\left|\det \frac{\partial f^{-1}(x)}{\partial x}\right|$$

Equivalently, using the inverse function theorem:

$$p_X(x) = p_Z(z)\,\left|\det \frac{\partial z}{\partial x}\right| = p_Z(z)\,\left|\det \frac{\partial f(z)}{\partial z}\right|^{-1}$$

Here, $\frac{\partial f^{-1}(x)}{\partial x}$ (or $\frac{\partial f(z)}{\partial z}$) is the Jacobian matrix of the inverse transformation (or of the forward transformation), and $|\det(\cdot)|$ denotes the absolute value of its determinant.
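To make the formula concrete, here is a small numerical check (a toy PyTorch example, not a TTS model, with an arbitrary scale $a$ and shift $b$): for the scalar affine map $x = az + b$ with $z \sim \mathcal{N}(0, 1)$, the change of variables reproduces exactly the density of $\mathcal{N}(b, a^2)$.

```python
import torch
from torch.distributions import Normal

# Toy check of the change of variables formula with f(z) = a*z + b.
a, b = 2.0, 0.5
base = Normal(0.0, 1.0)                      # p_Z: standard Gaussian prior

x = torch.tensor([1.7])                      # an arbitrary "data point"
z = (x - b) / a                              # z = f^{-1}(x)

# log p_X(x) = log p_Z(z) - log|det df/dz|; for this scalar map, |det df/dz| = |a|.
log_px = base.log_prob(z) - torch.log(torch.tensor(abs(a)))

# Cross-check against the closed form: x = a*z + b implies x ~ N(b, a^2).
print(log_px.item())                         # approx. -1.7921
print(Normal(b, abs(a)).log_prob(x).item())  # same value
```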
The critical insight is that if we design the transformation $f$ such that:

- its inverse $f^{-1}$ is straightforward to compute, and
- the determinant of its Jacobian is cheap to evaluate (for example, because the Jacobian is triangular by construction),

then we can efficiently calculate the exact likelihood $p_X(x)$ for any given data point $x$. Training is typically done by maximizing the log-likelihood of the training data:
$$\log p_X(x) = \log p_Z(z) - \log \left|\det \frac{\partial f(z)}{\partial z}\right|, \quad \text{where } z = f^{-1}(x).$$
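As a sketch of what this objective looks like in code, the function below computes the negative log-likelihood from the latents and the accumulated log-determinant returned by a flow; the argument names are illustrative rather than taken from any particular library.

```python
import torch

def nll_loss(z: torch.Tensor, log_det_jacobian: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood under a standard Gaussian prior.

    z:                 latents f^{-1}(x), shape (batch, dim)
    log_det_jacobian:  accumulated log|det dz/dx| over all flow steps, shape (batch,)
    """
    # log p_Z(z) for a standard multivariate Gaussian.
    log_pz = -0.5 * (z ** 2 + torch.log(torch.tensor(2.0 * torch.pi))).sum(dim=-1)
    # Change of variables in the data-to-latent direction:
    # log p_X(x) = log p_Z(z) + log|det dz/dx|  (= log p_Z(z) - log|det dx/dz|).
    log_px = log_pz + log_det_jacobian
    return -log_px.mean()                    # minimizing NLL maximizes likelihood
```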
In the context of TTS, we want to model the conditional distribution of mel-spectrograms $x$ given some linguistic features or text embeddings $c$. So, the flow $f$ becomes conditional: $x = f(z; c)$. The base distribution $p_Z(z)$ remains a simple prior, like $\mathcal{N}(0, I)$.
*Figure: Overview of a conditional flow-based model for TTS. Inference transforms noise into data, while training transforms data into noise and maximizes likelihood. Conditioning guides the transformation.*
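A minimal sketch of how the conditioning can enter the flow is an affine coupling layer whose scale-and-shift network also sees $c$. Everything below (layer sizes, names, the single hidden layer) is illustrative rather than taken from a specific published model, and it assumes $c$ has already been expanded to the mel time axis.

```python
import torch
import torch.nn as nn

class ConditionalAffineCoupling(nn.Module):
    """Affine coupling layer conditioned on text features c (illustrative sketch)."""
    def __init__(self, mel_dim: int, cond_dim: int, hidden: int = 256):
        super().__init__()
        assert mel_dim % 2 == 0, "this sketch assumes an even feature dimension"
        self.net = nn.Sequential(
            nn.Linear(mel_dim // 2 + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, mel_dim),        # predicts log-scale and shift
        )

    def forward(self, x, c):                   # training direction: mel -> latent
        xa, xb = x.chunk(2, dim=-1)            # transform xb conditioned on (xa, c)
        log_s, t = self.net(torch.cat([xa, c], dim=-1)).chunk(2, dim=-1)
        log_s = torch.tanh(log_s)              # keep scales numerically well behaved
        zb = xb * torch.exp(log_s) + t
        log_det = log_s.sum(dim=-1)            # triangular Jacobian: sum of log-scales
        return torch.cat([xa, zb], dim=-1), log_det

    def inverse(self, z, c):                   # synthesis direction: latent -> mel
        za, zb = z.chunk(2, dim=-1)
        log_s, t = self.net(torch.cat([za, c], dim=-1)).chunk(2, dim=-1)
        log_s = torch.tanh(log_s)
        xb = (zb - t) * torch.exp(-log_s)
        return torch.cat([za, xb], dim=-1)
```

In practice, several such layers are stacked, with permutations or invertible 1x1 convolutions between them so that every dimension is eventually transformed.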
Several specific architectures have successfully applied normalizing flows to TTS acoustic modeling:
Glow-TTS: Inspired by the Glow model for image generation, Glow-TTS adapts the flow architecture (activation normalization, invertible 1x1 convolutions, and affine coupling layers) for mel-spectrogram generation. To handle the variable-length alignment between text input and speech output, it searches for the most probable monotonic alignment between text encodings and speech latents during training (Monotonic Alignment Search) and trains a duration predictor so that inference needs no external aligner or attention. This alignment strategy is a significant part of making flows work for sequence-to-sequence tasks like TTS.
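The invertible 1x1 convolution can be thought of as a learned channel-mixing matrix whose log-determinant is cheap to track. The version below is a simplified sketch of that idea, operating on (batch, time, channels) tensors, not the Glow-TTS implementation itself.

```python
import torch
import torch.nn as nn

class Invertible1x1Conv(nn.Module):
    """Glow-style channel mixing as a learned square matrix W (simplified sketch)."""
    def __init__(self, channels: int):
        super().__init__()
        # Initialize W as a random orthogonal matrix so it starts out invertible.
        w_init, _ = torch.linalg.qr(torch.randn(channels, channels))
        self.W = nn.Parameter(w_init)

    def forward(self, x):                      # x: (batch, time, channels)
        z = x @ self.W                         # mix channels at every time step
        # Each of the T time steps contributes log|det W| to the log-determinant.
        log_det = x.shape[1] * torch.slogdet(self.W)[1]
        return z, log_det

    def inverse(self, z):
        return z @ torch.inverse(self.W)       # undo the mixing exactly
```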
Flowtron: This model integrates elements from Tacotron (like location-sensitive attention for alignment) directly into a flow-based decoder architecture (specifically, using affine coupling layers similar to Glow). Flowtron explicitly maps variations in the latent code z to variations in speech prosody and style, offering control over the synthesized output. By manipulating parts of z, one can influence pitch, rhythm, and other expressive aspects of the generated speech.
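As a rough illustration of this kind of control, a common trick is to scale the standard deviation of the sampled latent (a sampling "temperature") before running the flow in reverse; `flow`, `text_encoder`, and the 80 mel channels below are placeholders rather than Flowtron's actual interface.

```python
import torch

def synthesize(flow, text_encoder, text_ids, temperature=0.7, mel_dim=80):
    """Hypothetical synthesis helper: lower temperature -> less variable prosody."""
    # Conditioning features, assumed already expanded to the mel time axis: (1, T, cond_dim).
    c = text_encoder(text_ids)
    # Sample the latent from N(0, temperature^2 * I) instead of N(0, I).
    z = temperature * torch.randn(1, c.shape[1], mel_dim)
    mel = flow.inverse(z, c)                   # run the flow latent -> mel
    return mel
```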
These models typically consist of:

- a text encoder that turns phonemes or characters into the conditioning features $c$,
- an alignment or duration mechanism that expands those features to the length of the mel-spectrogram, and
- an invertible flow decoder (stacks of coupling layers, channel-mixing layers such as invertible 1x1 convolutions, and normalization layers) that maps between mel-spectrograms and latents in both directions.
Flow-based models bring unique properties to TTS:
Advantages:

- Exact likelihood training: the objective is the true log-likelihood, giving stable, principled optimization without adversarial or surrogate losses.
- Parallel inference: all mel-spectrogram frames are generated in one pass through the inverse flow, so synthesis is fast.
- Invertibility: the same network maps data to latents and back, enabling latent-space manipulation for prosody, style, and variability control (e.g., sampling temperature).

Considerations:

- The invertibility constraint restricts architectural choices, so reaching high quality can require deep, complex flow stacks.
- The latent must have the same dimensionality as the data, which tends to make these models comparatively heavy in memory and compute.
Compared to autoregressive models, flows offer much faster synthesis but might require more complex architectures for high quality. Compared to other non-autoregressive models (like FastSpeech), flows provide the benefit of exact likelihood training but can be computationally heavier. They generally offer more stable training than GANs while achieving competitive quality.
Flow-based models represent a compelling direction in generative modeling for speech synthesis, balancing parallel inference speed with principled likelihood-based training and offering avenues for expressive control. Understanding their mechanism provides valuable insight into the diverse techniques available for building advanced TTS systems.