Previous chapters focused on generating intermediate acoustic representations, such as mel-spectrograms, from text using Text-to-Speech (TTS) models. These representations, however, are not sound that can be played back directly. To produce the final speech waveform, we need a component that converts acoustic features into a high-fidelity audio signal. This component is known as a vocoder.
Traditional vocoding methods, often based on signal processing techniques like Griffin-Lim, can synthesize intelligible speech but frequently suffer from artifacts and lack naturalness. This chapter concentrates on modern neural vocoders, which utilize deep learning to generate significantly higher-quality audio.
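To make that contrast concrete, the sketch below applies classical Griffin-Lim phase reconstruction to a mel-spectrogram with torchaudio. The mel input is a random placeholder standing in for acoustic-model output, and the settings (1024-point FFT, hop length 256, 22.05 kHz, 80 mel bins) are illustrative assumptions rather than values tied to any particular model.

```python
import torch
import torchaudio

# Placeholder mel-spectrogram (batch, n_mels, frames) standing in for the
# output of a TTS acoustic model.
mel = torch.rand(1, 80, 200)

# Griffin-Lim operates on linear-frequency magnitude spectrograms, so the mel
# representation is first mapped back to the linear frequency scale.
inverse_mel = torchaudio.transforms.InverseMelScale(
    n_stft=1024 // 2 + 1, n_mels=80, sample_rate=22050
)
griffin_lim = torchaudio.transforms.GriffinLim(n_fft=1024, hop_length=256)

linear_spec = inverse_mel(mel)       # (1, 513, 200)
waveform = griffin_lim(linear_spec)  # (1, num_samples)

torchaudio.save("griffin_lim_output.wav", waveform, 22050)
```

Even with an accurate spectrogram, this iterative phase estimation tends to produce the metallic, buzzy artifacts mentioned above, which is the main motivation for the neural vocoders covered in this chapter.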
You will examine several families of neural vocoders: autoregressive waveform models such as WaveNet and WaveRNN, flow-based models such as WaveGlow and FloWaveNet, GAN-based models such as MelGAN and HiFi-GAN, and diffusion-based approaches.
We will also cover how these models are conditioned on acoustic features and discuss methods for evaluating the quality of the generated audio. The chapter includes a hands-on section where you will use a pre-trained neural vocoder to synthesize audio.
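As a preview of the hands-on material in section 5.8, the following sketch runs a pre-trained HiFi-GAN vocoder from SpeechBrain on a placeholder mel-spectrogram. The model identifier and the decode_batch call follow SpeechBrain's published interface, but the import path differs between library versions, and the random mel input is only a stand-in for real acoustic-model output.

```python
import torch
import torchaudio
from speechbrain.inference.vocoders import HIFIGAN  # older releases: speechbrain.pretrained

# Download a HiFi-GAN vocoder trained on LJSpeech (80 mel bins, 22.05 kHz audio).
hifi_gan = HIFIGAN.from_hparams(source="speechbrain/tts-hifigan-ljspeech")

# Placeholder mel-spectrogram batch: (batch, n_mels, frames). In practice this
# tensor comes from an acoustic model such as Tacotron 2 or FastSpeech 2.
mel = torch.rand(1, 80, 200)

# Convert mel-spectrograms to waveforms; the result has shape (batch, 1, samples).
waveforms = hifi_gan.decode_batch(mel)

torchaudio.save("hifigan_output.wav", waveforms.squeeze(1), 22050)
```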
5.1 Limitations of Traditional Vocoders
5.2 Autoregressive Waveform Models (WaveNet, WaveRNN)
5.3 Flow-Based Vocoders (WaveGlow, FloWaveNet)
5.4 GAN-Based Vocoders (MelGAN, HiFi-GAN)
5.5 Diffusion Models for Vocoding
5.6 Conditioning Neural Vocoders
5.7 Evaluation of Synthesized Audio Quality
5.8 Hands-on Practical: Using a Neural Vocoder