Previous chapters focused on generating intermediate acoustic representations, such as mel-spectrograms, from text using Text-to-Speech (TTS) models. These representations, however, are not sound that can be played back directly. To produce the final speech waveform, we need a component that converts acoustic features into a high-fidelity audio signal. This component is known as a vocoder.
Traditional vocoding methods, often based on signal processing techniques like Griffin-Lim, can synthesize intelligible speech but frequently suffer from artifacts and lack naturalness. This chapter concentrates on modern neural vocoders, which utilize deep learning to generate significantly higher-quality audio.
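To make that contrast concrete, the sketch below applies classical Griffin-Lim phase reconstruction to a mel-spectrogram with torchaudio. The mel input is a random placeholder standing in for acoustic-model output, and the settings (1024-point FFT, hop length 256, 22.05 kHz, 80 mel bins) are illustrative assumptions rather than values tied to any particular model.

```python
import torch
import torchaudio

# Placeholder mel-spectrogram (batch, n_mels, frames) standing in for the
# output of a TTS acoustic model.
mel = torch.rand(1, 80, 200)

# Griffin-Lim operates on linear-frequency magnitude spectrograms, so the mel
# representation is first mapped back to the linear frequency scale.
inverse_mel = torchaudio.transforms.InverseMelScale(
    n_stft=1024 // 2 + 1, n_mels=80, sample_rate=22050
)
griffin_lim = torchaudio.transforms.GriffinLim(n_fft=1024, hop_length=256)

linear_spec = inverse_mel(mel)       # (1, 513, 200)
waveform = griffin_lim(linear_spec)  # (1, num_samples)

torchaudio.save("griffin_lim_output.wav", waveform, 22050)
```

Even with an accurate spectrogram, this iterative phase estimation tends to produce the metallic, buzzy artifacts mentioned above, which is the main motivation for the neural vocoders covered in this chapter.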
You will examine several families of neural vocoders: autoregressive waveform models such as WaveNet and WaveRNN, flow-based models such as WaveGlow and FloWaveNet, GAN-based models such as MelGAN and HiFi-GAN, and diffusion-based approaches.
We will also cover how these models are conditioned on acoustic features and discuss methods for evaluating the quality of the generated audio. The chapter includes a hands-on section where you will use a pre-trained neural vocoder to synthesize audio.
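As a preview of the hands-on material in section 5.8, the following sketch runs a pre-trained HiFi-GAN vocoder from SpeechBrain on a placeholder mel-spectrogram. The model identifier and the decode_batch call follow SpeechBrain's published interface, but the import path differs between library versions, and the random mel input is only a stand-in for real acoustic-model output.

```python
import torch
import torchaudio
from speechbrain.inference.vocoders import HIFIGAN  # older releases: speechbrain.pretrained

# Download a HiFi-GAN vocoder trained on LJSpeech (80 mel bins, 22.05 kHz audio).
hifi_gan = HIFIGAN.from_hparams(source="speechbrain/tts-hifigan-ljspeech")

# Placeholder mel-spectrogram batch: (batch, n_mels, frames). In practice this
# tensor comes from an acoustic model such as Tacotron 2 or FastSpeech 2.
mel = torch.rand(1, 80, 200)

# Convert mel-spectrograms to waveforms; the result has shape (batch, 1, samples).
waveforms = hifi_gan.decode_batch(mel)

torchaudio.save("hifigan_output.wav", waveforms.squeeze(1), 22050)
```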
5.1 Limitations of Traditional Vocoders
5.2 Autoregressive Waveform Models (WaveNet, WaveRNN)
5.3 Flow-Based Vocoders (WaveGlow, FloWaveNet)
5.4 GAN-Based Vocoders (MelGAN, HiFi-GAN)
5.5 Diffusion Models for Vocoding
5.6 Conditioning Neural Vocoders
5.7 Evaluation of Synthesized Audio Quality
5.8 Hands-on Practical: Using a Neural Vocoder