Diffusion models represent another powerful class of generative models, recently adapted with great success for high-fidelity audio waveform generation, including vocoding tasks. Originating from thermodynamics and later finding prominence in image generation, their application to audio synthesis offers a distinct alternative to autoregressive, flow-based, and GAN-based approaches.
The core idea behind diffusion models is surprisingly straightforward: systematically destroy structure in data (the "forward process") and then learn how to reverse this process to generate new data (the "reverse process").
Imagine taking a clean audio waveform, $x_0$. The forward diffusion process gradually adds small amounts of Gaussian noise to this waveform over a series of $T$ discrete time steps. The amount of noise added at each step $t$ is controlled by a predefined variance schedule, $\beta_t$, where $t$ ranges from $1$ to $T$. Typically, $T$ is large (e.g., 1000), and the $\beta_t$ values are small, often increasing over time.
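Such a schedule can be precomputed once up front. The sketch below uses a simple linear schedule; the endpoint values are chosen purely for illustration and are not prescribed by the text:

```python
import torch

T = 1000                                  # number of diffusion steps
beta = torch.linspace(1e-4, 0.02, T)      # beta_t: small variances, increasing over time
```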
This process defines a sequence of increasingly noisy samples $x_1, x_2, \dots, x_T$. Each $x_t$ is generated from $x_{t-1}$ by adding noise:
$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right)$$
Here, $\mathcal{N}(x; \mu, \sigma^2 I)$ denotes a Gaussian distribution with mean $\mu$ and diagonal covariance $\sigma^2 I$. The scaling factor $\sqrt{1-\beta_t}$ ensures that the overall variance doesn't explode.
A useful property is that we can directly sample $x_t$ given the original $x_0$ without iterating through all intermediate steps. If we define $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$, then:
$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1 - \bar{\alpha}_t)\, I\right)$$
As $t$ approaches $T$, $\bar{\alpha}_T$ becomes close to zero, meaning that $x_T$ loses almost all information about the original $x_0$ and approximates a standard Gaussian distribution, $x_T \approx \mathcal{N}(0, I)$. This forward process is fixed and does not involve any learning.
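Continuing the earlier snippet, this closed-form property lets us noise a clean waveform to any level $t$ in a single step. A minimal sketch (the tensor shapes are assumptions for illustration, not taken from any particular codebase):

```python
alpha = 1.0 - beta                          # alpha_t = 1 - beta_t
alpha_bar = torch.cumprod(alpha, dim=0)     # cumulative products alpha_bar_t

def q_sample(x0: torch.Tensor, t: torch.Tensor, eps: torch.Tensor) -> torch.Tensor:
    """Sample x_t ~ q(x_t | x_0) directly, without stepping through x_1 .. x_{t-1}.

    x0:  clean waveform, shape (batch, samples)
    t:   integer timesteps in [1, T], shape (batch,)
    eps: standard Gaussian noise with the same shape as x0
    """
    a_bar = alpha_bar[t - 1].view(-1, *([1] * (x0.dim() - 1)))   # broadcast over waveform dims
    return torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * eps
```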
The generative part happens in the reverse process. The goal is to learn a model that can reverse the noising steps: starting from pure noise $x_T \sim \mathcal{N}(0, I)$, can we iteratively remove noise to eventually obtain a clean waveform $x_0$?
This requires learning the transition probability $p_\theta(x_{t-1} \mid x_t)$, parameterized by a neural network $\theta$. If we knew the true posterior $q(x_{t-1} \mid x_t, x_0)$, which is tractable and Gaussian when conditioned on $x_0$, we could theoretically reverse the process perfectly. However, during generation, we don't have $x_0$. Diffusion models approximate this reverse transition using a neural network:
$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)$$
The neural network $\mu_\theta(x_t, t)$ needs to predict the mean of the distribution for $x_{t-1}$ given the noisy $x_t$ and the timestep $t$. The variance $\Sigma_\theta(x_t, t)$ is often kept fixed or related to the forward process variances $\beta_t$.
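For reference, the true posterior mentioned above has a well-known closed form in the standard DDPM formulation, written here in terms of the $\alpha_t$ and $\bar{\alpha}_t$ defined earlier:
$$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\left(x_{t-1};\ \tilde{\mu}_t(x_t, x_0),\ \tilde{\beta}_t I\right), \qquad
\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\, x_t, \qquad
\tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t$$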
For vocoding, we don't want to generate random audio; we want to generate audio corresponding to specific acoustic features, typically a mel-spectrogram $c$. The reverse process must therefore be conditioned on $c$:
$$p_\theta(x_{t-1} \mid x_t, c)$$
The neural network learns to perform the denoising step $x_t \rightarrow x_{t-1}$ guided by the conditioning information $c$. This conditioning is usually incorporated into the network architecture, often using techniques similar to those in other conditional generative models, for example upsampling the mel-spectrogram to the waveform's temporal resolution and adding it to intermediate feature maps, or using it to modulate hidden activations.
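The sketch below illustrates the first of these options. The module and parameter names are hypothetical, chosen for illustration rather than taken from any specific vocoder implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MelConditioner(nn.Module):
    """Project a mel-spectrogram to the denoiser's channel width and upsample it
    to waveform resolution, so it can be added to intermediate activations."""
    def __init__(self, n_mels: int, channels: int, hop_length: int):
        super().__init__()
        self.hop_length = hop_length
        self.proj = nn.Conv1d(n_mels, channels, kernel_size=1)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, n_mels, frames) -> (batch, channels, frames * hop_length)
        cond = self.proj(mel)
        cond = F.interpolate(cond, scale_factor=self.hop_length, mode="nearest")
        return cond   # added as a bias inside the denoising network's blocks
```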
The network architecture commonly used for $\mu_\theta$ is a U-Net variant, similar to those successful in image generation, adapted for 1D audio signals. The time step $t$ is also provided as input, usually via sinusoidal embeddings, allowing the network to learn time-dependent denoising behavior.
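The sinusoidal timestep embedding takes only a few lines. This is a minimal sketch of the standard transformer-style embedding; the function name and dimension argument are illustrative:

```python
import math
import torch

def timestep_embedding(t: torch.Tensor, dim: int) -> torch.Tensor:
    """Map integer timesteps t (shape: [batch]) to sinusoidal embeddings of size
    `dim`, commonly used to inject the noise level t into the denoising network."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    args = t.float()[:, None] * freqs[None, :]                       # (batch, half)
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)     # (batch, dim)
```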
Instead of directly parameterizing $\mu_\theta(x_t, t, c)$, it's often more effective to train the network $\epsilon_\theta(x_t, t, c)$ to predict the noise component $\epsilon$ that was added to obtain $x_t$ from $x_0$ in the forward process. Recall $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon$ where $\epsilon \sim \mathcal{N}(0, I)$.
The training objective simplifies to minimizing the mean squared error between the true noise $\epsilon$ and the predicted noise $\epsilon_\theta$, averaged over random time steps $t$, data samples $x_0$, and noise samples $\epsilon$:
$$L_{\text{simple}} = \mathbb{E}_{t \sim [1,T],\ x_0 \sim \text{data},\ \epsilon \sim \mathcal{N}(0,I)}\left[\left\| \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon,\ c,\ t\right) \right\|^2\right]$$
This objective trains the network to effectively estimate the noise at any given noise level $t$, conditioned on the acoustic features $c$.
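A single training step under this objective might look as follows. This is a minimal sketch: `model` is assumed to be any network taking the noisy waveform, conditioning mel-spectrogram, and timestep, and `alpha_bar` the precomputed cumulative products from earlier:

```python
import torch
import torch.nn.functional as F

def training_step(model, x0, mel, alpha_bar, T):
    """One L_simple training step: noise x0 to a random level t, regress the noise."""
    batch = x0.shape[0]
    t = torch.randint(1, T + 1, (batch,), device=x0.device)         # random timestep per sample
    eps = torch.randn_like(x0)                                       # true noise epsilon
    a_bar = alpha_bar[t - 1].view(batch, *([1] * (x0.dim() - 1)))    # broadcast to x0's shape
    x_t = torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * eps     # sample from q(x_t | x_0)
    eps_pred = model(x_t, mel, t)                                    # epsilon_theta(x_t, c, t)
    return F.mse_loss(eps_pred, eps)                                 # || eps - eps_theta ||^2
```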
Audio generation starts by sampling $x_T$ from a standard Gaussian distribution $\mathcal{N}(0, I)$. Then, the learned reverse process is applied iteratively for $t = T, T-1, \dots, 1$. With the noise-prediction parameterization, the standard update is:
$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon_\theta(x_t, c, t)\right) + \sigma_t z, \qquad z \sim \mathcal{N}(0, I),$$
where $\sigma_t^2$ is typically set to $\beta_t$ or $\tilde{\beta}_t$, and $z = 0$ at the final step $t = 1$.
After $T$ steps, the final sample $x_0$ represents the generated audio waveform conditioned on $c$.
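Put together, the ancestral sampling loop is short. The sketch below assumes the same `model` and precomputed schedules as the training example, with $\sigma_t^2 = \beta_t$:

```python
import torch

@torch.no_grad()
def sample(model, mel, alpha, alpha_bar, beta, T, waveform_shape, device="cpu"):
    """Iteratively denoise from x_T ~ N(0, I) down to x_0, conditioned on mel."""
    x = torch.randn(waveform_shape, device=device)                   # start from pure noise x_T
    for t in range(T, 0, -1):
        t_batch = torch.full((waveform_shape[0],), t, device=device, dtype=torch.long)
        eps_pred = model(x, mel, t_batch)                            # epsilon_theta(x_t, c, t)
        a, a_bar, b = alpha[t - 1], alpha_bar[t - 1], beta[t - 1]
        mean = (x - (1.0 - a) / torch.sqrt(1.0 - a_bar) * eps_pred) / torch.sqrt(a)
        noise = torch.randn_like(x) if t > 1 else torch.zeros_like(x)
        x = mean + torch.sqrt(b) * noise                             # sigma_t^2 = beta_t here
    return x                                                         # approximate x_0
```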
The forward diffusion process adds noise to clean audio until it becomes pure noise. The learned reverse process starts from noise and iteratively removes it, guided by the mel-spectrogram conditioning input, to synthesize the target audio waveform.
Advantages:

- High audio fidelity: the iterative denoising procedure can produce very natural, high-quality waveforms.
- Stable training: the simple noise-prediction (MSE) objective avoids the adversarial instabilities of GAN-based vocoders and does not require the invertible architectures of flow-based models.

Disadvantages:

- Slow sampling: inference requires many sequential denoising steps, making it far slower than single-pass parallel vocoders.
- Compute cost and latency grow with the number of denoising steps used at inference time.
The slow sampling speed is a primary area of research. Several techniques aim to reduce the number of required denoising steps ($N < T$) during inference, for example running the reverse process over a much shorter, carefully tuned noise schedule, using samplers that skip steps of the full schedule, or distilling the model into one that denoises in only a handful of steps.
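As a simple illustration of the step-skipping idea (a sketch of a strided inference schedule, not a specific published method), one can select a subsequence of the training schedule and derive the effective per-step quantities for it, then run the sampling loop above over far fewer iterations:

```python
import torch

def strided_schedule(alpha_bar: torch.Tensor, num_steps: int):
    """Pick a strided subsequence of the full T-step schedule and derive the
    effective alphas/betas between the selected steps (illustrative sketch)."""
    T = alpha_bar.shape[0]
    idx = torch.linspace(0, T - 1, num_steps).round().long()          # e.g. 50 of 1000 steps
    a_bar_sub = alpha_bar[idx]
    a_bar_prev = torch.cat([torch.ones(1, device=alpha_bar.device), a_bar_sub[:-1]])
    alpha_sub = a_bar_sub / a_bar_prev                                 # effective per-step alpha
    beta_sub = 1.0 - alpha_sub                                         # effective per-step beta
    return idx, alpha_sub, a_bar_sub, beta_sub
```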
Even with acceleration, diffusion vocoders typically remain slower than the fastest parallel alternatives like HiFi-GAN.
Diffusion models provide a compelling approach to vocoding, trading off sampling speed for potentially superior audio fidelity and training stability. As research continues to accelerate their inference, they are becoming increasingly practical options for generating high-quality speech waveforms.