Diffusion models are a powerful class of generative models, recently adapted with great success to high-fidelity audio waveform generation, including vocoding. Originally inspired by non-equilibrium thermodynamics and later brought to prominence in image generation, they offer a distinct alternative to autoregressive, flow-based, and GAN-based approaches to audio synthesis.
The core idea behind diffusion models is surprisingly straightforward: systematically destroy structure in data (the "forward process") and then learn how to reverse this process to generate new data (the "reverse process").
The Forward Process: Gradual Noising
Imagine taking a clean audio waveform, $x_0$. The forward diffusion process gradually adds small amounts of Gaussian noise to this waveform over a series of $T$ discrete time steps. The amount of noise added at each step $t$ is controlled by a predefined variance schedule $\beta_t$, where $t$ ranges from $1$ to $T$. Typically, $T$ is large (e.g., 1000), and the $\beta_t$ values are small, often increasing over time.
This process defines a sequence of increasingly noisy samples $x_1, x_2, \ldots, x_T$. Each $x_t$ is generated from $x_{t-1}$ by adding noise:
$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right)$$
Here, $\mathcal{N}(x; \mu, \sigma^2 I)$ denotes a Gaussian distribution with mean $\mu$ and diagonal covariance $\sigma^2 I$. The scaling factor $\sqrt{1-\beta_t}$ ensures that the overall variance doesn't explode.
A useful property is that we can sample $x_t$ directly from the original $x_0$ without iterating through all intermediate steps. If we define $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$, then:
$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1-\bar{\alpha}_t)\, I\right)$$
As $t$ approaches $T$, $\bar{\alpha}_T$ becomes close to zero, so $x_T$ retains almost no information about the original $x_0$ and approximates a standard Gaussian distribution, $x_T \approx \mathcal{N}(0, I)$. This forward process is fixed and does not involve any learning.
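A minimal sketch of the forward process in Python/PyTorch, assuming a linear $\beta_t$ schedule; the schedule values, tensor shapes, and function names here are illustrative rather than taken from any particular vocoder:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # variance schedule beta_1..beta_T (assumed linear)
alphas = 1.0 - betas                         # alpha_t = 1 - beta_t
alpha_bars = torch.cumprod(alphas, dim=0)    # alpha_bar_t = prod_{i<=t} alpha_i

def q_step(x_prev: torch.Tensor, t: int) -> torch.Tensor:
    """One forward step: sample x_t ~ q(x_t | x_{t-1}); t is 1-indexed."""
    return (1.0 - betas[t - 1]).sqrt() * x_prev + betas[t - 1].sqrt() * torch.randn_like(x_prev)

def q_sample(x0: torch.Tensor, t: int) -> torch.Tensor:
    """Closed-form sample x_t ~ q(x_t | x_0), skipping the intermediate steps."""
    a_bar = alpha_bars[t - 1]
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * torch.randn_like(x0)

# Example: noise a placeholder 1-second waveform at 22.05 kHz to step t = 500.
x0 = torch.randn(1, 22050)   # stand-in for a clean waveform
x_500 = q_sample(x0, t=500)
```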
The Reverse Process: Learning to Denoise
The generative part happens in the reverse process. The goal is to learn a model that can reverse the noising steps: starting from pure noise $x_T \sim \mathcal{N}(0, I)$, can we iteratively remove noise to eventually obtain a clean waveform $x_0$?
This requires learning the transition probability $p_\theta(x_{t-1} \mid x_t)$, parameterized by a neural network with parameters $\theta$. If we knew the true posterior $q(x_{t-1} \mid x_t, x_0)$, which is tractable and Gaussian when conditioned on $x_0$, we could theoretically reverse the process perfectly. However, during generation we don't have $x_0$. Diffusion models approximate this reverse transition using a neural network:
$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)$$
The neural network $\mu_\theta(x_t, t)$ predicts the mean of the distribution over $x_{t-1}$ given the noisy $x_t$ and the timestep $t$. The variance $\Sigma_\theta(x_t, t)$ is often kept fixed or tied to the forward-process variances $\beta_t$.
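For reference, the tractable posterior mentioned above has a closed form in the standard DDPM derivation:
$$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\left(x_{t-1};\ \tilde{\mu}_t(x_t, x_0),\ \tilde{\beta}_t I\right),$$
$$\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\, x_t, \qquad \tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t.$$
The learned mean $\mu_\theta$ is, in effect, an approximation of $\tilde{\mu}_t$ that does not require access to $x_0$.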
Conditioning on Acoustic Features
For vocoding, we don't want to generate random audio; we want to generate audio corresponding to specific acoustic features, typically a mel-spectrogram $c$. The reverse process must therefore be conditioned on $c$:
$$p_\theta(x_{t-1} \mid x_t, c)$$
The neural network learns to perform the denoising step $x_t \to x_{t-1}$ guided by the conditioning information $c$. This conditioning is usually incorporated into the network architecture, often using techniques similar to those in other conditional generative models:
Concatenating $c$ (potentially upsampled) with $x_t$.
Using $c$ to modulate intermediate feature maps in the network (e.g., via FiLM layers or adaptive group normalization).
Employing cross-attention mechanisms between representations of $x_t$ and $c$.
The network architecture commonly used for $\mu_\theta$ is a U-Net variant, similar to those successful in image generation, adapted for 1D audio signals. The time step $t$ is also provided as input, usually via sinusoidal embeddings, allowing the network to learn time-dependent denoising behavior.
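A minimal sketch of one way to wire in the conditioning, assuming FiLM-style modulation of 1D convolutional features by upsampled mel frames plus a sinusoidal timestep embedding; the module and its sizes are illustrative, not taken from a specific vocoder:

```python
import math
import torch
import torch.nn as nn

def timestep_embedding(t: torch.Tensor, dim: int) -> torch.Tensor:
    """Sinusoidal embedding of the diffusion step t; t has shape (batch,)."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

class FiLMBlock(nn.Module):
    """1D conv block whose features are scaled and shifted by mel + timestep conditioning."""
    def __init__(self, channels: int, n_mels: int, t_dim: int = 128):
        super().__init__()
        self.t_dim = t_dim
        self.conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        # Upsampled mel frames and the time embedding jointly produce per-channel (scale, shift).
        self.cond = nn.Conv1d(n_mels, 2 * channels, kernel_size=1)
        self.time = nn.Linear(t_dim, 2 * channels)

    def forward(self, h, mel_upsampled, t):
        # h: (B, C, L) waveform features; mel_upsampled: (B, n_mels, L); t: (B,) integer steps
        cond = self.cond(mel_upsampled) + self.time(timestep_embedding(t, self.t_dim))[:, :, None]
        scale, shift = cond.chunk(2, dim=1)
        return torch.relu(self.conv(h) * (1 + scale) + shift)
```

In a full vocoder, several such blocks would typically be stacked inside a 1D U-Net, with the mel-spectrogram upsampled (e.g., by transposed convolutions) to match the waveform length.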
Training the Model
Instead of directly parameterizing $\mu_\theta(x_t, t, c)$, it is often more effective to train a network $\epsilon_\theta(x_t, t, c)$ to predict the noise component $\epsilon$ that was added to obtain $x_t$ from $x_0$ in the forward process. Recall that $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$.
The training objective simplifies to minimizing the mean squared error between the true noise $\epsilon$ and the predicted noise $\epsilon_\theta$, averaged over random time steps $t$, data samples $x_0$, and noise samples $\epsilon$:
$$\mathcal{L}_{\text{simple}} = \mathbb{E}_{t \sim \mathcal{U}\{1, \ldots, T\},\ x_0 \sim \text{data},\ \epsilon \sim \mathcal{N}(0, I)}\left[\left\| \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon,\ c,\ t\right) \right\|^2\right]$$
This objective trains the network to estimate the noise at any given noise level $t$, conditioned on the acoustic features $c$.
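A minimal training-loss sketch for this objective, reusing the `alpha_bars` schedule from the earlier snippet and assuming `model` is any noise-prediction network with the (hypothetical) signature `model(x_t, mel, t)`:

```python
import torch
import torch.nn.functional as F

def diffusion_loss(model, x0, mel, alpha_bars, T=1000):
    """L_simple: predict the noise that produced x_t from x_0."""
    b = x0.shape[0]
    alpha_bars = alpha_bars.to(x0.device)
    t = torch.randint(1, T + 1, (b,), device=x0.device)         # t ~ Uniform{1, ..., T}
    eps = torch.randn_like(x0)                                   # eps ~ N(0, I)
    a_bar = alpha_bars[t - 1].view(b, *([1] * (x0.dim() - 1)))   # broadcast over waveform dims
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps           # closed-form q(x_t | x_0)
    return F.mse_loss(model(x_t, mel, t), eps)                   # ||eps - eps_theta(...)||^2
```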
Inference: Generating Audio
Audio generation starts by sampling $x_T$ from a standard Gaussian distribution $\mathcal{N}(0, I)$. Then, the learned reverse process is applied iteratively for $t = T, T-1, \ldots, 1$:
Predict the noise using the network: $\hat{\epsilon} = \epsilon_\theta(x_t, c, t)$.
Estimate the mean $\mu_\theta(x_t, t, c)$ of $p_\theta(x_{t-1} \mid x_t, c)$, which can be derived from the predicted noise $\hat{\epsilon}$. A common formulation is:
$$\mu_\theta(x_t, t, c) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon_\theta(x_t, c, t)\right)$$
Sample $x_{t-1}$ from the predicted distribution:
$$x_{t-1} \sim \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t, c),\ \sigma_t^2 I\right)$$
where $\sigma_t^2$ is the variance of the reverse step (often set based on $\beta_t$, e.g., $\sigma_t^2 = \beta_t$ or a related value).
After $T$ steps, the final sample $x_0$ represents the generated audio waveform conditioned on $c$.
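A minimal ancestral-sampling sketch of these steps, reusing the `betas`, `alphas`, and `alpha_bars` schedule from the forward-process snippet and again assuming a trained noise predictor with the (hypothetical) signature `model(x_t, mel, t)`:

```python
import torch

@torch.no_grad()
def sample(model, mel, betas, alphas, alpha_bars, length, T=1000):
    """Generate a waveform x_0 conditioned on a mel-spectrogram, one denoising step at a time."""
    x = torch.randn(1, length)                        # x_T ~ N(0, I)
    for t in range(T, 0, -1):
        eps_hat = model(x, mel, torch.tensor([t]))    # predicted noise eps_theta(x_t, c, t)
        a_t, a_bar_t, b_t = alphas[t - 1], alpha_bars[t - 1], betas[t - 1]
        # mu_theta = (x_t - beta_t / sqrt(1 - alpha_bar_t) * eps_hat) / sqrt(alpha_t)
        mean = (x - b_t / (1 - a_bar_t).sqrt() * eps_hat) / a_t.sqrt()
        if t > 1:
            x = mean + b_t.sqrt() * torch.randn_like(x)   # reverse-step noise with sigma_t^2 = beta_t
        else:
            x = mean                                      # no noise added at the final step
    return x
```

With $T = 1000$ this loop calls the network a thousand times per utterance, which is exactly the inference cost discussed below.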
The forward diffusion process adds noise to clean audio until it becomes pure noise. The learned reverse process starts from noise and iteratively removes it, guided by the conditional mel-spectrogram input, to synthesize the target audio waveform.
Advantages and Disadvantages
Advantages:
High Audio Quality: Diffusion vocoders have demonstrated state-of-the-art results, generating highly natural and artifact-free audio, often comparable or superior to autoregressive models.
Stable Training: Compared to GANs, the training process based on score matching (noise prediction) tends to be more stable and less sensitive to hyperparameter tuning.
Modeling Flexibility: The framework is flexible and connects to other generative modeling approaches like score-based generative models.
Disadvantages:
Slow Inference: The standard iterative sampling process requires many sequential steps (equal to T, often hundreds or thousands), making inference significantly slower than parallel methods like GANs or flow-based vocoders. This is a major drawback for real-time applications.
Computational Cost: Both training and inference can be computationally expensive due to the iterative nature and potentially large U-Net architectures.
Accelerating Diffusion Inference
The slow sampling speed is a primary area of research. Several techniques aim to reduce the number of denoising steps required during inference to some $N < T$:
Denoising Diffusion Implicit Models (DDIM): Modify the sampling process to take larger steps, significantly reducing the number of iterations needed (e.g., from 1000 down to 50 or fewer) often with minimal impact on quality.
Optimized Sampling Schedules: Design non-uniform time step schedules for sampling, focusing computation on more critical parts of the denoising process.
Knowledge Distillation: Train a faster model (like a GAN or flow-based model) to mimic the output distribution of a pre-trained diffusion model.
Even with acceleration, diffusion vocoders typically remain slower than the fastest parallel alternatives like HiFi-GAN.
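To illustrate the DDIM idea from the list above, a deterministic update over a shortened step schedule might look like the following sketch (assuming the same `alpha_bars` schedule and `model(x_t, mel, t)` signature as before, and the DDIM update with its stochastic noise term set to zero):

```python
import torch

@torch.no_grad()
def ddim_sample(model, mel, alpha_bars, length, T=1000, steps=50):
    """Deterministic DDIM-style sampling with far fewer iterations than T."""
    x = torch.randn(1, length)                            # x_T ~ N(0, I)
    ts = torch.linspace(T, 1, steps).long().tolist()      # shortened step schedule
    for i, t in enumerate(ts):
        eps_hat = model(x, mel, torch.tensor([t]))
        a_bar = alpha_bars[t - 1]
        # Predict x_0 from the current sample and the predicted noise.
        x0_pred = (x - (1 - a_bar).sqrt() * eps_hat) / a_bar.sqrt()
        if i + 1 < len(ts):
            a_bar_prev = alpha_bars[ts[i + 1] - 1]
            # Deterministic DDIM step (sigma = 0): jump directly to the previous noise level.
            x = a_bar_prev.sqrt() * x0_pred + (1 - a_bar_prev).sqrt() * eps_hat
        else:
            x = x0_pred
    return x
```

With `steps = 50`, the network is evaluated 20 times less often than in the full ancestral loop above.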
Comparison to Other Vocoders
vs. Autoregressive (WaveNet): Diffusion models can achieve similar or better audio quality but avoid the strictly sequential generation, although their own iterative process is also slow unless accelerated. Training diffusion models might be simpler.
vs. Flow-Based (WaveGlow): Flow-based models allow fast, parallel sampling but might sometimes yield slightly lower audio quality compared to the best diffusion models. Diffusion models offer potentially higher quality at the cost of speed.
vs. GAN-Based (HiFi-GAN): GAN vocoders offer very fast parallel sampling and high quality. Diffusion models can match or exceed this quality but are typically slower to sample. Diffusion training is often considered more stable than GAN training.
Diffusion models provide a compelling approach to vocoding, trading off sampling speed for potentially superior audio fidelity and training stability. As research continues to accelerate their inference, they are becoming increasingly practical options for generating high-quality speech waveforms.
Denoising Diffusion Probabilistic Models. Jonathan Ho, Ajay Jain, Pieter Abbeel (2020). Advances in Neural Information Processing Systems 33 (Curran Associates, Inc.). DOI: 10.5555/3455702.3455871. This seminal paper introduced the Denoising Diffusion Probabilistic Models (DDPM) framework, detailing the forward and reverse processes and the simplified training objective that established the foundation for diffusion models.
Denoising Diffusion Implicit Models. Jiaming Song, Chenlin Meng, Stefano Ermon (2021). International Conference on Learning Representations (ICLR). DOI: 10.48550/arXiv.2010.02502. This paper introduced Denoising Diffusion Implicit Models (DDIM), a method for significantly faster inference with fewer steps while maintaining generation quality, a critical contribution for applications requiring efficient sampling.
DiffWave: A Versatile Diffusion Model for Audio Synthesis. Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, Bryan Catanzaro (2021). International Conference on Learning Representations (ICLR). DOI: 10.48550/arXiv.2009.09761. This work was among the first to successfully apply diffusion models to high-fidelity audio waveform generation, demonstrating their promise for vocoding and general audio synthesis by adapting the DDPM framework to 1D audio signals.
ProDiff: Progressive Fast Diffusion Model for High-Quality Text-to-Speech. Rongjie Huang, Zhou Zhao, Huadai Liu, Jinglin Liu, Chenye Cui, Yi Ren (2022). Proceedings of the 30th ACM International Conference on Multimedia (ACM). DOI: 10.48550/arXiv.2207.05831. This research proposed ProDiff, a progressive approach to fast, high-quality diffusion-based text-to-speech that directly addresses the slow inference of diffusion models in speech synthesis.