Diffusion models represent another powerful class of generative models, recently adapted with great success for high-fidelity audio waveform generation, including vocoding tasks. Originating from thermodynamics and later finding prominence in image generation, their application to audio synthesis offers a distinct alternative to autoregressive, flow-based, and GAN-based approaches.
The core idea behind diffusion models is surprisingly straightforward: systematically destroy structure in data (the "forward process") and then learn how to reverse this process to generate new data (the "reverse process").
Imagine taking a clean audio waveform, $x_0$. The forward diffusion process gradually adds small amounts of Gaussian noise to this waveform over a series of $T$ discrete time steps. The amount of noise added at each step $t$ is controlled by a predefined variance schedule, $\beta_t$, where $t$ ranges from $1$ to $T$. Typically, $T$ is large (e.g., 1000), and the $\beta_t$ values are small, often increasing over time.
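Such a schedule can be precomputed once up front. The sketch below uses a simple linear schedule; the endpoint values are chosen purely for illustration and are not prescribed by the text:

```python
import torch

T = 1000                                  # number of diffusion steps
beta = torch.linspace(1e-4, 0.02, T)      # beta_t: small variances, increasing over time
```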
This process defines a sequence of increasingly noisy samples $x_1, x_2, \dots, x_T$. Each $x_t$ is generated from $x_{t-1}$ by adding noise:
$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right)$$
Here, $\mathcal{N}(x; \mu, \sigma^2 I)$ denotes a Gaussian distribution with mean $\mu$ and diagonal covariance $\sigma^2 I$. The scaling factor $\sqrt{1-\beta_t}$ ensures that the overall variance doesn't explode.
A useful property is that we can directly sample $x_t$ given the original $x_0$ without iterating through all intermediate steps. If we define $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$, then:
$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1 - \bar{\alpha}_t)\, I\right)$$
As $t$ approaches $T$, $\bar{\alpha}_T$ becomes close to zero, meaning that $x_T$ loses almost all information about the original $x_0$ and approximates a standard Gaussian distribution, $x_T \approx \mathcal{N}(0, I)$. This forward process is fixed and does not involve any learning.
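Continuing the earlier snippet, this closed-form property lets us noise a clean waveform to any level $t$ in a single step. A minimal sketch (the tensor shapes are assumptions for illustration, not taken from any particular codebase):

```python
alpha = 1.0 - beta                          # alpha_t = 1 - beta_t
alpha_bar = torch.cumprod(alpha, dim=0)     # cumulative products alpha_bar_t

def q_sample(x0: torch.Tensor, t: torch.Tensor, eps: torch.Tensor) -> torch.Tensor:
    """Sample x_t ~ q(x_t | x_0) directly, without stepping through x_1 .. x_{t-1}.

    x0:  clean waveform, shape (batch, samples)
    t:   integer timesteps in [1, T], shape (batch,)
    eps: standard Gaussian noise with the same shape as x0
    """
    a_bar = alpha_bar[t - 1].view(-1, *([1] * (x0.dim() - 1)))   # broadcast over waveform dims
    return torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * eps
```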
The generative part happens in the reverse process. The goal is to learn a model that can reverse the noising steps: starting from pure noise $x_T \sim \mathcal{N}(0, I)$, can we iteratively remove noise to eventually obtain a clean waveform $x_0$?
This requires learning the transition probability $p_\theta(x_{t-1} \mid x_t)$, parameterized by a neural network $\theta$. If we knew the true posterior $q(x_{t-1} \mid x_t, x_0)$, which is tractable and Gaussian when conditioned on $x_0$, we could theoretically reverse the process perfectly. However, during generation, we don't have $x_0$. Diffusion models approximate this reverse transition using a neural network:
$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)$$
The neural network $\mu_\theta(x_t, t)$ needs to predict the mean of the distribution for $x_{t-1}$ given the noisy $x_t$ and the timestep $t$. The variance $\Sigma_\theta(x_t, t)$ is often kept fixed or related to the forward process variances $\beta_t$.
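For reference, the true posterior mentioned above has a well-known closed form in the standard DDPM formulation, written here in terms of the $\alpha_t$ and $\bar{\alpha}_t$ defined earlier:
$$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\left(x_{t-1};\ \tilde{\mu}_t(x_t, x_0),\ \tilde{\beta}_t I\right), \qquad
\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\, x_t, \qquad
\tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t$$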
For vocoding, we don't want to generate random audio; we want to generate audio corresponding to specific acoustic features, typically a mel-spectrogram $c$. The reverse process must therefore be conditioned on $c$:
$$p_\theta(x_{t-1} \mid x_t, c)$$
The neural network learns to perform the denoising step $x_t \rightarrow x_{t-1}$ guided by the conditioning information $c$. This conditioning is usually incorporated into the network architecture, often using techniques similar to those in other conditional generative models, for example upsampling the mel-spectrogram to the waveform's temporal resolution and adding it to intermediate feature maps, or using it to modulate hidden activations.
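The sketch below illustrates the first of these options. The module and parameter names are hypothetical, chosen for illustration rather than taken from any specific vocoder implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MelConditioner(nn.Module):
    """Project a mel-spectrogram to the denoiser's channel width and upsample it
    to waveform resolution, so it can be added to intermediate activations."""
    def __init__(self, n_mels: int, channels: int, hop_length: int):
        super().__init__()
        self.hop_length = hop_length
        self.proj = nn.Conv1d(n_mels, channels, kernel_size=1)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, n_mels, frames) -> (batch, channels, frames * hop_length)
        cond = self.proj(mel)
        cond = F.interpolate(cond, scale_factor=self.hop_length, mode="nearest")
        return cond   # added as a bias inside the denoising network's blocks
```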
The network architecture commonly used for $\mu_\theta$ is a U-Net variant, similar to those successful in image generation, adapted for 1D audio signals. The time step $t$ is also provided as input, usually via sinusoidal embeddings, allowing the network to learn time-dependent denoising behavior.
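The sinusoidal timestep embedding takes only a few lines. This is a minimal sketch of the standard transformer-style embedding; the function name and dimension argument are illustrative:

```python
import math
import torch

def timestep_embedding(t: torch.Tensor, dim: int) -> torch.Tensor:
    """Map integer timesteps t (shape: [batch]) to sinusoidal embeddings of size
    `dim`, commonly used to inject the noise level t into the denoising network."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    args = t.float()[:, None] * freqs[None, :]                       # (batch, half)
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)     # (batch, dim)
```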
Instead of directly parameterizing $\mu_\theta(x_t, t, c)$, it's often more effective to train the network $\epsilon_\theta(x_t, t, c)$ to predict the noise component $\epsilon$ that was added to obtain $x_t$ from $x_0$ in the forward process. Recall $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon$ where $\epsilon \sim \mathcal{N}(0, I)$.
The training objective simplifies to minimizing the mean squared error between the true noise $\epsilon$ and the predicted noise $\epsilon_\theta$, averaged over random time steps $t$, data samples $x_0$, and noise samples $\epsilon$:
$$L_{\text{simple}} = \mathbb{E}_{t \sim [1,T],\ x_0 \sim \text{data},\ \epsilon \sim \mathcal{N}(0,I)}\left[\left\| \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon,\ c,\ t\right) \right\|^2\right]$$
This objective trains the network to effectively estimate the noise at any given noise level $t$, conditioned on the acoustic features $c$.
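A single training step under this objective might look as follows. This is a minimal sketch: `model` is assumed to be any network taking the noisy waveform, conditioning mel-spectrogram, and timestep, and `alpha_bar` the precomputed cumulative products from earlier:

```python
import torch
import torch.nn.functional as F

def training_step(model, x0, mel, alpha_bar, T):
    """One L_simple training step: noise x0 to a random level t, regress the noise."""
    batch = x0.shape[0]
    t = torch.randint(1, T + 1, (batch,), device=x0.device)         # random timestep per sample
    eps = torch.randn_like(x0)                                       # true noise epsilon
    a_bar = alpha_bar[t - 1].view(batch, *([1] * (x0.dim() - 1)))    # broadcast to x0's shape
    x_t = torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * eps     # sample from q(x_t | x_0)
    eps_pred = model(x_t, mel, t)                                    # epsilon_theta(x_t, c, t)
    return F.mse_loss(eps_pred, eps)                                 # || eps - eps_theta ||^2
```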
Audio generation starts by sampling $x_T$ from a standard Gaussian distribution $\mathcal{N}(0, I)$. Then, the learned reverse process is applied iteratively for $t = T, T-1, \dots, 1$. With the noise-prediction parameterization, the standard update is:
$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon_\theta(x_t, c, t)\right) + \sigma_t z, \qquad z \sim \mathcal{N}(0, I),$$
where $\sigma_t^2$ is typically set to $\beta_t$ or $\tilde{\beta}_t$, and $z = 0$ at the final step $t = 1$.
After $T$ steps, the final sample $x_0$ represents the generated audio waveform conditioned on $c$.
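Put together, the ancestral sampling loop is short. The sketch below assumes the same `model` and precomputed schedules as the training example, with $\sigma_t^2 = \beta_t$:

```python
import torch

@torch.no_grad()
def sample(model, mel, alpha, alpha_bar, beta, T, waveform_shape, device="cpu"):
    """Iteratively denoise from x_T ~ N(0, I) down to x_0, conditioned on mel."""
    x = torch.randn(waveform_shape, device=device)                   # start from pure noise x_T
    for t in range(T, 0, -1):
        t_batch = torch.full((waveform_shape[0],), t, device=device, dtype=torch.long)
        eps_pred = model(x, mel, t_batch)                            # epsilon_theta(x_t, c, t)
        a, a_bar, b = alpha[t - 1], alpha_bar[t - 1], beta[t - 1]
        mean = (x - (1.0 - a) / torch.sqrt(1.0 - a_bar) * eps_pred) / torch.sqrt(a)
        noise = torch.randn_like(x) if t > 1 else torch.zeros_like(x)
        x = mean + torch.sqrt(b) * noise                             # sigma_t^2 = beta_t here
    return x                                                         # approximate x_0
```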
The forward diffusion process adds noise to clean audio until it becomes pure noise. The learned reverse process starts from noise and iteratively removes it, guided by the mel-spectrogram conditioning input, to synthesize the target audio waveform.
Advantages:

- High audio fidelity: the iterative denoising procedure can produce very natural, high-quality waveforms.
- Stable training: the simple noise-prediction (MSE) objective avoids the adversarial instabilities of GAN-based vocoders and does not require the invertible architectures of flow-based models.

Disadvantages:

- Slow sampling: inference requires many sequential denoising steps, making it far slower than single-pass parallel vocoders.
- Compute cost and latency grow with the number of denoising steps used at inference time.
The slow sampling speed is a primary area of research. Several techniques aim to reduce the number of required denoising steps ($N < T$) during inference, for example running the reverse process over a much shorter, carefully tuned noise schedule, using samplers that skip steps of the full schedule, or distilling the model into one that denoises in only a handful of steps.
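As a simple illustration of the step-skipping idea (a sketch of a strided inference schedule, not a specific published method), one can select a subsequence of the training schedule and derive the effective per-step quantities for it, then run the sampling loop above over far fewer iterations:

```python
import torch

def strided_schedule(alpha_bar: torch.Tensor, num_steps: int):
    """Pick a strided subsequence of the full T-step schedule and derive the
    effective alphas/betas between the selected steps (illustrative sketch)."""
    T = alpha_bar.shape[0]
    idx = torch.linspace(0, T - 1, num_steps).round().long()          # e.g. 50 of 1000 steps
    a_bar_sub = alpha_bar[idx]
    a_bar_prev = torch.cat([torch.ones(1, device=alpha_bar.device), a_bar_sub[:-1]])
    alpha_sub = a_bar_sub / a_bar_prev                                 # effective per-step alpha
    beta_sub = 1.0 - alpha_sub                                         # effective per-step beta
    return idx, alpha_sub, a_bar_sub, beta_sub
```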
Even with acceleration, diffusion vocoders typically remain slower than the fastest parallel alternatives like HiFi-GAN.
Diffusion models provide a compelling approach to vocoding, trading off sampling speed for potentially superior audio fidelity and training stability. As research continues to accelerate their inference, they are becoming increasingly practical options for generating high-quality speech waveforms.