Generative models aim to learn the underlying probability distribution $p(x)$ of a dataset. Once this distribution is learned, the model can generate new data samples $x_{\text{new}}$ that appear to be drawn from the same distribution as the original data. Think of generating realistic images, composing music, or creating synthetic text. Before we focus on diffusion models, let's briefly examine two prominent families of generative models: Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs). Understanding their approaches provides context for the development and advantages of diffusion models.
VAEs are based on the principles of probabilistic graphical models and variational inference. They consist of two main components, typically implemented as neural networks:

- An encoder, which maps an input $x$ to the parameters of an approximate posterior distribution $q_\phi(z|x)$ over a lower-dimensional latent variable $z$.
- A decoder, which maps a latent sample $z$ back to data space, defining the likelihood $p_\theta(x|z)$ used to reconstruct $x$.
Diagram illustrating the structure of a Variational Autoencoder.
The training objective for a VAE is derived from maximizing the Evidence Lower Bound (ELBO) on the log-likelihood of the data, $\log p(x)$. This objective typically consists of two terms:

- A reconstruction term, which encourages the decoder to accurately reproduce the input from its latent representation.
- A regularization term, the KL divergence between the approximate posterior $q_\phi(z|x)$ and the prior $p(z)$, which keeps the latent space well structured.
The ELBO is formulated as:
$$
\mathcal{L}_{\text{ELBO}}(x; \theta, \phi) = \mathbb{E}_{z \sim q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - D_{\text{KL}}\left(q_\phi(z|x) \,\|\, p(z)\right)
$$

VAEs are generally known for stable training and for learning smooth, meaningful latent spaces, where interpolating between points often yields coherent transitions in the generated data. However, their samples can appear blurry compared to the original data or to samples generated by other methods like GANs. This blurriness can be attributed partly to the nature of the reconstruction loss (such as MSE) and the Gaussian assumptions.
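To make the ELBO concrete, here is a minimal PyTorch sketch of a VAE and its loss (not part of the course code; the architecture sizes are illustrative choices). Note the reparameterization trick, $z = \mu + \sigma \odot \epsilon$, which keeps the sampling step differentiable so the encoder can be trained by backpropagation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    """Minimal VAE for flattened inputs (e.g. 28x28 images -> 784 dims).
    All layer sizes here are illustrative, not prescribed values."""
    def __init__(self, input_dim=784, hidden_dim=256, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)      # mean of q_phi(z|x)
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)  # log-variance of q_phi(z|x)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, input_dim),  # outputs logits for BCE below
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        # Reparameterization trick: z = mu + sigma * eps keeps sampling differentiable.
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * logvar) * eps
        return self.decoder(z), mu, logvar

def elbo_loss(x, x_recon_logits, mu, logvar):
    # Reconstruction term: E_q[log p_theta(x|z)], here a Bernoulli likelihood
    # via binary cross-entropy on pixel intensities in [0, 1].
    recon = F.binary_cross_entropy_with_logits(x_recon_logits, x, reduction="sum")
    # KL term: D_KL(q_phi(z|x) || p(z)) in closed form for Gaussian q and
    # standard-normal prior p(z).
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl  # minimizing this negative ELBO maximizes the ELBO
```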
GANs employ a different approach based on game theory. They involve two neural networks trained in competition:

- A generator $G$, which maps random noise vectors $z \sim p_z(z)$ to synthetic samples $G(z)$ intended to resemble the training data.
- A discriminator $D$, which receives either a real sample or a generated one and outputs the probability that its input is real.
Diagram illustrating the structure and adversarial training process of a Generative Adversarial Network.
The training process is an adversarial game:

- The discriminator is trained to correctly label real samples as real and generated samples as fake.
- The generator is trained to fool the discriminator into classifying its outputs as real.
This leads to a minimax objective function:
$$
\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right]
$$

GANs are celebrated for their ability to generate sharp and highly realistic samples, particularly in image generation tasks. However, their training dynamics can be unstable. Issues like mode collapse (where the generator produces only a limited variety of samples) and difficulties in convergence are common challenges. Finding the right balance between the generator and discriminator during training requires careful tuning of architectures, hyperparameters, and loss functions.
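As a sketch of how this objective is optimized in practice, the snippet below alternates one discriminator update and one generator update. The networks are placeholder MLPs, and the generator uses the common non-saturating loss (maximizing $\log D(G(z))$) rather than the literal minimax form, since the latter gives weak gradients early in training.

```python
import torch
import torch.nn as nn

# Placeholder generator/discriminator; any architectures with these
# input/output shapes would do.
latent_dim, data_dim = 64, 784
G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real_x):
    batch = real_x.size(0)
    # --- Discriminator update: push D(x) toward 1 and D(G(z)) toward 0 ---
    z = torch.randn(batch, latent_dim)
    fake_x = G(z).detach()  # detach so this step does not update G
    d_loss = bce(D(real_x), torch.ones(batch, 1)) + \
             bce(D(fake_x), torch.zeros(batch, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # --- Generator update: non-saturating loss, push D(G(z)) toward 1 ---
    z = torch.randn(batch, latent_dim)
    g_loss = bce(D(G(z)), torch.ones(batch, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```

Keeping the two updates balanced is exactly where the tuning difficulties mentioned above arise: if the discriminator wins too decisively, the generator's gradients vanish; if it is too weak, the generator receives no useful signal.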
Both VAEs and GANs represent significant advancements in generative modeling, each with its strengths and weaknesses. VAEs offer stable training and well-structured latent spaces but sometimes lack sample sharpness. GANs produce sharp samples but can be difficult to train and may suffer from mode collapse.
Diffusion models, the focus of this course, offer an alternative probabilistic approach. They aim to achieve both high sample quality and stable training by gradually transforming data into noise and then learning to reverse this process. This iterative refinement process, inspired by non-equilibrium thermodynamics, provides a different mechanism for modeling complex data distributions, often leading to state-of-the-art results in image synthesis and other domains. We will begin exploring the mechanics of this process in the next chapter.
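As a brief preview of the mechanics covered in the next chapter, the sketch below samples a noised version $x_t$ of a data point in one shot, using the standard Gaussian closed form $x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon$. The linear noise schedule here is one common, purely illustrative choice; schedules are discussed in detail later.

```python
import torch

def forward_noising(x0, t, alpha_bar):
    """Sample from q(x_t | x_0) for a Gaussian diffusion in one step:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps.
    `alpha_bar` holds cumulative products of (1 - beta_t) per timestep."""
    eps = torch.randn_like(x0)
    a = alpha_bar[t].view(-1, *([1] * (x0.dim() - 1)))  # broadcast over data dims
    return a.sqrt() * x0 + (1 - a).sqrt() * eps, eps

# Illustrative linear beta schedule over 1000 timesteps.
betas = torch.linspace(1e-4, 0.02, 1000)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)
```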