Generative models aim to learn the underlying probability distribution p(x) of a dataset. Once this distribution is learned, the model can generate new data samples x_new that appear to be drawn from the same distribution as the original data. Think of generating realistic images, composing music, or creating synthetic text. Before focusing on diffusion models, two prominent families of generative models, Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs), are briefly examined. Understanding their approaches provides context for the development and advantages of diffusion models.
Variational Autoencoders (VAEs)
VAEs are based on the principles of probabilistic graphical models and variational inference. They consist of two main components, typically implemented as neural networks:
- Encoder (Recognition Model): This network, often denoted as qϕ(z∣x), takes an input data point x and maps it to a distribution in a lower-dimensional latent space z. Usually, this distribution is parameterized as a Gaussian with mean μ(x) and variance σ²(x). Instead of outputting a single point in the latent space, the encoder outputs the parameters of a probability distribution from which we can sample latent vectors z.
- Decoder (Generative Model): This network, pθ(x∣z), takes a latent vector z sampled from the distribution provided by the encoder (or sampled from a prior distribution p(z) during generation) and reconstructs the original data point x. It aims to map points from the latent space back to the data space.
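The encoder-sample-decoder flow can be sketched in a few lines of NumPy. This is a minimal illustration, not a trained model: the dimensions and the random linear maps standing in for the encoder and decoder networks are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions for illustration: 784-dim data, 2-dim latent space.
x_dim, z_dim = 784, 2

# Stand-in "networks": small random linear maps playing the role of the
# encoder and decoder. A real VAE would use trained neural networks here.
W_mu = rng.normal(size=(z_dim, x_dim)) * 0.01
W_logvar = rng.normal(size=(z_dim, x_dim)) * 0.01
W_dec = rng.normal(size=(x_dim, z_dim)) * 0.01

def encode(x):
    """Map x to the parameters (mean, log-variance) of q_phi(z|x)."""
    return W_mu @ x, W_logvar @ x

def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps with eps ~ N(0, I).

    This "reparameterization trick" is what lets gradients flow through
    the sampling step when training a real VAE.
    """
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def decode(z):
    """Map a latent vector z back to data space (the mean of p_theta(x|z))."""
    return 1.0 / (1.0 + np.exp(-(W_dec @ z)))  # sigmoid for [0, 1]-valued data

x = rng.random(x_dim)           # a fake "data point"
mu, logvar = encode(x)          # encoder outputs distribution parameters
z = reparameterize(mu, logvar)  # sample a latent vector
x_recon = decode(z)             # decoder reconstructs in data space
```

Note that the encoder outputs two vectors (a mean and a log-variance) rather than a single latent point; the latent vector z is then sampled from that distribution before decoding.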
Diagram illustrating the structure of a Variational Autoencoder.
The training objective for a VAE is derived from maximizing the Evidence Lower Bound (ELBO) on the log-likelihood of the data, log p(x). This objective typically consists of two terms:
- Reconstruction Loss: Measures how well the decoder reconstructs the input data after encoding and decoding. Often implemented as Mean Squared Error (MSE) for continuous data or Binary Cross-Entropy (BCE) for binary data. This encourages pθ(x∣z) to produce outputs similar to the input x given a latent representation z derived from x.
- KL Divergence Regularizer: Measures the difference between the encoder's output distribution qϕ(z∣x) and a predefined prior distribution for the latent variables, usually a standard Gaussian p(z)=N(0,I). This term acts as a regularizer, encouraging the encoder to distribute the latent representations smoothly around the origin, which facilitates sampling new points during generation.
The ELBO is formulated as:
$$\mathcal{L}_{\text{ELBO}}(x; \theta, \phi) = \mathbb{E}_{z \sim q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - D_{\text{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big)$$
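Both terms of the ELBO can be computed directly. The sketch below assumes a Bernoulli decoder (so the reconstruction term is binary cross-entropy) and a diagonal-Gaussian encoder, for which the KL divergence against the standard-normal prior has a well-known closed form; the specific choice of likelihood is an assumption for illustration.

```python
import numpy as np

def elbo_loss(x, x_recon, mu, logvar, eps=1e-7):
    """Negative ELBO for a Bernoulli decoder and diagonal-Gaussian encoder.

    Reconstruction term: binary cross-entropy, i.e. -E_q[log p_theta(x|z)],
    approximated with the single z sample that produced x_recon.
    KL term: closed form of D_KL(N(mu, sigma^2) || N(0, I)):
        0.5 * sum(mu^2 + sigma^2 - 1 - log sigma^2)
    """
    recon = -np.sum(x * np.log(x_recon + eps)
                    + (1.0 - x) * np.log(1.0 - x_recon + eps))
    kl = 0.5 * np.sum(mu**2 + np.exp(logvar) - 1.0 - logvar)
    return recon + kl  # minimized during training (= negative ELBO)

# Sanity check: with a perfect reconstruction and q_phi equal to the
# prior (mu = 0, logvar = 0), both terms are (close to) zero.
x = np.array([0.0, 1.0, 1.0, 0.0])
loss = elbo_loss(x, x_recon=x, mu=np.zeros(2), logvar=np.zeros(2))
```

Minimizing this quantity trades off reconstruction fidelity against how far the encoder's distribution drifts from the prior, which is exactly the tension described by the two ELBO terms above.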
VAEs are generally known for stable training and learning smooth, meaningful latent spaces where interpolating between points often yields coherent transitions in the generated data. However, they sometimes suffer from producing samples that appear blurry compared to the original data or samples generated by other methods like GANs. This blurriness can be attributed partly to the nature of the reconstruction loss (like MSE) and the Gaussian assumptions.
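The latent-space interpolation mentioned above is simple to express: walk a straight line between two latent vectors and decode each intermediate point with a trained decoder. The snippet only shows the interpolation itself; the decoder is omitted.

```python
import numpy as np

def interpolate(z_a, z_b, steps=5):
    """Linearly interpolate between two latent vectors.

    Decoding each intermediate z with a trained VAE decoder typically
    yields a smooth visual transition between the two data points.
    """
    alphas = np.linspace(0.0, 1.0, steps)
    return [(1.0 - a) * z_a + a * z_b for a in alphas]

path = interpolate(np.array([0.0, 0.0]), np.array([1.0, 2.0]))
```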
Generative Adversarial Networks (GANs)
GANs employ a different approach based on game theory. They involve two neural networks trained in competition:
- Generator (G): This network takes a random noise vector z (typically sampled from a simple distribution like a Gaussian or Uniform distribution) as input and tries to generate data x̂ = G(z) that looks indistinguishable from real data.
- Discriminator (D): This network acts as a classifier. It takes either a real data sample x from the training set or a fake sample x̂ from the generator and tries to determine whether the input is real or fake. It outputs a probability D(x) indicating the likelihood that the input x is real.
Diagram illustrating the structure and adversarial training process of a Generative Adversarial Network.
The training process is an adversarial game:
- The Discriminator is trained to maximize its ability to correctly classify real and fake samples. It wants to maximize D(x) for real x and minimize D(G(z)) (maximize 1−D(G(z))) for fake samples.
- The Generator is trained to minimize the Discriminator's ability to detect its fakes. It wants to maximize D(G(z)), effectively fooling the Discriminator into thinking the generated samples are real.
This leads to a minimax objective function:
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]$$
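The value V(D, G) can be estimated from a batch of discriminator outputs, which makes the opposing objectives concrete. The sketch below takes the discriminator's outputs as plain arrays rather than running actual networks.

```python
import numpy as np

def value_fn(d_real, d_fake):
    """Monte Carlo estimate of the minimax objective V(D, G).

    d_real: discriminator outputs D(x) on a batch of real samples.
    d_fake: discriminator outputs D(G(z)) on a batch of generated samples.
    The discriminator ascends this value; the generator descends it.
    """
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

# At the theoretical equilibrium the discriminator is maximally confused
# and outputs 0.5 everywhere, giving V = 2 * log(0.5) ≈ -1.386.
v_fooled = value_fn(np.full(8, 0.5), np.full(8, 0.5))

# A confident discriminator (D(x) near 1, D(G(z)) near 0) pushes V toward 0.
v_confident = value_fn(np.full(8, 0.99), np.full(8, 0.01))
```

In practice the generator is often trained to maximize log D(G(z)) instead of minimizing log(1 − D(G(z))), since the latter provides weak gradients early in training when the discriminator easily rejects the fakes.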
GANs are celebrated for their ability to generate sharp and highly realistic samples, particularly in image generation tasks. However, their training dynamics can be unstable. Issues like mode collapse (where the generator produces only a limited variety of samples) and difficulties in convergence are common challenges. Finding the right balance between the generator and discriminator during training requires careful tuning of architectures, hyperparameters, and loss functions.
Setting the Stage for Diffusion Models
Both VAEs and GANs represent significant advancements in generative modeling, each with its strengths and weaknesses. VAEs offer stable training and well-structured latent spaces but sometimes lack sample sharpness. GANs produce sharp samples but can be difficult to train and may suffer from mode collapse.
Diffusion models, the focus of this course, offer an alternative probabilistic approach. They aim to achieve both high sample quality and stable training by gradually transforming data into noise and then learning to reverse this process. This iterative refinement process, inspired by non-equilibrium thermodynamics, provides a different mechanism for modeling complex data distributions, often leading to state-of-the-art results in image synthesis and other domains. We will begin exploring the mechanics of this process in the next chapter.