Generative models seek to learn the underlying probability distribution of a dataset, pdata(x), enabling us to generate new samples that resemble the original data. However, models differ significantly in how they represent and learn this distribution, leading to different strengths, weaknesses, and application areas. Understanding this classification helps in selecting the right tool for a task and appreciating the design choices behind models like GANs and Diffusion Models.
We can broadly categorize generative models based on how they handle the probability density function pmodel(x) they are trying to learn.
Explicit Density Models
These models explicitly define and learn a probability density function pmodel(x;θ) parameterized by θ. The goal during training is typically to maximize the likelihood of the training data under this model. Within this category, we find two main approaches:
- Tractable Explicit Density Models: These models define a density function pmodel(x;θ) that is computationally tractable, meaning we can directly calculate the likelihood pmodel(x) for any given data point x.
- Examples:
- Autoregressive Models (e.g., PixelRNN, PixelCNN): These models decompose the joint probability distribution over dimensions (e.g., pixels in an image) into a product of conditional probabilities using the chain rule: p(x) = ∏i p(xi | x1, ..., xi−1). Each conditional probability is modeled by a neural network. While they can achieve high likelihood scores, generation is sequential and often slow, since predicting xi requires all previous values x1, ..., xi−1 (a minimal sampling loop is sketched after this list).
- Normalizing Flows: These models transform a simple base distribution pz(z) (e.g., a Gaussian) into a complex data distribution px(x) using a series of invertible transformations f with tractable Jacobians. The change-of-variables formula allows exact likelihood calculation: px(x) = pz(f−1(x)) |det(∂f−1(x)/∂x)|. They offer exact likelihood evaluation and efficient sampling once trained, but designing expressive yet invertible transformations with tractable Jacobians can be challenging (a toy example appears after this list).
- Characteristics: Can directly evaluate likelihoods; often achieve state-of-the-art log-likelihood scores; generation can be slow (autoregressive) or constrained by architecture (flows).
- Approximate Explicit Density Models: These models still define an explicit pmodel(x;θ), but it involves latent variables z and is generally intractable to compute or optimize directly. Instead, they optimize a lower bound on the log-likelihood (the Evidence Lower Bound, or ELBO) or use other approximation techniques.
- Example: Variational Autoencoders (VAEs): VAEs introduce latent variables z and assume data is generated via p(x∣z). They use an encoder network q(z∣x) to approximate the true posterior p(z∣x) and a decoder network p(x∣z). Training maximizes the ELBO (a minimal loss implementation is sketched after this list):
log p(x) ≥ E_{z∼q(z∣x)}[log p(x∣z)] − DKL(q(z∣x) ∥ p(z))
where p(z) is a prior distribution (e.g., standard Gaussian) and DKL is the Kullback-Leibler divergence.
- Characteristics: Provide a learned latent space; relatively stable training; fast generation/sampling; samples are often somewhat blurrier than those from GANs; only a lower bound on the likelihood is available.
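To make the chain-rule factorization and the cost of sequential generation concrete, here is a rough sketch of autoregressive sampling. `ARModel` is a hypothetical GRU-based conditional model used only for illustration, not PixelRNN or PixelCNN itself.

```python
import torch
import torch.nn as nn

class ARModel(nn.Module):
    """Hypothetical autoregressive model over sequences of discrete values.

    Given the elements generated so far, it outputs logits for the
    conditional distribution p(x_i | x_1, ..., x_{i-1}).
    """
    def __init__(self, num_values=256, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(num_values, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_values)

    def forward(self, prefix):                   # prefix: (batch, t) integer tokens
        h, _ = self.rnn(self.embed(prefix))
        return self.head(h[:, -1])               # logits for the next element

@torch.no_grad()
def sample(model, seq_len=16, batch=4):
    """Draw x_i ~ p(x_i | x_1, ..., x_{i-1}) one position at a time."""
    x = torch.zeros(batch, 1, dtype=torch.long)  # a fixed "start" token
    for _ in range(seq_len):                     # one forward pass per element -> slow
        logits = model(x)
        nxt = torch.distributions.Categorical(logits=logits).sample()
        x = torch.cat([x, nxt.unsqueeze(1)], dim=1)
    return x[:, 1:]                              # drop the start token

samples = sample(ARModel())                      # (4, 16) samples from an (untrained) model
```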
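Similarly, the change-of-variables formula can be illustrated with a toy element-wise affine flow: its Jacobian is diagonal, so the log-determinant reduces to a sum of log scales. Real flows (e.g., coupling layers) are considerably more elaborate; this is only a sketch of the likelihood computation.

```python
import torch
import torch.nn as nn

class AffineFlow(nn.Module):
    """Toy invertible transform x = f(z) = z * exp(log_scale) + shift (element-wise)."""
    def __init__(self, dim):
        super().__init__()
        self.log_scale = nn.Parameter(torch.zeros(dim))
        self.shift = nn.Parameter(torch.zeros(dim))

    def inverse(self, x):
        # z = f^{-1}(x); the Jacobian of f^{-1} is diagonal, so
        # log|det(d f^{-1}/d x)| = -sum(log_scale).
        z = (x - self.shift) * torch.exp(-self.log_scale)
        log_det = -self.log_scale.sum()
        return z, log_det

def log_likelihood(flow, x):
    # Change of variables: log p_x(x) = log p_z(f^{-1}(x)) + log|det(d f^{-1}(x)/d x)|
    z, log_det = flow.inverse(x)
    base = torch.distributions.Normal(0.0, 1.0)   # simple base distribution p_z
    return base.log_prob(z).sum(dim=-1) + log_det

x = torch.randn(32, 4)                 # a batch of 4-dimensional "data" points
flow = AffineFlow(dim=4)
exact_ll = log_likelihood(flow, x)     # exact per-example log-likelihood, ready to maximize
```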
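Finally, the ELBO above maps directly onto a training loss. The sketch below assumes hypothetical `encoder` and `decoder` networks with a diagonal-Gaussian q(z∣x) and a Bernoulli-style p(x∣z), and approximates the expectation with a single Monte Carlo sample.

```python
import torch
import torch.nn.functional as F

def negative_elbo(encoder, decoder, x):
    """Negative ELBO for one batch: reconstruction term + KL(q(z|x) || N(0, I)).

    `encoder` returns (mu, log_var) of the Gaussian q(z|x); `decoder` returns
    logits of a Bernoulli p(x|z). Both are hypothetical networks.
    """
    mu, log_var = encoder(x)
    std = torch.exp(0.5 * log_var)
    z = mu + std * torch.randn_like(std)            # reparameterization trick

    # E_{z~q(z|x)}[log p(x|z)], approximated with a single Monte Carlo sample.
    recon = F.binary_cross_entropy_with_logits(decoder(z), x, reduction="sum")

    # Closed-form KL between the diagonal Gaussian q(z|x) and the standard normal prior p(z).
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())

    return (recon + kl) / x.shape[0]                # average negative ELBO per example
```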
Implicit Density Models
These models learn to generate samples from pmodel(x) without explicitly defining the density function itself. Instead, they provide a mechanism to sample directly from the distribution.
- Example: Generative Adversarial Networks (GANs): As revisited earlier, GANs use a generator network G that maps a random noise vector z (sampled from a simple prior pz(z)) to a data sample x = G(z). A discriminator network D tries to distinguish real data from generated samples. The generator learns implicitly to transform the prior distribution pz(z) into the target data distribution pdata(x) by trying to fool the discriminator; we never explicitly write down or compute pmodel(x). A single training step is sketched after this list.
- Characteristics: Can produce very sharp, high-fidelity samples; generation is typically fast (a single forward pass through the generator); training can be unstable (mode collapse, oscillations); evaluating the likelihood pmodel(x) is generally not possible.
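To ground the adversarial setup, here is a minimal sketch of one training step with hypothetical networks `G` (noise to sample) and `D` (sample to a single real/fake logit). It uses the standard non-saturating losses and omits the many stabilization tricks needed in practice.

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, opt_g, opt_d, real, z_dim=100):
    """One adversarial update with hypothetical generator G and discriminator D."""
    batch = real.shape[0]
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

    # Discriminator: push D(real) toward 1 and D(G(z)) toward 0.
    fake = G(torch.randn(batch, z_dim)).detach()    # no gradients into G here
    d_loss = (F.binary_cross_entropy_with_logits(D(real), ones)
              + F.binary_cross_entropy_with_logits(D(fake), zeros))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator: fool D (non-saturating loss); p_model(x) is never written down.
    g_loss = F.binary_cross_entropy_with_logits(D(G(torch.randn(batch, z_dim))), ones)
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

    return d_loss.item(), g_loss.item()
```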
Where Do Diffusion Models Fit?
Diffusion Models represent a more recent and highly successful family. Their classification can be nuanced:
- They are often trained by optimizing a variational lower bound on the likelihood (similar to VAEs), making them seem like Approximate Explicit Density Models. The objective function derived for Denoising Diffusion Probabilistic Models (DDPMs) is indeed a form of ELBO.
- However, they can also be interpreted through the lens of score matching, where the model learns the gradient of the log-density, ∇xlogp(x), known as the score function. This connection links them closely to the underlying explicit density function, even if the density itself isn't computed directly during generation.
- Generation involves an iterative denoising process, starting from noise and gradually refining it based on the learned score or conditional distribution p(xt−1∣xt). This differs significantly from the single-pass generation of typical VAEs or GANs (a sampling sketch follows this list).
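The iterative reverse process can be sketched as DDPM-style ancestral sampling. Here `eps_model` is a hypothetical noise-prediction network and the linear β schedule is only illustrative; this is a sketch of the sampling loop, not a faithful reproduction of any particular implementation.

```python
import torch

@torch.no_grad()
def ddpm_sample(eps_model, shape, T=1000):
    """DDPM-style ancestral sampling: start from pure noise and denoise step by step.

    `eps_model(x_t, t)` is a hypothetical network predicting the noise added at step t.
    """
    betas = torch.linspace(1e-4, 0.02, T)           # illustrative linear noise schedule
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(shape)                          # x_T ~ N(0, I)
    for t in reversed(range(T)):                    # T network evaluations -> slow sampling
        eps = eps_model(x, torch.full((shape[0],), t))
        # Mean of p(x_{t-1} | x_t) under the epsilon-parameterized reverse process.
        mean = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise     # no noise is added at the final step
    return x
```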
Diffusion models effectively combine aspects of different families. They achieve sample quality often rivaling or exceeding GANs while generally offering more stable training and an explicit (though complex) connection to likelihood-based objectives. Their main drawback historically has been slower sampling speed due to the iterative generation process, though techniques like DDIM (Denoising Diffusion Implicit Models) have significantly improved this.
Figure: A classification of major generative model families based on how they handle the probability density function. This course focuses primarily on advanced techniques within GANs (Implicit Density) and Diffusion Models (Hybrid Characteristics).
Understanding this taxonomy provides context for the models we will focus on. GANs and Diffusion Models have emerged as particularly powerful for generating complex, high-dimensional data like images, despite their different underlying principles and trade-offs, which we will explore in detail throughout this course.