As we've established, the expressiveness of the approximate posterior qϕ(z∣x) is a significant factor in VAE performance. Standard choices, like diagonal Gaussians, often fall short of capturing the true complexity of pθ(z∣x). This section introduces a powerful alternative: using implicit models for the variational posterior or even the prior.
An implicit model is one where we can easily sample from the distribution but cannot easily (or at all) evaluate its probability density function (PDF) or probability mass function (PMF) for a given point. Think of a generative neural network: you feed it random noise, and it produces a complex output. You can get samples, but what's the logq(z) for a specific z? That's often intractable.
Formally, if we want an implicit posterior qϕ(z∣x), we define it through a deterministic transformation gϕ of a simple noise variable ϵ (e.g., ϵ∼N(0,I)) and the input x:
z=gϕ(ϵ,x)
While we can generate samples z from qϕ(z∣x) by first sampling ϵ and then applying gϕ, the density qϕ(z∣x) itself is not directly accessible. This is in stark contrast to, say, a Gaussian posterior where qϕ(z∣x)=N(z∣μϕ(x),Σϕ(x)), and we can directly compute logqϕ(z∣x).
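To make this concrete, here is a minimal sketch in PyTorch of such a sampler (the architecture, layer sizes, and dimensions are purely illustrative): drawing z given x is a single forward pass, but the module exposes no way to evaluate logqϕ(z∣x).

```python
import torch
import torch.nn as nn

class ImplicitPosterior(nn.Module):
    """Sampler z = g_phi(eps, x): easy to sample from, no tractable log-density."""
    def __init__(self, x_dim=784, noise_dim=32, z_dim=8, hidden=256):
        super().__init__()
        self.noise_dim = noise_dim
        self.net = nn.Sequential(
            nn.Linear(x_dim + noise_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, z_dim),
        )

    def forward(self, x):
        # Sample eps ~ N(0, I) and push it, together with x, through g_phi.
        eps = torch.randn(x.size(0), self.noise_dim, device=x.device)
        return self.net(torch.cat([x, eps], dim=1))

q = ImplicitPosterior()
x = torch.randn(16, 784)   # stand-in batch of inputs
z = q(x)                   # samples from q_phi(z|x); log q_phi(z|x) is unavailable
```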
The Evidence Lower Bound (ELBO) for a VAE is:
L(θ,ϕ;x)=Ez∼qϕ(z∣x)[logpθ(x∣z)]−KL(qϕ(z∣x)∣∣p(z))
Let's expand the KL divergence term:
KL(qϕ(z∣x)∣∣p(z))=Ez∼qϕ(z∣x)[logqϕ(z∣x)−logp(z)]
If qϕ(z∣x) is implicit, the term Ez∼qϕ(z∣x)[logqϕ(z∣x)] (the negative entropy of qϕ) becomes problematic because logqϕ(z∣x) is unknown. This is the central difficulty when using implicit models for the variational posterior.
If the prior p(z) is also chosen to be an implicit model (perhaps to represent a complex target structure in the latent space), then the Ez∼qϕ(z∣x)[logp(z)] term also becomes intractable through direct evaluation. However, if p(z) is simple (e.g., N(0,I)), this second term can still be estimated by sampling z∼qϕ(z∣x) and evaluating logp(z).
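The short sketch below makes this bookkeeping explicit, reusing the ImplicitPosterior sampler q and the batch x from the sketch above (log_px_given_z is a hypothetical stand-in for the decoder's log-likelihood). Two of the three ELBO terms can be estimated simply by sampling z; the missing piece is exactly logqϕ(z∣x).

```python
import torch
from torch.distributions import Normal

z = q(x)                                  # one sample z ~ q_phi(z|x) per input

# log p(z): tractable because the prior is a simple N(0, I); just evaluate it at z.
prior = Normal(torch.zeros_like(z), torch.ones_like(z))
log_pz = prior.log_prob(z).sum(dim=1)

# log p_theta(x|z): tractable because the decoder is an explicit density model.
# log_px = log_px_given_z(x, z)           # hypothetical decoder log-likelihood

# log q_phi(z|x): no closed form exists for an implicit posterior, so the ELBO
# elbo = log_px + log_pz - log_qz   cannot be computed term by term as written.
```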
So, how do we optimize the ELBO if we can't evaluate logqϕ(z∣x)? Several strategies have emerged, with adversarial training being particularly prominent.
Density Ratio Estimation / Adversarial Approaches: The KL divergence, or parts of it, can often be rewritten or approximated using techniques reminiscent of Generative Adversarial Networks (GANs). The core idea is to train a discriminator (or critic) network Dψ(z) (or Dψ(z,x) if conditioned on x) to distinguish between samples from qϕ(z∣x) and samples from p(z). The objective for qϕ (specifically, the parameters of gϕ) is then to produce samples z that "fool" this discriminator.
For instance, the term KL(qϕ(z∣x)∣∣p(z)) can be estimated or optimized indirectly. If p(z) is a simple distribution, the main issue is the entropy term H(qϕ(z∣x))=−Ez∼qϕ(z∣x)[logqϕ(z∣x)]. Various f-divergences, including the KL divergence, can be expressed using a variational representation that involves a discriminator. For example, the Jensen-Shannon divergence, JS(qϕ∣∣p), which is minimized by the original GAN objective, is one such case.
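As a rough sketch of how this density-ratio trick looks in code (continuing with q and x from the earlier snippets; the critic architecture and helper names are made up for illustration): a critic Dψ(z,x) is trained with a logistic loss to separate posterior samples from prior samples. Because its optimal logit equals logqϕ(z∣x)−logp(z), averaging that logit over z∼qϕ(z∣x) gives an estimate of the KL term.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Critic D_psi(z, x): at the optimum its logit approximates log q_phi(z|x) - log p(z).
critic = nn.Sequential(nn.Linear(784 + 8, 256), nn.ReLU(), nn.Linear(256, 1))

def critic_loss(x, z_q, z_p):
    """Logistic loss: posterior samples are labeled 1, prior samples 0."""
    logit_q = critic(torch.cat([x, z_q], dim=1))
    logit_p = critic(torch.cat([x, z_p], dim=1))
    return (F.binary_cross_entropy_with_logits(logit_q, torch.ones_like(logit_q)) +
            F.binary_cross_entropy_with_logits(logit_p, torch.zeros_like(logit_p)))

def kl_estimate(x, z_q):
    """KL(q_phi(z|x) || p(z)) ~ E_{z~q_phi}[critic logit], assuming a well-trained critic."""
    return critic(torch.cat([x, z_q], dim=1)).mean()

# Critic update uses detached latents so its gradients do not reach g_phi.
z_q = q(x).detach()
z_p = torch.randn_like(z_q)               # samples from a simple prior N(0, I)
loss_D = critic_loss(x, z_q, z_p)
```

In a full training loop the two roles alternate: the critic is updated on detached samples, while the encoder minimizes the reconstruction loss plus kl_estimate(x, q(x)) so that gradients flow back through gϕ.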
The Adversarial Variational Bayes (AVB) framework, which we will discuss in the next section, provides a specific mechanism: it trains a separate network T(x,z) to approximate the log density ratio logqϕ(z∣x)−logp(z), which in turn yields an estimate of the KL term. The key point is that qϕ(z∣x) is implicitly defined through z=gϕ(ϵ,x), and its parameters are updated based on gradients flowing from both the reconstruction term and the adversarial term approximating the KL divergence.
Kernel Density Estimation (KDE): One could, in principle, draw many samples from qϕ(z∣x) and use KDE to estimate the density qϕ(z∣x) at any point z. Then, logqϕ(z∣x) could be approximated. However, KDE suffers severely from the curse of dimensionality and requires a very large number of samples, making it impractical for typical latent space dimensions in VAEs.
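For completeness, here is a small sketch of what the KDE route would look like, using scipy.stats.gaussian_kde on stand-in samples. It works mechanically, but the estimate degrades rapidly as the latent dimension grows, and a fresh KDE would be needed for every input x at every training step.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# Pretend these are samples z ~ q_phi(z|x) for one fixed x (here just standard normal noise).
z_dim, n_samples = 8, 5000
samples = rng.standard_normal((z_dim, n_samples))   # gaussian_kde expects shape (dim, n)

kde = gaussian_kde(samples)                          # bandwidth chosen by Scott's rule
z_query = rng.standard_normal((z_dim, 1))
log_q_approx = kde.logpdf(z_query)                   # KDE estimate of log q_phi(z|x) at z_query

# This value could in principle plug into the ELBO, but its accuracy deteriorates
# quickly with z_dim and with any shortage of samples.
print(log_q_approx)
```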
Likelihood-free Inference Methods: The broader field of likelihood-free inference (also known as Approximate Bayesian Computation, ABC) deals with situations where the likelihood (or in our case, the posterior density) is intractable but sampling is possible. Some techniques from this field can inspire approaches for VAEs with implicit posteriors, often involving comparing summary statistics of data generated from different models.
The following diagram illustrates how an implicit posterior might be integrated into a VAE, using an adversarial approach to handle the KL divergence term.
Information flow in a VAE with an implicit posterior qϕ(z∣x) and an adversarial mechanism for the KL divergence. The sampler gϕ generates latent codes z, which are used for reconstruction. These samples, along with samples from a prior p(z), are fed to a discriminator Dψ to compute an adversarial loss that helps shape qϕ(z∣x).
The same philosophy can be applied to the prior distribution p(z). Instead of a fixed, simple prior like N(0,I), one might want to learn a more complex prior, perhaps one that is itself an implicit model. Adversarial Autoencoders (AAEs), which we'll touch upon in Chapter 7, often use an adversarial loss to match the aggregated posterior q(z)=Epdata(x)[qϕ(z∣x)] to a chosen prior p(z), and this p(z) could be implicitly defined by samples from another generator. If both qϕ(z∣x) and p(z) are implicit, then estimating the KL(qϕ(z∣x)∣∣p(z)) term relies entirely on adversarial or density-ratio estimation techniques.
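Below is a minimal sketch of that aggregated-posterior matching (the prior sampler, discriminator, and sizes are all hypothetical). Because the objective only ever touches samples, it works unchanged when the prior itself is just a sampling procedure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def sample_prior(n, z_dim=8):
    # Any sample-only prior works here; this bimodal sampler stands in for an implicit p(z).
    return torch.randn(n, z_dim) + torch.randint(0, 2, (n, 1)).float() * 4.0

disc = nn.Sequential(nn.Linear(8, 128), nn.ReLU(), nn.Linear(128, 1))

def aae_regularizer(z_agg):
    """Adversarial loss matching the aggregated posterior q(z) to p(z).
    z_agg: latents obtained by sampling x ~ p_data(x) and then z ~ q_phi(z|x)."""
    z_p = sample_prior(z_agg.size(0))
    logit_p = disc(z_p)
    logit_q_d = disc(z_agg.detach())
    # Discriminator loss: prior samples labeled 1, aggregated-posterior samples labeled 0.
    d_loss = (F.binary_cross_entropy_with_logits(logit_p, torch.ones_like(logit_p)) +
              F.binary_cross_entropy_with_logits(logit_q_d, torch.zeros_like(logit_q_d)))
    # Generator (encoder) loss: make aggregated-posterior samples look like prior samples.
    logit_q = disc(z_agg)
    g_loss = F.binary_cross_entropy_with_logits(logit_q, torch.ones_like(logit_q))
    return d_loss, g_loss

d_loss, g_loss = aae_regularizer(torch.randn(16, 8))  # stand-in aggregated-posterior batch
```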
Advantages:

- qϕ(z∣x) can take essentially arbitrary shapes (multimodal, skewed, strongly correlated), moving far beyond diagonal Gaussians and potentially much closer to the true posterior pθ(z∣x).
- Sampling stays cheap: drawing z is a single forward pass through gϕ(ϵ,x).
- The same machinery extends to the prior p(z), allowing richer learned structure in the latent space.

Downsides:

- logqϕ(z∣x) cannot be evaluated, so the KL term (and hence the ELBO) must be approximated rather than computed exactly.
- The usual remedy, adversarial density-ratio estimation, adds an extra network and brings GAN-style training instability and hyperparameter sensitivity.
- The quality of the bound depends on how well the discriminator or critic tracks the density ratio; a poor estimate biases the gradients used to update ϕ.
Variational inference with implicit models opens up a rich avenue for enhancing VAEs. By defining qϕ(z∣x) (and potentially p(z)) via a sampling procedure z=gϕ(ϵ,x), we can model distributions far more complex than standard Gaussians. The main hurdle, the intractability of logqϕ(z∣x), is typically addressed using adversarial training schemes. The next section on Adversarial Variational Bayes (AVB) will provide a more concrete example of how these ideas are put into practice to create more powerful VAEs. These methods represent a significant step towards bridging the gap between the tractable but limited posteriors of basic VAEs and the highly flexible but often unstructured latent spaces of models like GANs.