As we've discussed, the expressiveness of the approximate posterior qϕ(z∣x) is a critical factor in the performance of Variational Autoencoders. Standard VAEs often employ simple, explicit distributions for qϕ(z∣x), like a diagonal Gaussian, primarily because this choice makes the Kullback-Leibler (KL) divergence term DKL(qϕ(z∣x)∣∣p(z)) in the Evidence Lower Bound (ELBO) analytically tractable. However, this simplicity can severely limit how well qϕ(z∣x) can model the true, often complex, posterior pθ(z∣x). Adversarial Variational Bayes (AVB) offers a sophisticated approach to overcome this limitation by enabling the use of highly flexible, implicit approximate posteriors.
To dramatically increase the flexibility of the approximate posterior, we can define it implicitly. Instead of specifying qϕ(z∣x) with an explicit probability density function, we define a sampling procedure:
$$z = g_\phi(x, \epsilon)$$
where gϕ is a neural network (the encoder) parameterized by ϕ, x is the input data, and ϵ is a noise variable sampled from a simple distribution, like a standard Normal distribution p(ϵ). This construction allows qϕ(z∣x) to represent virtually any complex distribution.
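As a concrete sketch of this construction, the following PyTorch snippet (illustrative only; the layer sizes and the `data_dim`, `noise_dim`, and `latent_dim` values are arbitrary choices, not taken from the text) builds an encoder that concatenates x with Gaussian noise ϵ and maps the pair to z. Samples from qϕ(z∣x) are cheap to draw, but no density function is ever written down.

```python
import torch
import torch.nn as nn

class ImplicitEncoder(nn.Module):
    """Implicit posterior: z = g_phi(x, eps) with eps ~ N(0, I).

    We can sample z given x, but there is no expression for q_phi(z|x).
    """
    def __init__(self, data_dim=784, noise_dim=32, latent_dim=8):
        super().__init__()
        self.noise_dim = noise_dim
        self.net = nn.Sequential(
            nn.Linear(data_dim + noise_dim, 256),
            nn.ReLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, x):
        eps = torch.randn(x.shape[0], self.noise_dim, device=x.device)
        return self.net(torch.cat([x, eps], dim=1))  # z = g_phi(x, eps)

encoder = ImplicitEncoder()
x = torch.randn(4, 784)  # stand-in batch of flattened inputs
z_samples = torch.stack([encoder(x) for _ in range(10)])  # 10 draws from q_phi(z|x) per x
```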
The hurdle with such an implicit qϕ(z∣x) is that its density, and therefore log qϕ(z∣x), is generally unknown and intractable to compute. This is a problem because the ELBO, which we aim to maximize, is:
$$\mathcal{L}_{\text{ELBO}}(x) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - D_{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big)$$
The KL divergence term expands to:
$$D_{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log q_\phi(z \mid x) - \log p(z)\big]$$
Without access to log qϕ(z∣x), we cannot directly compute or optimize this KL divergence.
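The contrast below illustrates the problem (a minimal sketch; the Gaussian parameters are placeholder tensors, not outputs of a real encoder): with the usual diagonal-Gaussian posterior, log qϕ(z∣x) is a one-line closed-form computation, whereas the implicit encoder above offers nothing comparable to call.

```python
import torch
from torch.distributions import Normal

# Explicit posterior (standard VAE): log q_phi(z|x) is available in closed form.
mu, log_var = torch.zeros(4, 8), torch.zeros(4, 8)  # placeholder encoder outputs
q = Normal(mu, (0.5 * log_var).exp())
z = q.rsample()
log_q = q.log_prob(z).sum(dim=1)  # can be plugged straight into the KL / ELBO

# Implicit posterior: we can compute z = g_phi(x, eps), but there is no
# log_prob we can evaluate -- this is the quantity AVB's critic must supply.
```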
Adversarial Variational Bayes, as introduced by Mescheder, Nowozin, and Geiger (2017), provides an elegant way to handle this intractability. The core idea is to train an auxiliary neural network, often called a critic or discriminator Tψ(x,z) (parameterized by ψ), to approximate the intractable log-density logqϕ(z∣x).
The critic network Tψ(x,z) is not a typical GAN discriminator trying to distinguish real data from generated data. Instead, it receives a data point x together with a latent code z and is trained to tell whether that code was produced by the implicit encoder, z=gϕ(x,ϵ), or drawn from the prior p(z). Following Mescheder et al. (2017), this is achieved by maximizing the logistic objective:
$$\mathcal{L}_T(\psi) = \mathbb{E}_{x \sim p_{\text{data}}(x)}\Big[\mathbb{E}_{\epsilon \sim p(\epsilon)}\big[\log \sigma\big(T_\psi(x, g_\phi(x,\epsilon))\big)\big] + \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - \sigma(T_\psi(x,z))\big)\big]\Big]$$
where σ denotes the logistic sigmoid. When this objective is maximized with respect to ψ (with ϕ held fixed), the optimal critic Tψ∗(x,z) satisfies:
$$T_{\psi^*}(x,z) = \log q_\phi(z \mid x) - \log p(z)$$
This is the standard result for a logistic discriminator: at the optimum, σ(Tψ∗(x,z)) equals qϕ(z∣x)/(qϕ(z∣x)+p(z)). The intractable log qϕ(z∣x) is therefore available as Tψ(x,z)+log p(z), where log p(z) is easy to evaluate because the prior is an explicit density. Any remaining constant offset in the critic's output only shifts the ELBO estimate; it does not affect the gradients with respect to ϕ or θ.
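A possible critic update, written against the `ImplicitEncoder` sketch above, is shown below (a sketch, not a reference implementation: the `Critic` architecture, dimensions, and a standard Normal prior are assumptions). Maximizing the logistic objective is implemented as minimizing binary cross-entropy with label 1 for codes from qϕ(z∣x) and label 0 for codes from the prior.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Critic(nn.Module):
    """T_psi(x, z): maps an (x, z) pair to a single real-valued score."""
    def __init__(self, data_dim=784, latent_dim=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(data_dim + latent_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, x, z):
        return self.net(torch.cat([x, z], dim=1)).squeeze(-1)

def critic_step(critic, encoder, x, critic_opt):
    """One update of psi: classify z ~ q_phi(z|x) against z ~ p(z)."""
    z_q = encoder(x).detach()        # posterior samples; phi is not updated here
    z_p = torch.randn_like(z_q)      # prior samples, assuming p(z) = N(0, I)
    logits_q = critic(x, z_q)
    logits_p = critic(x, z_p)
    # Minimizing this BCE is equivalent to maximizing
    # E[log sigma(T)] on posterior samples + E[log(1 - sigma(T))] on prior samples.
    loss = (F.binary_cross_entropy_with_logits(logits_q, torch.ones_like(logits_q))
            + F.binary_cross_entropy_with_logits(logits_p, torch.zeros_like(logits_p)))
    critic_opt.zero_grad()
    loss.backward()
    critic_opt.step()
    return loss.item()
```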
Once the critic is trained, Tψ(x,z)+log p(z) ≈ log qϕ(z∣x), and we can substitute this estimate into the ELBO. The objective for the VAE's encoder parameters ϕ (which define gϕ) and decoder parameters θ (which define pθ(x∣z)) is to maximize:
$$\mathcal{L}_{\text{AVB}}(\theta,\phi) = \mathbb{E}_{x \sim p_{\text{data}}(x)}\Big[\mathbb{E}_{\epsilon \sim p(\epsilon)}\big[\log p_\theta\big(x \mid g_\phi(x,\epsilon)\big) + \log p\big(g_\phi(x,\epsilon)\big) - \big(T_\psi(x, g_\phi(x,\epsilon)) + \log p(g_\phi(x,\epsilon))\big)\big]\Big]$$
Here z=gϕ(x,ϵ), and p(gϕ(x,ϵ)) is the prior density evaluated at the generated latent sample. Because the critic already accounts for log p(z), the prior terms cancel and the objective simplifies to:
$$\mathcal{L}_{\text{AVB}}(\theta,\phi) = \mathbb{E}_{x \sim p_{\text{data}}(x)}\Big[\mathbb{E}_{\epsilon \sim p(\epsilon)}\big[\log p_\theta\big(x \mid g_\phi(x,\epsilon)\big) - T_\psi\big(x, g_\phi(x,\epsilon)\big)\big]\Big]$$
When updating θ and ϕ, the critic's parameters ψ are held fixed; Mescheder et al. show that the expected gradient contribution from the critic's implicit dependence on ϕ vanishes at the optimal critic, which justifies treating Tψ as a fixed function during this step.
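The corresponding update for θ and ϕ could look like the sketch below. It assumes a `decoder` module that maps z to Bernoulli logits over x (one such module is defined in the training-loop sketch further down) and inputs scaled to [0,1]; the Bernoulli likelihood is just one common choice for log pθ(x∣z). Note that `vae_opt` holds only the encoder and decoder parameters, so the critic's weights are untouched even though gradients flow through its output.

```python
import torch.nn.functional as F

def vae_step(encoder, decoder, critic, x, vae_opt):
    """One update of (theta, phi): maximize E[log p_theta(x|z) - T_psi(x, z)]."""
    z = encoder(x)                      # z = g_phi(x, eps); keeps gradients w.r.t. phi
    recon_logits = decoder(z)           # logits of p_theta(x|z)
    log_px_given_z = -F.binary_cross_entropy_with_logits(
        recon_logits, x, reduction="none"
    ).sum(dim=1)                        # Bernoulli log-likelihood of x under the decoder
    t = critic(x, z)                    # T_psi(x, z); psi is held fixed in this step
    loss = -(log_px_given_z - t).mean() # minimize the negative AVB objective
    vae_opt.zero_grad()
    loss.backward()                     # updates reach theta and phi only (see vae_opt)
    vae_opt.step()
    return loss.item()
```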
The overall training process for AVB is therefore a minimax (saddle-point) optimization problem that alternates between two updates:
1. Critic step: update ψ to maximize LT(ψ), with θ and ϕ held fixed, so that Tψ keeps tracking the current implicit posterior qϕ(z∣x).
2. VAE step: update θ and ϕ to maximize LAVB(θ,ϕ), with ψ held fixed, using the critic's output in place of the intractable log qϕ(z∣x) term.
These two steps are typically alternated throughout training, often with one critic update per VAE update; a minimal sketch of the full loop follows.
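Putting the pieces together, a minimal alternating loop might be wired up as follows (again a sketch: it reuses `ImplicitEncoder`, `Critic`, `critic_step`, and `vae_step` from the earlier snippets, and replaces a real data loader with random binary data; learning rates and sizes are arbitrary).

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """p_theta(x|z): maps a latent code to Bernoulli logits over the data dimensions."""
    def __init__(self, latent_dim=8, data_dim=784):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 256),
            nn.ReLU(),
            nn.Linear(256, data_dim),
        )

    def forward(self, z):
        return self.net(z)

encoder, decoder, critic = ImplicitEncoder(), Decoder(), Critic()
vae_opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=2e-4)

data = torch.rand(256, 784).round()  # stand-in for a dataset of binarized images

for step in range(200):
    x = data[torch.randint(0, data.shape[0], (32,))]
    c_loss = critic_step(critic, encoder, x, critic_opt)     # step 1: update psi
    v_loss = vae_step(encoder, decoder, critic, x, vae_opt)  # step 2: update theta, phi
```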
The following diagram illustrates the AVB architecture and flow:
Data flow and objective functions in Adversarial Variational Bayes. The critic Tψ(x,z) learns to estimate the log-density of the implicit posterior qϕ(z∣x), which is then used in the VAE's objective function. C represents the constant term.
It's important to clarify what "adversarial" means in the context of AVB. The critic is not adversarial in the same sense as a GAN discriminator, which tries to distinguish real data from generated data; in AVB it classifies latent codes, deciding whether a z paired with x came from the implicit posterior qϕ(z∣x) or from the prior p(z).
The "adversarial" aspect arises from this interplay: the VAE parameters (θ,ϕ) are updated to maximize an objective that depends on the critic Tψ, while Tψ is concurrently updated to better model the density defined by ϕ. This dynamic resembles a two-player game, forming the basis of the saddle-point optimization.
Employing AVB can lead to significant improvements in VAEs. Freed from simple explicit families such as the diagonal Gaussian, qϕ(z∣x) can match the true posterior pθ(z∣x) much more closely, which tightens the ELBO and typically translates into better density estimates and more informative latent representations.
While powerful, AVB also introduces some challenges. The saddle-point optimization inherits the instability of adversarial training, the critic adds parameters and computation, and the ELBO estimate is only as good as the critic's approximation: if Tψ lags behind the current encoder, the gradients used to update θ and ϕ become biased.
Despite these challenges, AVB represents a significant step towards more powerful and flexible variational inference in VAEs. By moving beyond simple, explicit posteriors, AVB allows VAEs to model more complex data distributions and learn richer latent representations, pushing the boundaries of what can be achieved with these generative models. It is a prime example of how sophisticated inference techniques can unlock new capabilities in VAEs.