The expressiveness of the approximate posterior qϕ(z∣x) is a critical factor in the performance of Variational Autoencoders. Standard VAEs often employ simple, explicit distributions for qϕ(z∣x), like a diagonal Gaussian, primarily because this choice makes the Kullback-Leibler (KL) divergence term DKL(qϕ(z∣x)∣∣p(z)) in the Evidence Lower Bound (ELBO) analytically tractable. However, this simplicity can severely limit how well qϕ(z∣x) can model the true, often complex, posterior pθ(z∣x). Adversarial Variational Bayes (AVB) offers a sophisticated approach to overcome this limitation by enabling the use of highly flexible, implicit approximate posteriors.
The Challenge: Implicit Posteriors and Intractable Densities
To dramatically increase the flexibility of the approximate posterior, we can define it implicitly. Instead of specifying qϕ(z∣x) with an explicit probability density function, we define a sampling procedure:
$$z = g_\phi(x, \epsilon)$$
where gϕ is a neural network (the encoder) parameterized by ϕ, x is the input data, and ϵ is a noise variable sampled from a simple distribution, like a standard Normal distribution p(ϵ). This construction allows qϕ(z∣x) to represent virtually any complex distribution.
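To make this concrete, here is a minimal PyTorch sketch of one possible implicit encoder gϕ(x,ϵ). The class name, layer sizes, and dimensions are illustrative assumptions rather than anything prescribed by AVB; the only essential property is that the network consumes both x and noise ϵ and outputs a latent sample, never an explicit density.

```python
import torch
import torch.nn as nn

class ImplicitEncoder(nn.Module):
    """Implicit posterior q_phi(z|x): draws z = g_phi(x, eps) with eps ~ N(0, I).

    The network only defines a sampling procedure; no density q_phi(z|x)
    is ever written down. All sizes below are illustrative.
    """
    def __init__(self, x_dim=784, noise_dim=32, z_dim=8, hidden=256):
        super().__init__()
        self.noise_dim = noise_dim
        self.net = nn.Sequential(
            nn.Linear(x_dim + noise_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, z_dim),
        )

    def forward(self, x):
        eps = torch.randn(x.shape[0], self.noise_dim, device=x.device)
        return self.net(torch.cat([x, eps], dim=1))   # z = g_phi(x, eps)

x = torch.rand(16, 784)       # a dummy batch of flattened images
z = ImplicitEncoder()(x)      # latent samples; log q_phi(z|x) is not available
```

Because the noise enters the network nonlinearly alongside x, the resulting qϕ(z∣x) is not restricted to any named family of distributions.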
The hurdle with such an implicit qϕ(z∣x) is that its density, and hence logqϕ(z∣x), is generally unknown and intractable to evaluate. This is a problem because the ELBO, which we aim to maximize, is:
$$\mathcal{L}_{\text{ELBO}}(x) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big)$$
The KL divergence term expands to:
$$D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log q_\phi(z \mid x) - \log p(z)\big]$$
Without access to logqϕ(z∣x), we cannot directly compute or optimize this KL divergence.
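For contrast, the snippet below shows the closed-form KL term a standard Gaussian-posterior VAE would use (a textbook formula, not specific to AVB). No analogous expression exists for the implicit encoder sketched above, which is exactly the gap AVB addresses.

```python
import torch

def gaussian_kl(mu, logvar):
    """Closed-form D_KL(N(mu, diag(exp(logvar))) || N(0, I)) per data point.

    Available only because the diagonal-Gaussian q_phi(z|x) has an explicit
    density; an implicit q_phi(z|x) offers no such formula.
    """
    return 0.5 * torch.sum(logvar.exp() + mu.pow(2) - 1.0 - logvar, dim=1)

mu, logvar = torch.zeros(16, 8), torch.zeros(16, 8)
print(gaussian_kl(mu, logvar))   # all zeros: here q_phi(z|x) equals the prior
```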
Adversarial Variational Bayes: Learning the Log-Posterior
Adversarial Variational Bayes, as introduced by Mescheder, Nowozin, and Geiger (2017), provides an elegant way to handle this intractability. The core idea is to train an auxiliary neural network, often called a critic or discriminator Tψ(x,z) (parameterized by ψ), to approximate the intractable quantity inside the KL term: the log-density ratio logqϕ(z∣x)−logp(z).
The Critic Network Tψ(x,z)
The critic network Tψ(x,z) is not a GAN discriminator that separates real data from generated data. Instead, it is trained so that, at the latent samples produced by the implicit encoder gϕ(x,ϵ), its output tracks how the approximate posterior density compares to the prior density. One way to achieve this is to minimize the following objective with respect to ψ, where the first inner expectation uses latent samples drawn from the prior p(z) and the second uses samples from the encoder:
$$\mathcal{L}_T(\psi) = \mathbb{E}_{x \sim p_{\text{data}}(x)}\Big[\, \mathbb{E}_{z \sim p(z)}\big[\exp\!\big(T_\psi(x, z)\big)\big] - \mathbb{E}_{\epsilon \sim p(\epsilon)}\big[T_\psi\big(x, g_\phi(x, \epsilon)\big)\big] \Big]$$
It can be shown that when this objective LT(ψ) is minimized, the optimal critic Tψ∗(x,z) satisfies:
$$T_\psi^*(x, z) = \log q_\phi(z \mid x) - \log p(z)$$
This is exactly the log-density ratio that appears inside the KL term, so at the optimum Eqϕ(z∣x)[Tψ∗(x,z)] = DKL(qϕ(z∣x)∣∣p(z)). (Mescheder et al. reach the same optimum by training Tψ as a logistic-regression classifier that distinguishes pairs (x,z) with z∼qϕ(z∣x) from pairs with z∼p(z).) When the VAE parameters are updated, the critic is treated as a fixed function; the authors show that at the optimum the expected gradient of Tψ∗ with respect to ϕ vanishes, so this substitution does not bias the ELBO gradient.
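A sketch of a critic network and this training objective follows, reusing the hypothetical ImplicitEncoder dimensions from the earlier snippet. The architecture and names are assumptions; only the loss, which contrasts prior samples against encoder samples, follows the objective above.

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """T_psi(x, z): maps an (x, z) pair to a single scalar."""
    def __init__(self, x_dim=784, z_dim=8, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + z_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x, z):
        return self.net(torch.cat([x, z], dim=1)).squeeze(1)

def critic_loss(critic, x, z_post):
    """L_T(psi) = E_{z ~ p(z)}[exp T(x, z)] - E_{z ~ q(z|x)}[T(x, z)].

    Minimising this pushes T toward log q_phi(z|x) - log p(z). z_post is
    detached because this step should only update the critic's parameters.
    """
    z_prior = torch.randn_like(z_post)            # samples from the prior N(0, I)
    t_prior = critic(x, z_prior)
    t_post = critic(x, z_post.detach())
    # In practice T is often clamped before exp() to keep this term stable.
    return (torch.exp(t_prior) - t_post).mean()
```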
The AVB Objective for VAE Parameters
Once the critic provides the estimate Tψ(x,z)≈logqϕ(z∣x)−logp(z), we can substitute it into the ELBO formula. The objective for the VAE's encoder parameters ϕ (which define gϕ) and decoder parameters θ (which define pθ(x∣z)) is to maximize:
$$\mathcal{L}_{\text{AVB}}(\theta, \phi) = \mathbb{E}_{x \sim p_{\text{data}}(x)}\Big[\, \mathbb{E}_{\epsilon \sim p(\epsilon)}\big[\log p_\theta\big(x \mid g_\phi(x, \epsilon)\big) - T_\psi\big(x, g_\phi(x, \epsilon)\big)\big] \Big]$$
Here, z=gϕ(x,ϵ). Because Tψ(x,z) approximates logqϕ(z∣x)−logp(z), its expectation under qϕ(z∣x) approximates the KL term, and LAVB approximates the ELBO.
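Under the same assumptions as the earlier sketches, the VAE-side update minimizes the negative of LAVB: a reconstruction term plus the critic's output. The Bernoulli decoder and the function name here are hypothetical; only the structure of the loss mirrors the objective above.

```python
import torch.nn.functional as F

def avb_vae_loss(decoder, critic, x, z_post):
    """-L_AVB(theta, phi) = -E[log p_theta(x|z) - T_psi(x, z)], z = g_phi(x, eps).

    Gradients flow into the decoder and, through z_post, into the encoder.
    The critic's parameters are simply not stepped by this loss.
    """
    x_logits = decoder(z_post)                             # Bernoulli decoder over pixels
    rec_nll = F.binary_cross_entropy_with_logits(
        x_logits, x, reduction="none").sum(dim=1)          # -log p_theta(x|z)
    kl_est = critic(x, z_post)                             # approx. log q_phi(z|x) - log p(z)
    return (rec_nll + kl_est).mean()
```

When the critic is near its optimum, the kl_est term averages to an estimate of DKL(qϕ(z∣x)∣∣p(z)).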
Training Dynamics: A Minimax Game
The overall training process for AVB becomes a minimax (or saddle-point) optimization problem:
- Update Critic: Minimize LT(ψ) with respect to the critic's parameters ψ. This step drives Tψ(x,z) toward the log-density ratio logqϕ(z∣x)−logp(z).
- Update VAE: Maximize LAVB(θ,ϕ) with respect to the encoder parameters ϕ and decoder parameters θ. This step updates the VAE, using the current critic's output in place of the intractable logqϕ(z∣x)−logp(z) term.
These two steps are typically alternated during training.
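Putting the pieces together, one training iteration under the assumptions of the previous sketches could look like the following. The two optimizers, module names, and learning rates are all illustrative; the point is the alternation between the critic step and the VAE step.

```python
import torch
import torch.nn as nn

# Assumes ImplicitEncoder, Critic, critic_loss and avb_vae_loss from the
# earlier sketches are in scope; the decoder is an equally hypothetical MLP.
encoder, critic = ImplicitEncoder(), Critic()
decoder = nn.Sequential(nn.Linear(8, 256), nn.ReLU(), nn.Linear(256, 784))

opt_vae = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4)
opt_critic = torch.optim.Adam(critic.parameters(), lr=2e-4)

def train_step(x):
    # 1) Critic update: fit T_psi toward log q_phi(z|x) - log p(z).
    opt_critic.zero_grad()
    critic_loss(critic, x, encoder(x)).backward()
    opt_critic.step()

    # 2) VAE update: maximise L_AVB with the current critic as the KL estimate.
    opt_vae.zero_grad()
    avb_vae_loss(decoder, critic, x, encoder(x)).backward()
    opt_vae.step()

train_step(torch.rand(16, 784))   # one alternating step on a dummy batch
```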
The following diagram illustrates the AVB architecture and flow:
Data flow and objective functions in Adversarial Variational Bayes. The critic Tψ(x,z) learns to estimate the log-density ratio logqϕ(z∣x)−logp(z) for the implicit posterior, and this estimate stands in for the KL term in the VAE's objective.
The "Adversarial" Nature of AVB
It's important to clarify what "adversarial" means in the context of AVB. It's not adversarial in the same sense as Generative Adversarial Networks (GANs), where a discriminator tries to distinguish real data from generated data. In AVB:
- The encoder gϕ (which defines qϕ(z∣x)) generates latent samples z.
- The critic Tψ(x,z) tries to accurately track the log-density ratio logqϕ(z∣x)−logp(z) at these samples.
- As the encoder gϕ updates, the distribution qϕ(z∣x) changes, so the critic Tψ(x,z) must adapt to these changes.
The "adversarial" aspect arises from this interaction: the VAE parameters (θ,ϕ) are updated to maximize an objective that depends on the critic Tψ, while Tψ is concurrently updated to better model the density defined by ϕ. This dynamic resembles a two-player game, forming the basis of the saddle-point optimization.
Advantages of AVB
Employing AVB can lead to significant improvements in VAEs:
- Highly Flexible Posteriors: The primary benefit is the ability to use an implicit qϕ(z∣x) defined by a powerful neural network gϕ(x,ϵ). This allows the approximate posterior to capture complex structures, such as multimodality or non-Gaussian shapes, potentially matching the true posterior pθ(z∣x) much more closely.
- Direct ELBO Optimization (Approximation): AVB still optimizes an approximation of the ELBO. By using a more expressive qϕ(z∣x), AVB can achieve tighter ELBO values compared to VAEs with simpler, explicit posteriors, provided the critic's approximation of logqϕ(z∣x) is accurate.
- Improved Representations and Samples: A more accurate posterior approximation often translates to higher-quality learned latent representations and, consequently, better generative performance (e.g., sharper, more diverse generated samples x′).
Challenges
While powerful, AVB also introduces some challenges:
- Training Stability: Minimax optimization can be harder to stabilize than maximizing a single fixed objective. Careful tuning of learning rates and network architectures for both the VAE components and the critic is often necessary.
- Accuracy of the Critic: The effectiveness of AVB relies on the critic Tψ(x,z) providing a good estimate of logqϕ(z∣x). If the critic is poorly trained or lacks capacity, the approximation can be inaccurate, potentially hindering VAE training.
- Computational Cost: Training an additional neural network (the critic) and performing the minimax optimization adds to the computational overhead per training iteration compared to standard VAEs.
- Hyperparameter Sensitivity: The relationship between the VAE and the critic can introduce new hyperparameters that require careful tuning.
Despite these challenges, AVB represents a significant step towards more powerful and flexible variational inference in VAEs. By moving past simple, explicit posteriors, AVB allows VAEs to model more complex data distributions and learn richer latent representations, expanding what can be achieved with these generative models. It is a prime example of how sophisticated inference techniques can enable new capabilities in VAEs.