Standard amortized inference networks in VAEs, while efficient, often employ a simple distribution (like a diagonal Gaussian) for $q_\phi(z \mid x)$. This simplicity can be a bottleneck, preventing $q_\phi(z \mid x)$ from accurately approximating a potentially complex true posterior $p_\theta(z \mid x)$. Two strategies for building more expressive and accurate approximate posteriors are examined here: introducing auxiliary variables and employing semi-amortized inference schemes.
One way to enhance the flexibility of the approximate posterior without making its direct functional form overly complex is to introduce auxiliary random variables. These variables are not part of the original generative model but are used within the inference network to help shape a richer distribution for $z$.
Let's denote these auxiliary variables by $u$. Instead of defining $q_\phi(z \mid x)$ directly, we define a joint distribution $q_\phi(z, u \mid x)$ over both the original latent variables $z$ and these new auxiliary variables $u$. A common factorization for this joint distribution is hierarchical:

$$q_\phi(z, u \mid x) = q_\phi(u \mid x)\, q_\phi(z \mid x, u)$$
Here, $q_\phi(u \mid x)$ is an inference network that maps an input $x$ to the parameters of a distribution over $u$. Then, $q_\phi(z \mid x, u)$ is another inference network that maps $x$ and a sample $u \sim q_\phi(u \mid x)$ to the parameters of a distribution over $z$.
The resulting marginal distribution for $z$, $q_\phi(z \mid x) = \int q_\phi(u \mid x)\, q_\phi(z \mid x, u)\, du$, can be significantly more complex and flexible than if we had modeled $q_\phi(z \mid x)$ directly with a simple family (e.g., a single Gaussian). Think of it as using $u$ to "steer" or "refine" the inference for $z$. For example, $u$ could capture some high-level aspects of the posterior, and $q_\phi(z \mid x, u)$ could then model finer details conditioned on these aspects.
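As a concrete illustration, here is a minimal PyTorch-style sketch of such a hierarchical inference network. The module name, layer sizes, and diagonal Gaussian parameterization are illustrative assumptions, not a prescribed architecture:

```python
import torch
import torch.nn as nn

class AuxiliaryEncoder(nn.Module):
    """Hierarchical inference network q(u|x) q(z|x,u) with diagonal Gaussians."""
    def __init__(self, x_dim, u_dim, z_dim, hidden=256):
        super().__init__()
        # q(u|x): maps x to the mean and log-variance of the auxiliary variable u
        self.q_u = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * u_dim))
        # q(z|x,u): maps the concatenation of x and a sampled u to the parameters of z
        self.q_z = nn.Sequential(nn.Linear(x_dim + u_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * z_dim))

    @staticmethod
    def reparameterize(mu, logvar):
        # Standard reparameterization trick: sample = mu + sigma * eps
        return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

    def forward(self, x):
        # First sample the auxiliary variable u ~ q(u|x)
        mu_u, logvar_u = self.q_u(x).chunk(2, dim=-1)
        u = self.reparameterize(mu_u, logvar_u)
        # Then sample z ~ q(z|x,u), conditioned on both x and the sampled u
        mu_z, logvar_z = self.q_z(torch.cat([x, u], dim=-1)).chunk(2, dim=-1)
        z = self.reparameterize(mu_z, logvar_z)
        return z, (mu_z, logvar_z), u, (mu_u, logvar_u)
```

Because $z$ is conditioned on a random $u$, the marginal distribution of $z$ for a given $x$ is a continuous mixture of Gaussians rather than a single Gaussian.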
To incorporate this into the VAE framework, we adjust the Evidence Lower Bound (ELBO). We treat $u$ as additional latent variables and assume a simple prior for them (e.g., a standard Normal distribution, $p(u) = \mathcal{N}(0, I)$). The ELBO for this augmented system is:

$$\mathcal{L}_{\text{aux}}(x) = \mathbb{E}_{q_\phi(z, u \mid x)}\!\left[\log p_\theta(x \mid z)\right] - D_{\mathrm{KL}}\!\left(q_\phi(z, u \mid x) \,\|\, p(z)\,p(u)\right)$$
Note that the generative model $p_\theta(x \mid z)$ still only depends on $z$. The auxiliary variables are only "seen" by the inference machinery and the prior $p(u)$. Using our chosen factorization and assuming $p(z, u) = p(z)\,p(u)$ (i.e., $z$ and $u$ are independent in the prior), the KL divergence term can be decomposed:

$$D_{\mathrm{KL}}\!\left(q_\phi(z, u \mid x) \,\|\, p(z)\,p(u)\right) = \mathbb{E}_{q_\phi(u \mid x)}\!\left[ D_{\mathrm{KL}}\!\left(q_\phi(z \mid x, u) \,\|\, p(z)\right)\right] + D_{\mathrm{KL}}\!\left(q_\phi(u \mid x) \,\|\, p(u)\right)$$
So, the ELBO becomes:

$$\mathcal{L}_{\text{aux}}(x) = \mathbb{E}_{q_\phi(z, u \mid x)}\!\left[\log p_\theta(x \mid z)\right] - \mathbb{E}_{q_\phi(u \mid x)}\!\left[ D_{\mathrm{KL}}\!\left(q_\phi(z \mid x, u) \,\|\, p(z)\right)\right] - D_{\mathrm{KL}}\!\left(q_\phi(u \mid x) \,\|\, p(u)\right)$$
This expression looks like a VAE objective where $q_\phi(u \mid x)$ acts as an "encoder" for $u$, and then, conditioned on $u$, $q_\phi(z \mid x, u)$ acts as another "encoder" for $z$. The overall structure allows the marginal $q_\phi(z \mid x)$ to implicitly represent a continuous mixture of simpler distributions, yielding a much richer family for the approximate posterior.
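A minimal sketch of this augmented objective, assuming the illustrative `AuxiliaryEncoder` above, a decoder that returns Bernoulli logits over flattened binary inputs, and standard Normal priors for both $z$ and $u$ (all of these are assumptions for the example, not fixed choices):

```python
import torch
import torch.nn.functional as F

def gaussian_kl_to_std_normal(mu, logvar):
    # Analytic KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions
    return 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(dim=-1)

def auxiliary_elbo(x, encoder, decoder):
    """Single-sample Monte Carlo estimate of the auxiliary-variable ELBO for a batch x."""
    z, (mu_z, logvar_z), u, (mu_u, logvar_u) = encoder(x)
    # Reconstruction term E_q[ log p(x|z) ]: the decoder only sees z, never u
    logits = decoder(z)
    log_px_given_z = -F.binary_cross_entropy_with_logits(
        logits, x, reduction="none").sum(dim=-1)
    # KL of q(z|x,u) against p(z), estimated at the sampled u (one-sample estimate
    # of the expectation over q(u|x)), plus KL of q(u|x) against p(u)
    kl_z = gaussian_kl_to_std_normal(mu_z, logvar_z)
    kl_u = gaussian_kl_to_std_normal(mu_u, logvar_u)
    return (log_px_given_z - kl_z - kl_u).mean()
```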
Benefits:

- The marginal $q_\phi(z \mid x)$ can represent much richer shapes (for example, multimodal posteriors) than a single diagonal Gaussian, even though each conditional, $q_\phi(u \mid x)$ and $q_\phi(z \mid x, u)$, stays simple.
- The generative model $p_\theta(x \mid z)$ is unchanged; all of the added flexibility lives on the inference side.

Costs:

- A second inference network and an extra sampling step add parameters and computation.
- The objective carries the additional penalty $D_{\mathrm{KL}}(q_\phi(u \mid x) \,\|\, p(u))$, so the augmented ELBO is generally a looser bound than one evaluated with the (intractable) marginal $q_\phi(z \mid x)$ directly.
Models like Auxiliary Deep Generative Models (ADGMs) and some variants of hierarchical VAEs (when applied to the inference side) are examples of this approach. This technique is distinct from Normalizing Flows (which transform a simple noise distribution into a complex posterior using invertible functions) but can be complementary.
Amortized variational inference, where a single neural network directly outputs the parameters of the approximate posterior $q_\phi(z \mid x)$ for any given $x$, is computationally efficient. However, it makes a strong assumption: that a single set of network parameters $\phi$ can provide optimal (or near-optimal) variational parameters for all datapoints. This can lead to an "amortization gap": the difference in ELBO quality between what a fully amortized $q_\phi(z \mid x)$ can achieve and what could be achieved if we optimized the variational parameters for each datapoint individually.
Semi-amortized variational inference aims to close this gap. The core idea is to use the amortized inference network to provide a good initialization of the variational parameters for a specific datapoint $x$. These initial parameters are then refined through a few steps of optimization for that particular $x$, by directly maximizing the ELBO with respect to the instance-specific variational parameters.
Let $\lambda$ denote the parameters of the approximate posterior $q_\lambda(z)$ for a single datapoint $x$ (e.g., if $q_\lambda(z)$ is Gaussian, $\lambda = (\mu, \log \sigma^2)$). The process is as follows:

1. Initialization: compute $\lambda^{(0)}$ by passing $x$ through the amortized inference network.
2. Refinement: for $k = 1, \dots, K$, take gradient ascent steps on the ELBO for this specific $x$, $\lambda^{(k)} = \lambda^{(k-1)} + \alpha \nabla_{\lambda} \mathcal{L}(x; \lambda^{(k-1)})$.
3. Use the refined parameters $\lambda^{(K)}$ as the approximate posterior for $x$ (and, during training, to compute the ELBO used to update the model and inference network).
The following diagram illustrates this refinement process:
The semi-amortized inference process: An amortized network provides an initial estimate of posterior parameters, which are then refined through instance-specific optimization.
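The sketch below illustrates this refinement loop for a diagonal Gaussian posterior. It assumes an amortized encoder returning $(\mu, \log\sigma^2)$, a decoder returning Bernoulli logits, and a plain SGD inner loop; the step size, the number of steps, and the function names are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def instance_elbo(x, mu, logvar, decoder):
    # Single-sample ELBO estimate for one datapoint's variational parameters
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
    log_px = -F.binary_cross_entropy_with_logits(
        decoder(z), x, reduction="none").sum(dim=-1)
    kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(dim=-1)
    return (log_px - kl).sum()

def refine_posterior(x, encoder, decoder, num_steps=5, lr=1e-2):
    """Semi-amortized inference: amortized initialization + per-instance refinement."""
    # 1. Amortized initialization of the variational parameters
    mu, logvar = encoder(x)
    mu = mu.detach().requires_grad_(True)
    logvar = logvar.detach().requires_grad_(True)
    # 2. A few gradient ascent steps on the ELBO for this specific x
    #    (only mu and logvar are updated; the decoder weights are left untouched)
    opt = torch.optim.SGD([mu, logvar], lr=lr)
    for _ in range(num_steps):
        opt.zero_grad()
        loss = -instance_elbo(x, mu, logvar, decoder)  # minimize negative ELBO
        loss.backward()
        opt.step()
    # 3. Return the refined instance-specific parameters
    return mu.detach(), logvar.detach()
```

This shows only the refinement itself; in full semi-amortized training one typically also backpropagates through the refinement steps (or uses the refined ELBO) to update the encoder and decoder, which the sketch omits.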
Benefits:

- A tighter ELBO for each datapoint: the refinement directly shrinks the amortization gap described above.
- The amortized network only needs to provide a good starting point, not the exact optimum, for every $x$.

Costs:

- Inference is more expensive: each datapoint requires $K$ extra gradient steps, during training and, if refinement is also used there, at test time.
- Training becomes more involved if gradients are propagated through the refinement steps to update the inference network.
This approach is particularly useful when the true posterior varies substantially across different datapoints, making it difficult for a single amortized network to perform well universally. The number of refinement steps $K$ is a hyperparameter; even a small number of steps can often yield substantial improvements.
Auxiliary variables and semi-amortized inference are not mutually exclusive. One could, for instance, define an expressive posterior family using auxiliary variables and then use semi-amortized inference to fine-tune the parameters of this richer posterior for each data point.
When deciding whether to use these advanced inference techniques, consider the following:

- How large the amortization gap or posterior mismatch appears to be for your model and data, i.e., whether a simple diagonal Gaussian posterior is clearly the limiting factor.
- The available computational budget, since both techniques add parameters, sampling steps, or inner optimization loops.
- How important inference speed is at test time; semi-amortized refinement in particular makes encoding slower.
In practice, starting with a well-tuned standard VAE and then exploring these techniques can be a good strategy if further improvements in posterior approximation are needed. The choice depends on the specific application, available computational resources, and the desired trade-off between model performance and inference speed. Both methods offer valuable tools for pushing the boundaries of what VAEs can achieve by enabling more accurate and flexible posterior inference.