Several methods exist to enhance the inference capabilities of Variational Autoencoders. The quality of the approximate posterior $q_\phi(z|x)$ significantly impacts VAE performance. While amortized inference is efficient, techniques like Importance Weighted Autoencoders (IWAEs) offer a path to a tighter Evidence Lower Bound (ELBO) and potentially more accurate posterior approximations. This practical demonstrates how to implement an IWAE and explore its effects, and closes with considerations for other advanced inference strategies.

## Understanding the IWAE Objective

Recall that the standard VAE maximizes the ELBO:

$$ \mathcal{L}_{\text{ELBO}}(x) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}(q_\phi(z|x) \,\|\, p(z)) $$

The IWAE, introduced by Burda et al. (2015), provides a tighter lower bound on the log marginal likelihood $\log p_\theta(x)$ by using multiple samples from the approximate posterior $q_\phi(z|x)$ for each data point $x$. The IWAE objective, often denoted $\mathcal{L}_K$, is:

$$ \mathcal{L}_K(x) = \mathbb{E}_{z_1, \dots, z_K \sim q_\phi(z|x)} \left[ \log \left( \frac{1}{K} \sum_{k=1}^K \frac{p_\theta(x, z_k)}{q_\phi(z_k|x)} \right) \right] $$

where $z_1, \dots, z_K$ are $K$ independent samples drawn from $q_\phi(z|x)$. This can be rewritten in a form that is more practical for implementation:

$$ \mathcal{L}_K(x) = \mathbb{E}_{z_1, \dots, z_K \sim q_\phi(z|x)} \left[ \log \left( \frac{1}{K} \sum_{k=1}^K \exp\big(\log p_\theta(x|z_k) + \log p(z_k) - \log q_\phi(z_k|x)\big) \right) \right] $$

Notice that when $K=1$, $\mathcal{L}_1$ is exactly the standard ELBO; in practice the only difference from a typical VAE implementation is that the KL term is estimated from the drawn sample rather than computed in closed form. As $K \to \infty$, $\mathcal{L}_K \to \log p_\theta(x)$.

## Implementing an IWAE

Let's outline the steps to modify a standard VAE implementation (which you might have from the practical in Chapter 2) into an IWAE. We'll assume you have an encoder that outputs parameters (e.g., mean $\mu_\phi(x)$ and log-variance $\log \sigma^2_\phi(x)$) for $q_\phi(z|x)$, and a decoder that models $p_\theta(x|z)$.

### 1. Sampling Multiple Latent Variables

For each input $x$ in a mini-batch, instead of drawing one sample $z$ from $q_\phi(z|x)$, you draw $K$ samples. If your encoder outputs $\mu_\phi(x)$ and $\sigma_\phi(x)$ for a Gaussian $q_\phi(z|x)$, the reparameterization trick is applied $K$ times: $z_k = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon_k$, where $\epsilon_k \sim \mathcal{N}(0, I)$ for $k=1, \dots, K$.

In terms of tensor operations (e.g., in PyTorch or TensorFlow):

- Your encoder outputs `mu` and `logvar` of shape `(batch_size, latent_dim)`.
- Expand these to `(batch_size, K, latent_dim)`, or process them in a way that allows for $K$ samples per input.
- Generate `eps` of shape `(batch_size, K, latent_dim)`.
- Compute `z_samples` of shape `(batch_size, K, latent_dim)`.

```python
# Pseudocode/PyTorch-like illustration
import torch

# mu, logvar shapes: (batch_size, latent_dim)
# K: number of importance samples

# Expand mu and logvar for K samples
mu_expanded = mu.unsqueeze(1).expand(-1, K, -1)          # (batch_size, K, latent_dim)
logvar_expanded = logvar.unsqueeze(1).expand(-1, K, -1)  # (batch_size, K, latent_dim)
std_expanded = torch.exp(0.5 * logvar_expanded)

# Sample epsilon
epsilon = torch.randn_like(std_expanded)                 # (batch_size, K, latent_dim)

# Generate K latent samples per input
z_samples = mu_expanded + std_expanded * epsilon         # (batch_size, K, latent_dim)
```
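The manual expand-and-sample code above works as-is. As an aside, the same sampling can be packaged with `torch.distributions`, which also provides the log-densities you will need in the next step. The snippet below is a minimal sketch under the same assumptions (diagonal Gaussian posterior; `mu`, `logvar`, and `K` are the same placeholders as in the block above), not the only way to arrange the dimensions.

```python
import torch
from torch.distributions import Normal

# Sketch: mu, logvar, K are placeholders as in the previous block.
q = Normal(mu, torch.exp(0.5 * logvar))      # batch of independent Gaussians, q_phi(z|x)
z_k_first = q.rsample((K,))                  # (K, batch_size, latent_dim), reparameterized
log_q = q.log_prob(z_k_first).sum(dim=-1)    # (K, batch_size); reused in step 2
z_samples = z_k_first.permute(1, 0, 2)       # (batch_size, K, latent_dim), layout used below
log_q_z_given_x = log_q.t()                  # (batch_size, K)
```

Drawing with the $K$ dimension first keeps `log_prob` broadcasting against the `(batch_size, latent_dim)` batch shape straightforward; permute afterwards if you prefer the `(batch_size, K, latent_dim)` layout used in the rest of this section.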
### 2. Calculating Per-Sample Log Weights

For each sample $z_k$, we need to compute its unnormalized log importance weight, $w'_k$:

$$ \log w'_k = \log p_\theta(x|z_k) + \log p(z_k) - \log q_\phi(z_k|x) $$

- $\log p_\theta(x|z_k)$: The reconstruction log-likelihood. This involves passing each $z_k$ through the decoder to get the parameters of $p_\theta(x|z_k)$ and then computing the log-probability of $x$. If $x$ is reshaped for the decoder, ensure it aligns with the $K$ samples. For example, if $x$ has shape `(batch_size, data_dim)`, it may need to be expanded to `(batch_size, K, data_dim)` to match `z_samples` when computing the reconstruction term for each sample.
- $\log p(z_k)$: The log-prior probability of $z_k$. Typically $p(z) = \mathcal{N}(0, I)$, so this is straightforward to compute.
- $\log q_\phi(z_k|x)$: The log-probability of $z_k$ under the approximate posterior $q_\phi(z|x)$.

These components yield a tensor of log weights, say `log_w_prime`, of shape `(batch_size, K)`.

### 3. Averaging with Log-Sum-Exp

The IWAE objective involves $\log \big( \frac{1}{K} \sum_k \exp(\log w'_k) \big)$. Directly summing exponentials can lead to numerical underflow or overflow. The log-sum-exp (LSE) trick is essential here:

$$ \log \left( \sum_{k=1}^K \exp(a_k) \right) = \alpha + \log \left( \sum_{k=1}^K \exp(a_k - \alpha) \right) $$

where $\alpha = \max_k a_k$. The IWAE objective for a single data point $x$ is then:

$$ \mathcal{L}_K(x) = \text{LSE}(\log w'_1, \dots, \log w'_K) - \log K $$

The final loss for the mini-batch is the negated average of $\mathcal{L}_K(x)$ over all $x$ in the batch.

```python
# Pseudocode/PyTorch-like illustration for the IWAE loss term
import torch

# log_p_x_given_z: (batch_size, K), reconstruction log-likelihood for each sample
# log_p_z:         (batch_size, K), prior log-prob for each sample
# log_q_z_given_x: (batch_size, K), approximate posterior log-prob for each sample
log_w_prime = log_p_x_given_z + log_p_z - log_q_z_given_x  # (batch_size, K)

# Log-sum-exp for numerical stability
log_sum_exp_w = torch.logsumexp(log_w_prime, dim=1)        # (batch_size,)

# IWAE objective for each data point
iwae_elbo_per_sample = log_sum_exp_w - torch.log(torch.tensor(K, dtype=torch.float32))

# Average over batch
batch_iwae_elbo = torch.mean(iwae_elbo_per_sample)

# The loss to minimize is -batch_iwae_elbo
loss = -batch_iwae_elbo
```

**A Note on Dimensions:** Carefully manage tensor dimensions. When you pass $z_k$ (shape `(batch_size, K, latent_dim)`) to the decoder, it will typically process it as `(batch_size * K, latent_dim)`. The output reconstructions will then be `(batch_size * K, data_dim)`. You'll need to expand and reshape $x$ to match this when computing $\log p_\theta(x|z_k)$, and then reshape the resulting log-likelihoods back to `(batch_size, K)`.
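Putting the three steps together, the sketch below shows one way to wire them into a single loss function, including the reshaping described in the note above. It assumes a Bernoulli likelihood over flattened binary data and hypothetical `encoder`/`decoder` modules with the interfaces described earlier (the encoder returns `mu, logvar`; the decoder maps latents to per-pixel logits); adapt the likelihood and shapes to your own model.

```python
import math

import torch
import torch.nn.functional as F
from torch.distributions import Normal


def iwae_loss(encoder, decoder, x, K):
    """Negative IWAE bound for a mini-batch.

    Assumed (hypothetical) interfaces, adjust to your model:
      encoder(x) -> (mu, logvar), each of shape (batch_size, latent_dim)
      decoder(z) -> logits of p_theta(x|z), shape (batch_size * K, data_dim)
      x: binary data of shape (batch_size, data_dim), Bernoulli likelihood
    """
    batch_size, data_dim = x.shape
    mu, logvar = encoder(x)
    latent_dim = mu.shape[-1]

    # Step 1: K reparameterized samples per input
    q = Normal(mu, torch.exp(0.5 * logvar))
    z = q.rsample((K,))                                   # (K, batch_size, latent_dim)

    # Step 2: per-sample log densities under q_phi(z|x) and the prior p(z)
    log_q_z_given_x = q.log_prob(z).sum(-1).t()           # (batch_size, K)
    prior = Normal(torch.zeros_like(mu), torch.ones_like(mu))
    log_p_z = prior.log_prob(z).sum(-1).t()               # (batch_size, K)

    # Decode: flatten the K samples into one big batch, then restore (batch, K)
    z_flat = z.permute(1, 0, 2).reshape(batch_size * K, latent_dim)
    logits = decoder(z_flat)                              # (batch_size * K, data_dim)
    x_rep = x.unsqueeze(1).expand(-1, K, -1).reshape(batch_size * K, data_dim)
    log_p_x_given_z = -F.binary_cross_entropy_with_logits(
        logits, x_rep, reduction="none"
    ).sum(-1).view(batch_size, K)                         # (batch_size, K)

    # Step 3: log-mean-exp of the importance weights
    log_w = log_p_x_given_z + log_p_z - log_q_z_given_x   # (batch_size, K)
    iwae_bound = torch.logsumexp(log_w, dim=1) - math.log(K)
    return -iwae_bound.mean()
```

Note that only the decoder pass grows with $K$ here; the encoder still runs once per data point, so memory in the decode step is usually the binding constraint as you increase $K$.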
## Experimentation and Analysis

Once you have your IWAE implemented, it's time to experiment:

- **Vary $K$:** Train IWAE models with different values of $K$ (e.g., $K=1, 5, 10, 50$).
- **Standard VAE as baseline:** The $K=1$ case is your standard VAE objective; as mentioned, the only practical difference is that the KL term is estimated from the sample rather than computed in closed form, so it serves directly as the baseline.
- **Monitor the bound:** Plot the reported $\mathcal{L}_K$ value on a validation set. You should observe that $\mathcal{L}_K$ generally increases with $K$, indicating a tighter bound.
- **Reconstruction Quality:** Visually inspect reconstructions. Are they sharper with higher $K$? Quantify with MSE if appropriate, but be mindful that the IWAE optimizes a bound on $\log p(x)$, not reconstruction error directly.
- **Sample Quality:** Generate samples from the prior $p(z)$ and decode them. Does the quality of generated samples improve with $K$?
- **Training Time:** Note the increase in training time per epoch as $K$ increases. This is a direct trade-off.

[Figure: "IWAE Performance vs. K". Validation bound $\mathcal{L}_K$ (left axis) and training time in seconds per epoch (right axis), plotted against the number of importance samples $K$.]

The plot above illustrates a typical trend: as $K$ increases, the IWAE bound ($\mathcal{L}_K$) improves (becomes less negative), but the computational cost per epoch also rises.

- **Active Units:** If you are also interested in disentanglement or representation learning (covered in Chapter 5), investigate whether using IWAE (with $K > 1$) affects the number of "active units" in the latent space compared to a standard VAE. Sometimes a tighter bound can prevent premature KL vanishing.
- **Posterior Collapse:** For models prone to posterior collapse (where $q_\phi(z|x)$ becomes very similar to $p(z)$, making the latent variables uninformative), does IWAE with $K > 1$ help mitigate the issue? The more accurate gradients provided by IWAE may offer better optimization paths.

## Advanced Inference Techniques

While IWAE focuses on improving the bound via multiple samples, other techniques modify the structure of $q_\phi(z|x)$ or the inference process itself:

- **Structured Variational Inference:**
  - *Normalizing Flows:* Implementing a VAE with normalizing flows in the encoder involves transforming simple Gaussian samples into more complex distributions. You insert flow layers (e.g., planar flows, radial flows, or more advanced autoregressive flows like MAF or IAF) after the initial sample from $\mathcal{N}(\mu_\phi(x), \sigma^2_\phi(x))$. The main bookkeeping is the log-determinant of the Jacobian of each transformation, which is subtracted from $\log q_\phi(z_0|x)$ to obtain $\log q_\phi(z_T|x)$ (where $z_0$ is the base sample and $z_T$ is the transformed sample); see the sketch after this list.
  - *Autoregressive Posteriors:* Instead of a diagonal-covariance Gaussian, use $q_\phi(z|x) = \prod_i q_\phi(z_i \mid z_{<i}, x)$. This requires an autoregressive network (such as an RNN or a masked autoencoder) for the encoder.
- **Auxiliary Variables (e.g., Semi-Amortized VI, Hierarchical VAEs):**
  - Introduce auxiliary variables $u$ to augment the latent space, e.g., $q_\phi(z, u \mid x) = q_\phi(u|x)\, q_\phi(z \mid x, u)$. This often leads to more expressive posteriors. Implementation involves designing inference networks for the auxiliary variables and modifying the ELBO to account for them.
  - For semi-amortized VI, you might perform a few steps of optimization (e.g., SGD) to refine $z$ for each data point, starting from an amortized proposal. This is computationally more intensive but can yield very accurate posteriors.
- **Adversarial Variational Bayes (AVB):**
  - AVB replaces the KL divergence term in the ELBO with an adversarial discriminator that tries to distinguish samples from $q_\phi(z|x)$ from samples from the prior $p(z)$. The encoder is then trained to fool this discriminator. This requires setting up a GAN-like training loop within your VAE framework.
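As a concrete starting point for the normalizing-flows option above, here is a minimal sketch of a single planar flow layer (Rezende & Mohamed, 2015). The class name and the invertibility handling are illustrative choices, not a fixed API; stacking several such layers and accumulating their log-determinants gives the corrected $\log q_\phi(z_T|x)$.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PlanarFlow(nn.Module):
    """One planar flow layer: f(z) = z + u * tanh(w^T z + b).

    Illustrative sketch. forward() returns the transformed sample and
    log|det df/dz|, which is subtracted from log q(z_0|x) when accumulating
    the density of the transformed sample.
    """

    def __init__(self, latent_dim):
        super().__init__()
        self.u = nn.Parameter(0.01 * torch.randn(latent_dim))
        self.w = nn.Parameter(0.01 * torch.randn(latent_dim))
        self.b = nn.Parameter(torch.zeros(1))

    def forward(self, z):                      # z: (batch, latent_dim)
        # Constrain u so that w^T u_hat >= -1, keeping the map invertible
        wu = torch.dot(self.w, self.u)
        u_hat = self.u + (F.softplus(wu) - 1.0 - wu) * self.w / (self.w.pow(2).sum() + 1e-8)

        lin = z @ self.w + self.b              # (batch,)
        f_z = z + u_hat * torch.tanh(lin).unsqueeze(-1)

        # log|det J| = log|1 + u_hat^T psi(z)|, with psi(z) = (1 - tanh^2(lin)) * w
        psi = (1.0 - torch.tanh(lin) ** 2).unsqueeze(-1) * self.w   # (batch, latent_dim)
        log_det = torch.log(torch.abs(1.0 + psi @ u_hat) + 1e-8)    # (batch,)
        return f_z, log_det
```

Usage follows the description above: draw $z_0$ with the reparameterization trick, pass it through a stack of such layers, and use $\log q_\phi(z_0|x)$ minus the summed log-determinants wherever $\log q_\phi(z|x)$ appears in the objectives of this section.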
Implementing these advanced techniques often requires a deeper exploration of probabilistic modeling and careful network architecture design. The common thread is capturing more complex posterior structures or obtaining better estimates of the model evidence.

## Moving Forward

This practical exercise with IWAEs should provide a solid foundation for understanding how to improve VAE inference. By tightening the ELBO, IWAEs can lead to better generative models and more faithful latent representations. The increased computational cost is a trade-off, but a moderate $K$ (e.g., 5-10) often offers a good balance.

As you progress, consider how the principle behind the IWAE (multiple samples for better estimates) or the structural enhancements of the other advanced methods could be combined or adapted for your specific VAE applications. The ability to critically assess and improve the inference mechanism is a hallmark of advanced VAE development.