In previous chapters, we explored standard autoencoders and their regularized forms (like Denoising or Sparse AEs). Their primary objective is efficient data compression and reconstruction. The encoder network, $E_\phi$, maps an input $x$ to a deterministic latent code $z = E_\phi(x)$, and the decoder network, $D_\theta$, attempts to reconstruct the original input as $\hat{x} = D_\theta(z)$. Training minimizes a reconstruction loss, ensuring $\hat{x}$ is close to $x$.
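To ground this, here is a minimal sketch of such a deterministic autoencoder in PyTorch. The architecture and layer sizes are illustrative choices for this example, not a specific model from earlier chapters:

```python
import torch
import torch.nn as nn

# Minimal sketch of a deterministic autoencoder: the encoder maps x to a
# single latent code z, and the decoder maps z back to a reconstruction x_hat.
class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),               # z = E_phi(x)
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid(),  # x_hat = D_theta(z)
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z)

# Training minimizes only a reconstruction loss, e.g. mean squared error.
model = Autoencoder()
x = torch.rand(16, 784)   # dummy batch standing in for real data
loss = nn.functional.mse_loss(model(x), x)
loss.backward()
```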
While effective for dimensionality reduction and feature learning, this deterministic setup presents significant challenges when our goal shifts to generating new data samples that resemble the training distribution. How would we generate a novel image, for example? The most direct idea is to:

1. Train a standard autoencoder on the dataset.
2. Pick some latent vector $z$.
3. Pass $z$ through the decoder $D_\theta$ to produce a new sample.
The problem lies in step 2: how do we choose a "good" $z$? Can we simply pick a random vector $z$ and expect a realistic output? Usually, the answer is no. Standard autoencoders, optimized solely for reconstruction, often produce latent spaces that are not conducive to generation.
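To see how hit-or-miss this is in practice, here is a rough sketch of the naive procedure, continuing the illustrative Autoencoder above and assuming it has already been trained on real data:

```python
import torch

# Naive generation with a trained deterministic autoencoder: draw a random
# latent vector and hope the decoder turns it into something realistic.
# Nothing in the training objective guarantees this.
latent_dim = 32
z_random = torch.randn(1, latent_dim)      # an arbitrary point in latent space

with torch.no_grad():
    x_generated = model.decoder(z_random)  # frequently an incoherent output
```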
Here are the main reasons why:
The encoder $E_\phi$ learns to map training inputs $x_i$ to specific points $z_i$ in the latent space. The decoder $D_\theta$ learns to map these specific $z_i$ back to $\hat{x}_i \approx x_i$. However, the autoencoder objective function doesn't explicitly require the latent space to be continuous or smooth. This means:
The distribution of encoded points $\{z_i = E_\phi(x_i)\}$ in the latent space can be quite irregular. It might consist of disconnected clusters or follow a complex, sparse manifold structure. There can be large "holes" or gaps between regions corresponding to valid data representations.
The latent space of a standard autoencoder often contains clusters corresponding to training data, but the regions between these clusters (like the red crosses) may not decode to meaningful outputs.
If we randomly sample a z that falls into one of these gaps (like the red 'x' markers in the conceptual plot above), the decoder, having never been trained on inputs from that region, has no incentive to produce a coherent output.
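One way to observe these gaps empirically is to encode the training set and check how far randomly drawn latent vectors fall from any encoded point. The sketch below reuses the illustrative model from earlier and assumes a tensor of training inputs named `train_x`:

```python
import torch

# Rough probe of latent-space coverage: encode the training data, then
# measure how far random latent vectors are from the nearest encoded code.
# Large distances suggest random samples land in regions the decoder
# has never been trained on.
with torch.no_grad():
    train_codes = model.encoder(train_x)        # train_x: (N, 784), assumed given
    z_random = torch.randn(100, train_codes.shape[1])
    dists = torch.cdist(z_random, train_codes)  # pairwise Euclidean distances
    nearest = dists.min(dim=1).values           # distance to closest training code
    print(f"mean distance to nearest encoded point: {nearest.mean().item():.3f}")
```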
Standard autoencoders lack a probabilistic foundation for the latent space. The encoder provides a single point estimate $z$ for each input $x$. There's no mechanism to model the distribution of likely latent codes $p(z)$ or the conditional distribution $p(z \mid x)$. Without a well-defined distribution to sample from, generating new data becomes a hit-or-miss process. We don't know where in the latent space the valid codes reside, other than the locations encoded from the training data itself.
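In symbols, the gap looks like this: a standard autoencoder only provides the deterministic mappings on the first line, whereas generation needs the distributions on the second. This is a schematic summary, not a formal derivation:

```latex
% What a standard autoencoder provides: point estimates only.
z = E_\phi(x), \qquad \hat{x} = D_\theta(z)

% What generation requires: distributions we can sample from.
z \sim p(z), \qquad x \sim p(x \mid z), \qquad \text{and a model of } p(z \mid x)
```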
Regularization techniques like sparsity or denoising modify the latent space properties somewhat, but they don't fundamentally address this generative limitation. Sparse AEs encourage most latent units to be inactive, and Denoising AEs learn to map corrupted inputs to cleaner reconstructions, which may smooth the representation slightly. Neither, however, imposes the kind of probabilistic structure needed for reliable sampling and generation.
To overcome these limitations and build effective generative models, we need a different approach. We require a way to:

1. Model latent variables as distributions rather than single point estimates.
2. Impose a known, structured distribution on the latent space so that we can sample from it reliably.
3. Train the encoder and decoder so that sampled latent codes decode to realistic outputs while reconstruction quality is preserved.
This is precisely what Variational Autoencoders (VAEs) are designed to do. They introduce a probabilistic perspective to both the encoder and the decoder, explicitly modeling the latent variables as distributions and optimizing an objective function that encourages both good reconstruction and a structured latent space suitable for generation. We will explore this probabilistic framework in the following sections.
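As a brief preview, the sketch below shows the general shape of a VAE-style forward pass and loss (illustrative code with arbitrary layer sizes, not the exact formulation we will derive): the encoder outputs the parameters of a distribution, a latent vector is sampled from it via reparameterization, and the loss adds a KL term that pulls that distribution toward a standard Gaussian prior.

```python
import torch
import torch.nn as nn

# Illustrative VAE-style forward pass and loss (a preview; details come later).
input_dim, latent_dim = 784, 32
encoder = nn.Linear(input_dim, 2 * latent_dim)   # outputs [mu, log_var]
decoder = nn.Linear(latent_dim, input_dim)

x = torch.rand(16, input_dim)                    # dummy batch for illustration
mu, log_var = encoder(x).chunk(2, dim=1)         # parameters of q(z|x)
z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)   # reparameterized sample
x_hat = torch.sigmoid(decoder(z))

recon = nn.functional.mse_loss(x_hat, x)
# KL divergence between the Gaussian q(z|x) and a standard Gaussian prior p(z).
kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp(), dim=1).mean()
loss = recon + kl
```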