In the autoencoders we've encountered so far, the encoder takes an input x and deterministically maps it to a single point z in the latent space. This point z is a compressed representation of x. Variational Autoencoders take a different path. Instead of encoding an input as a single point, the VAE encoder describes a probability distribution in the latent space for each input.
You might wonder, why a distribution? By learning a distribution for each input, VAEs can create a more continuous and structured latent space. This means that points close to each other in the latent space will correspond to similar-looking data when decoded. This property is particularly useful for generating new data samples, as we'll see later in this chapter. The encoder's job is to learn the parameters of this distribution based on the input data.
To define this probability distribution, the VAE encoder outputs a set of parameters. For simplicity and common practice, VAEs typically assume that the learned distribution for each input x is a multivariate Gaussian distribution (also known as a normal distribution). A Gaussian distribution is characterized by two main parameters: a mean μ, which locates the center of the distribution, and a variance σ² (or equivalently a standard deviation σ), which describes its spread along each dimension.
So, for a given input x, the encoder network doesn't just output a single latent vector z. Instead, it outputs two separate vectors: a mean vector μ(x) and a log-variance vector log(σ²(x)).
Each of these vectors will have a dimensionality equal to the desired size of our latent space. For instance, if we aim for a 32-dimensional latent space, the encoder will output a 32-dimensional μ(x) vector and a 32-dimensional log(σ²(x)) vector.
You might be asking, "Why output log(σ²) instead of σ² directly?" This is a good question, and there are a couple of important reasons for this choice. First, a variance must be strictly positive; if the network predicted σ² directly, its output would have to be constrained, whereas log(σ²) can be any real number, and the variance is recovered as σ² = exp(log(σ²)), which is always positive. Second, working in log space is numerically more stable: very small variances become large negative log values rather than numbers vanishingly close to zero, which keeps training better behaved.
Similarly, the standard deviation σ can be calculated from the log-variance as σ = exp(0.5 · log(σ²)), or equivalently σ = √(exp(log(σ²))).
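As a quick sanity check, these conversions look like the following minimal sketch in PyTorch (the framework and the random tensor are illustrative assumptions; the tensor stands in for a real encoder output):

```python
import torch

# Hypothetical log-variance output from the encoder for a batch of 4 inputs
# and a 32-dimensional latent space.
log_var = torch.randn(4, 32)

var = torch.exp(log_var)          # sigma^2 = exp(log(sigma^2)), guaranteed positive
sigma = torch.exp(0.5 * log_var)  # sigma = exp(0.5 * log(sigma^2))

# The two ways of computing sigma agree up to floating-point error.
assert torch.allclose(sigma, torch.sqrt(var))
```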
This requirement means the encoder part of a VAE will have a slightly different architecture at its output stage compared to a standard autoencoder. Typically, the main body of the encoder network (which might consist of dense layers, convolutional layers, etc.) processes the input x and transforms it into some intermediate, high-level representation. This shared intermediate representation is then fed into two separate output layers (often called "heads"): one head that outputs the mean vector μ(x), and another that outputs the log-variance vector log(σ²(x)).
These final layers for μ and log(σ²) usually do not have an activation function applied to their outputs (or they use a linear activation, which is equivalent to no activation). This is because the mean μ can take any real value, and the log-variance log(σ²) can also take any real value.
Here's a diagram illustrating this branching structure:
The VAE encoder processes the input through shared layers, then splits into two heads. One head predicts the mean vector μ(x) and the other predicts the log-variance vector log(σ²(x)) for the latent distribution associated with input x.
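To make this branching structure concrete, here is a minimal sketch of such an encoder written in PyTorch. The class name, layer sizes, and the single dense layer used for the shared body are illustrative assumptions, not a prescribed architecture:

```python
import torch
import torch.nn as nn

class GaussianEncoder(nn.Module):
    """Maps an input x to the parameters mu(x) and log(sigma^2(x)) of q(z|x)."""

    def __init__(self, input_dim: int = 784, hidden_dim: int = 256, latent_dim: int = 32):
        super().__init__()
        # Shared body: produces an intermediate representation of x.
        self.body = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
        )
        # Two separate heads with linear (i.e. no) activation, since both the
        # mean and the log-variance can take any real value.
        self.mu_head = nn.Linear(hidden_dim, latent_dim)
        self.log_var_head = nn.Linear(hidden_dim, latent_dim)

    def forward(self, x: torch.Tensor):
        h = self.body(x)
        return self.mu_head(h), self.log_var_head(h)

encoder = GaussianEncoder()
x = torch.randn(8, 784)          # a batch of 8 flattened inputs, e.g. 28x28 images
mu, log_var = encoder(x)
print(mu.shape, log_var.shape)   # both torch.Size([8, 32])
```

Note that both heads are plain linear layers, matching the point above that neither μ nor log(σ²) needs a constraining activation.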
For every input data point x that we feed into our VAE, the encoder doesn't just give us one compressed representation. It effectively tells us: "For this input x, the corresponding representation in the latent space should be sampled from a Gaussian distribution that is centered at μ(x) and has a variance of σ²(x) along each latent dimension."
More formally, the encoder learns to map each input x to a specific conditional probability distribution q(z∣x). This distribution is assumed to be a Gaussian:

q(z∣x) = N(z; μ(x), diag(σ²(x)))

Here, N(z; μ, Σ) denotes a multivariate Gaussian distribution over z with mean μ and covariance matrix Σ. The term diag(σ²(x)) indicates that we are typically working with a diagonal covariance matrix. This means the individual dimensions of the latent space are assumed to be conditionally independent given the input x, simplifying the model. Each diagonal element of this matrix is one of the σᵢ²(x) values from the variance vector σ²(x).
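If it helps to see q(z∣x) as a concrete object, the same diagonal-covariance Gaussian can be built from the encoder outputs using torch.distributions (a sketch with placeholder tensors standing in for μ(x) and log(σ²(x))):

```python
import torch
from torch.distributions import Independent, Normal

# Stand-ins for the encoder outputs mu(x) and log(sigma^2(x)) for a batch of 8 inputs.
mu = torch.zeros(8, 32)
log_var = torch.zeros(8, 32)
sigma = torch.exp(0.5 * log_var)

# A Normal over each latent dimension, with the 32 dimensions wrapped together
# so they form a single 32-dimensional Gaussian with diagonal covariance.
q_z_given_x = Independent(Normal(mu, sigma), reinterpreted_batch_ndims=1)

print(q_z_given_x.batch_shape)  # torch.Size([8])  -> one distribution per input
print(q_z_given_x.event_shape)  # torch.Size([32]) -> each over a 32-dim latent vector
```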
Once the encoder has produced these parameters μ(x) and log(σ²(x)), the next logical step in the VAE's forward pass is to sample a latent vector z from this defined distribution N(μ(x), σ²(x)). This sampled z is then what gets passed to the VAE's decoder, which will attempt to reconstruct the original input (or generate a new sample) from z.
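In code, this naive forward pass could be sketched as follows (PyTorch again, with placeholder tensors in place of real encoder outputs); the final comment hints at the training difficulty discussed next:

```python
import torch
from torch.distributions import Normal

# Stand-ins for encoder outputs that would normally depend on x.
mu = torch.zeros(8, 32, requires_grad=True)
log_var = torch.zeros(8, 32, requires_grad=True)
sigma = torch.exp(0.5 * log_var)

# Naively draw z from N(mu(x), sigma^2(x)); this z would be passed to the decoder.
z = Normal(mu, sigma).sample()

# .sample() is a plain random draw performed outside the autograd graph,
# so no gradient can flow from z back to mu or log_var.
print(z.requires_grad)  # False
```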
However, the act of sampling is inherently a random process. A significant challenge arises: how do we backpropagate gradients through this random sampling step during training? This is where a clever mathematical sleight of hand known as the "reparameterization trick" comes into play. We'll explore this essential component of VAEs in the very next section, as it's what makes end-to-end training of these probabilistic encoders possible.