Autoencoders typically take an input $x$ and deterministically map it to a single point in the latent space. This point is a compressed representation of $x$. Variational Autoencoders take a different path. Instead of encoding an input as a single point, the VAE encoder describes a probability distribution in the latent space for each input.
You might wonder, why a distribution? By learning a distribution for each input, VAEs can create a more continuous and structured latent space. This means that points close to each other in the latent space will correspond to similar-looking data when decoded. This property is particularly useful for generating new data samples, as we'll see later in this chapter. The encoder's job is to learn the parameters of this distribution based on the input data.
To define this probability distribution, the VAE encoder outputs a set of parameters. For simplicity and common practice, VAEs typically assume that the learned distribution for each input is a multivariate Gaussian distribution (also known as a normal distribution). A Gaussian distribution is characterized by two main parameters: a mean, $\mu$, which locates the center of the distribution, and a variance, $\sigma^2$, which measures its spread.
So, for a given input $x$, the encoder network doesn't just output a single latent vector $z$. Instead, it outputs two separate vectors: a mean vector $\mu$ and a log-variance vector $\log \sigma^2$.
Each of these vectors will have a dimensionality equal to the desired size of our latent space. For instance, if we aim for a 32-dimensional latent space, the encoder will output a 32-dimensional $\mu$ vector and a 32-dimensional $\log \sigma^2$ vector.
You might be asking, "Why output $\log \sigma^2$ instead of $\sigma^2$ directly?" This is a good question, and there are a couple of important reasons for this choice. First, a variance must be non-negative, so predicting $\sigma^2$ directly would require constraining the network's output, whereas $\log \sigma^2$ can be any real number and needs no such constraint. Second, working in log space is numerically more stable, since variances can span many orders of magnitude. The variance is easily recovered as $\sigma^2 = \exp(\log \sigma^2)$. Similarly, the standard deviation can be calculated from the log-variance as $\sigma = \exp\left(\tfrac{1}{2} \log \sigma^2\right)$ or $\sigma = \sqrt{\exp(\log \sigma^2)}$.
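As a quick check, here is a minimal sketch (PyTorch, with arbitrary example values) showing that both formulas recover the same standard deviation from a log-variance vector:

```python
import torch

# An arbitrary log-variance vector, as the encoder might output for one input.
log_var = torch.tensor([-1.2, 0.0, 0.7])

# sigma = exp(0.5 * log_var)
sigma_a = torch.exp(0.5 * log_var)

# sigma = sqrt(exp(log_var)), mathematically equivalent
sigma_b = torch.sqrt(torch.exp(log_var))

print(sigma_a)                           # roughly tensor([0.5488, 1.0000, 1.4191])
print(torch.allclose(sigma_a, sigma_b))  # True
```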
This requirement means the encoder part of a VAE will have a slightly different architecture at its output stage compared to a standard autoencoder. Typically, the main body of the encoder network (which might consist of dense layers, convolutional layers, etc.) processes the input and transforms it into some intermediate, high-level representation. This shared intermediate representation is then fed into two separate output layers (often called "heads"): one that produces the mean vector $\mu$ and one that produces the log-variance vector $\log \sigma^2$.
These final layers for $\mu$ and $\log \sigma^2$ usually do not have an activation function applied to their outputs (or they use a linear activation, which is equivalent to no activation). This is because the mean can take any real value, and the log-variance can also take any real value.
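To make this branching concrete, below is a minimal sketch of such an encoder in PyTorch. The specific sizes (a 784-dimensional flattened input, 256 hidden units, a 32-dimensional latent space) and the single hidden layer are illustrative assumptions, not requirements:

```python
import torch
import torch.nn as nn

class VAEEncoder(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=256, latent_dim=32):
        super().__init__()
        # Shared body: maps the input to an intermediate representation.
        self.body = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
        )
        # Two heads with no (i.e., linear) activation: mean and log-variance.
        self.mu_head = nn.Linear(hidden_dim, latent_dim)
        self.log_var_head = nn.Linear(hidden_dim, latent_dim)

    def forward(self, x):
        h = self.body(x)
        mu = self.mu_head(h)             # can be any real value
        log_var = self.log_var_head(h)   # can be any real value
        return mu, log_var

# Quick shape check with a batch of 16 flattened 28x28 inputs.
encoder = VAEEncoder()
mu, log_var = encoder(torch.randn(16, 784))
print(mu.shape, log_var.shape)  # torch.Size([16, 32]) torch.Size([16, 32])
```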
Here's a diagram illustrating this branching structure:
The VAE encoder processes the input $x$ through shared layers, then splits into two heads. One head predicts the mean vector $\mu$ and the other predicts the log-variance vector $\log \sigma^2$ for the latent distribution associated with input $x$.
For every input data point $x$ that we feed into our VAE, the encoder doesn't just give us one compressed representation. It effectively tells us: "For this input $x$, the corresponding representation $z$ in the latent space should be sampled from a Gaussian distribution that is centered at $\mu(x)$ and has a variance given by $\sigma^2(x)$ along each latent dimension."
More formally, the encoder learns to map each input $x$ to a specific conditional probability distribution $q(z \mid x)$. This distribution is assumed to be a Gaussian:

$$q(z \mid x) = \mathcal{N}\big(z;\ \mu(x),\ \mathrm{diag}(\sigma^2(x))\big)$$

Here, $\mathcal{N}(z; \mu, \Sigma)$ denotes a multivariate Gaussian distribution over $z$ with mean $\mu$ and covariance matrix $\Sigma$. The term $\mathrm{diag}(\sigma^2(x))$ indicates that we are typically working with a diagonal covariance matrix. This means the individual dimensions of the latent space are assumed to be conditionally independent given the input $x$, simplifying the model. Each diagonal element of this matrix is one of the values from the variance vector $\sigma^2(x)$.
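To see this distribution as a concrete object, the sketch below builds the diagonal Gaussian $q(z \mid x)$ from example encoder outputs using torch.distributions and checks that its log-density factorizes across dimensions, which is what conditional independence means here. The particular numbers are arbitrary:

```python
import torch
from torch.distributions import Normal, Independent

# Suppose the encoder produced these for one input x (latent dimension 4 here).
mu = torch.tensor([0.3, -1.1, 0.0, 2.0])
log_var = torch.tensor([-0.5, 0.2, 0.0, -1.0])
sigma = torch.exp(0.5 * log_var)

# A diagonal-covariance Gaussian over z: an independent Normal per latent dimension.
q_z_given_x = Independent(Normal(mu, sigma), 1)

z = q_z_given_x.sample()
# Because the covariance is diagonal, the joint log-density is the sum of
# the per-dimension Gaussian log-densities.
per_dim = Normal(mu, sigma).log_prob(z)
print(torch.allclose(q_z_given_x.log_prob(z), per_dim.sum()))  # True
```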
Once the encoder has produced these parameters $\mu(x)$ and $\sigma^2(x)$, the next logical step in the VAE's forward pass is to sample a latent vector $z$ from this defined distribution $q(z \mid x)$. This sampled $z$ is then what gets passed to the VAE's decoder, which will attempt to reconstruct the original input $x$ (or generate a new sample) from $z$.
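Putting these pieces together, a naive forward pass might look like the following sketch. It reuses the VAEEncoder from the earlier example and adds a hypothetical decoder whose sizes simply mirror the encoder's; both are illustrative assumptions rather than a prescribed architecture:

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

encoder = VAEEncoder()  # the two-headed encoder sketched earlier

# A hypothetical decoder that mirrors the encoder's illustrative sizes.
decoder = nn.Sequential(
    nn.Linear(32, 256),
    nn.ReLU(),
    nn.Linear(256, 784),
    nn.Sigmoid(),  # pixel-like outputs in [0, 1]
)

x = torch.rand(16, 784)         # a batch of flattened inputs
mu, log_var = encoder(x)
sigma = torch.exp(0.5 * log_var)

z = Normal(mu, sigma).sample()  # random draw; .sample() does not carry gradients
x_hat = decoder(z)              # attempted reconstruction of x
print(x_hat.shape)              # torch.Size([16, 784])
```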
However, the act of sampling is inherently a random process. A significant challenge arises: how do we backpropagate gradients through this random sampling step during training? This is where a clever mathematical sleight of hand known as the "reparameterization trick" comes into play. We'll explore this essential component of VAEs in the very next section, as it's what makes end-to-end training of these probabilistic encoders possible.