In our exploration of Variational Autoencoders, we've established that the ELBO, $\mathcal{L}_{\text{ELBO}}$, serves as our objective function. This objective involves an expectation over the approximate posterior $q_\phi(z|x)$:
$$\mathcal{L}_{\text{ELBO}} = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}(q_\phi(z|x) \,\|\, p(z))$$

To optimize this objective using gradient-based methods like stochastic gradient descent (SGD), we need to compute gradients with respect to the parameters of both the encoder ($\phi$) and the decoder ($\theta$). The KL divergence term, $D_{KL}(q_\phi(z|x) \,\|\, p(z))$, can often be computed analytically (as we'll see for Gaussian distributions), and its gradients with respect to $\phi$ derived.
However, the reconstruction term, $\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]$, presents a challenge. Specifically, how do we backpropagate gradients through the sampling process $z \sim q_\phi(z|x)$ to update $\phi$? Sampling from $q_\phi(z|x)$ is a stochastic operation, and naively, it creates a non-differentiable point in our computation graph. This is where the reparameterization trick comes into play.
Imagine the encoder network outputs the parameters of the distribution $q_\phi(z|x)$, for instance, the mean $\mu_\phi(x)$ and standard deviation $\sigma_\phi(x)$ if $q_\phi(z|x)$ is Gaussian. Then, we sample $z$ from this distribution. The decoder takes this $z$ and tries to reconstruct $x$.
If we try to backpropagate the reconstruction loss, the gradient needs to pass from the decoder, through $z$, and then to the parameters $\mu_\phi(x)$ and $\sigma_\phi(x)$ of the encoder. The sampling step $z \sim q_\phi(z|x)$ is problematic because drawing a random sample is not directly differentiable with respect to the parameters of the distribution we are sampling from. The path for gradient flow is effectively cut.
Consider a simplified view: $x \rightarrow \text{Encoder}(\phi) \rightarrow \text{parameters of } q_\phi(z|x) \rightarrow \text{sample } z \rightarrow \text{Decoder}(\theta) \rightarrow \log p_\theta(x|z)$. The gradient $\nabla_\phi \log p_\theta(x|z)$ is what we need for the reconstruction term, but the sampling step blocks it.
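To make the blockage concrete, here is a minimal PyTorch sketch. The tensors `mu` and `sigma` are hypothetical stand-ins for encoder outputs, and the squared norm stands in for the reconstruction term:

```python
import torch

# Hypothetical encoder outputs for a single input x.
mu = torch.tensor([0.5, -0.3], requires_grad=True)
sigma = torch.tensor([1.0, 0.8], requires_grad=True)

# Naive sampling: .sample() draws z ~ N(mu, sigma) but detaches the result
# from the computation graph, so there is no path back to mu or sigma.
z = torch.distributions.Normal(mu, sigma).sample()
print(z.requires_grad)  # False: the stochastic node cut the graph

# A stand-in for the reconstruction term log p(x|z):
loss = z.pow(2).sum()
# loss.backward() would raise a RuntimeError here, because loss does not
# depend on any tensor that requires gradients.
```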
The reparameterization trick provides an elegant solution by reframing the sampling process. Instead of sampling $z$ directly from $q_\phi(z|x)$, we introduce an auxiliary noise variable $\epsilon$ sampled from a simple, fixed distribution $p(\epsilon)$ (e.g., a standard normal distribution $\mathcal{N}(0, I)$). Then, we express $z$ as a deterministic function $g_\phi(x, \epsilon)$ that transforms this noise $\epsilon$ using the parameters from the encoder.
So, $z = g_\phi(x, \epsilon)$, where $\epsilon \sim p(\epsilon)$. The key is that $g_\phi(x, \epsilon)$ is a deterministic function whose parameters (derived from $\phi$) are part of the function itself. Now, the stochasticity is isolated to $\epsilon$, which does not depend on $\phi$.
This allows us to rewrite the expectation:
$$\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] = \mathbb{E}_{p(\epsilon)}[\log p_\theta(x \mid g_\phi(x, \epsilon))]$$

When we take a Monte Carlo estimate of this expectation using a single sample $\epsilon^{(l)} \sim p(\epsilon)$, the term becomes $\log p_\theta(x \mid g_\phi(x, \epsilon^{(l)}))$. Now, we can compute the gradient $\nabla_\phi \log p_\theta(x \mid g_\phi(x, \epsilon^{(l)}))$ because $g_\phi(x, \epsilon^{(l)})$ is a deterministic function of $\phi$ (given $x$ and $\epsilon^{(l)}$). The gradient can flow through the function $g$ back to the parameters $\phi$.
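Continuing the earlier sketch, the same stand-in loss becomes differentiable once $z$ is constructed as a deterministic function of the parameters and independent noise (this is what PyTorch's `Normal.rsample()` does internally):

```python
import torch

# The same hypothetical encoder outputs as before.
mu = torch.tensor([0.5, -0.3], requires_grad=True)
sigma = torch.tensor([1.0, 0.8], requires_grad=True)

# Reparameterized sampling: the stochasticity lives only in eps.
eps = torch.randn_like(mu)   # eps ~ N(0, I), independent of phi
z = mu + sigma * eps         # z = g_phi(x, eps), deterministic given eps

loss = z.pow(2).sum()        # same stand-in for the reconstruction term
loss.backward()              # gradients now flow through z to mu and sigma
print(mu.grad, sigma.grad)   # both populated
```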
This trick is most commonly illustrated and used when $q_\phi(z|x)$ is a multivariate Gaussian distribution. Let's assume a diagonal covariance structure:
$$q_\phi(z|x) = \mathcal{N}\big(z \mid \mu_\phi(x), \operatorname{diag}(\sigma^2_{\phi,1}(x), \ldots, \sigma^2_{\phi,J}(x))\big)$$

Here, the encoder network, parameterized by $\phi$, outputs the mean vector $\mu_\phi(x)$ and the variance vector (or standard deviation vector) $\sigma^2_\phi(x)$ for each input $x$.
To reparameterize, we first sample a noise vector $\epsilon \sim \mathcal{N}(0, I)$, where $I$ is the identity matrix and $\epsilon$ has the same dimensionality as $z$. Then, we can generate $z$ as:
$$z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon$$

where $\sigma_\phi(x)$ is the vector of standard deviations (the element-wise square root of the variances) and $\odot$ denotes the element-wise product.
Now, $z$ is still a random variable distributed according to $q_\phi(z|x)$, but it's expressed as a deterministic transformation of $\mu_\phi(x)$, $\sigma_\phi(x)$, and the independent noise $\epsilon$. The gradients of the loss with respect to $\mu_\phi(x)$ and $\sigma_\phi(x)$ (and thus $\phi$) can be computed via standard backpropagation through this transformation.
In practice, encoders often output $\mu_\phi(x)$ and the log-variance $\log \sigma^2_\phi(x)$. This improves numerical stability and guarantees that the variance $\sigma^2_\phi(x)$ is always positive. If the network outputs $\log \sigma^2_\phi(x)$, then $\sigma_\phi(x) = \exp(0.5 \cdot \log \sigma^2_\phi(x))$.
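A minimal sketch of how this looks in a PyTorch encoder; the module name, layer sizes, and `reparameterize` helper are illustrative choices, not a fixed API:

```python
import torch
import torch.nn as nn

class GaussianEncoder(nn.Module):
    """Minimal encoder head: maps x to (mu, log_var) of q_phi(z|x)."""
    def __init__(self, x_dim: int, z_dim: int, hidden: int = 128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU())
        self.to_mu = nn.Linear(hidden, z_dim)
        self.to_log_var = nn.Linear(hidden, z_dim)  # unconstrained output

    def forward(self, x):
        h = self.body(x)
        return self.to_mu(h), self.to_log_var(h)

def reparameterize(mu, log_var):
    # sigma = exp(0.5 * log_var) guarantees positivity and is numerically
    # safer than having the network predict sigma or sigma^2 directly.
    sigma = torch.exp(0.5 * log_var)
    eps = torch.randn_like(sigma)   # eps ~ N(0, I), independent of phi
    return mu + sigma * eps         # z = mu + sigma ⊙ eps

# Usage with dummy data:
encoder = GaussianEncoder(x_dim=784, z_dim=20)
x = torch.randn(32, 784)            # a dummy mini-batch
mu, log_var = encoder(x)
z = reparameterize(mu, log_var)     # differentiable w.r.t. encoder params
```

Predicting the log-variance keeps the network's output unconstrained; a softplus head that outputs $\sigma$ directly is a common alternative.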
The following diagram illustrates how the reparameterization trick alters the computation graph to allow gradient flow.
The diagram shows the computational path. Before reparameterization, the sampling node for $z$ blocks gradient flow to the encoder parameters $\phi$. After reparameterization, $z$ is computed deterministically from the encoder outputs and an independent noise source $\epsilon$, enabling gradients to flow back to $\phi$.
The reparameterization trick is an instance of what's known in the literature as a pathwise derivative estimator. The core idea is to move the differentiation operator inside the expectation. We want to compute:
$$\nabla_\phi \, \mathbb{E}_{q_\phi(z|x)}[f(z)]$$

where $f(z) = \log p_\theta(x|z)$. If $z = g_\phi(x, \epsilon)$ with $\epsilon \sim p(\epsilon)$, this gradient becomes:
$$\nabla_\phi \, \mathbb{E}_{p(\epsilon)}[f(g_\phi(x, \epsilon))]$$

Since $p(\epsilon)$ does not depend on $\phi$, and assuming $f$ and $g_\phi$ are sufficiently well-behaved (differentiable), we can swap the gradient and expectation operators:
$$\mathbb{E}_{p(\epsilon)}[\nabla_\phi f(g_\phi(x, \epsilon))]$$

Now, the gradient is inside the expectation. We can approximate this expectation using Monte Carlo sampling: draw $L$ samples $\epsilon^{(1)}, \ldots, \epsilon^{(L)}$ from $p(\epsilon)$, and the gradient estimate is:
$$\frac{1}{L} \sum_{l=1}^{L} \nabla_\phi f(g_\phi(x, \epsilon^{(l)}))$$

In practice, for VAE training, we often use $L = 1$ sample of $\epsilon$ per data point $x$ in a mini-batch. The term $\nabla_\phi f(g_\phi(x, \epsilon^{(l)}))$ can be computed using the chain rule:
$$\nabla_\phi f(g_\phi(x, \epsilon^{(l)})) = \left. \frac{\partial f}{\partial z} \right|_{z = g_\phi(x, \epsilon^{(l)})} \cdot \frac{\partial g_\phi(x, \epsilon^{(l)})}{\partial \phi}$$

This is precisely what automatic differentiation libraries (like those in TensorFlow or PyTorch) compute during backpropagation.
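We can check this chain rule numerically. In the sketch below, $f(z) = z^2$ stands in for $\log p_\theta(x|z)$, so $\partial f / \partial z = 2z$, while $\partial z / \partial \mu = 1$ and $\partial z / \partial \sigma = \epsilon$:

```python
import torch

mu = torch.tensor(0.5, requires_grad=True)
sigma = torch.tensor(1.2, requires_grad=True)
eps = torch.randn(())               # one sample eps^(l) ~ N(0, 1), i.e. L = 1

z = mu + sigma * eps                # g_phi(x, eps)
f = z ** 2                          # stand-in for log p_theta(x|z)
f.backward()

# Chain rule by hand: df/dz = 2z, dz/dmu = 1, dz/dsigma = eps.
print(torch.allclose(mu.grad, 2 * z.detach()))           # True
print(torch.allclose(sigma.grad, 2 * z.detach() * eps))  # True
```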
The primary benefit of the reparameterization trick is that it allows end-to-end training of VAEs with standard backpropagation. It typically yields gradient estimators with much lower variance than alternative methods for handling stochastic nodes, such as the score function estimator (also known as REINFORCE or the log-derivative trick). Lower-variance gradients generally lead to more stable and faster training.
The reparameterization trick is applicable whenever we can express the random variable $z$ as a deterministic and differentiable transformation of its parameters and an independent noise source. This works for many continuous distributions: location-scale families such as the Gaussian, Laplace, and logistic, as well as distributions with a tractable inverse CDF, such as the exponential.
However, it's not universally applicable. For discrete latent variables, where $z$ takes values in a discrete set, this direct reparameterization is not possible: the mapping from a continuous $\epsilon$ to a discrete $z$ would involve non-differentiable operations (like rounding or argmax). For such cases, other techniques are employed, such as the Gumbel-Softmax trick (a continuous relaxation, sketched below) or score function estimators, though they come with their own challenges.
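For completeness, here is a brief sketch of the Gumbel-Softmax relaxation using PyTorch's built-in `torch.nn.functional.gumbel_softmax`; the logits are random placeholders:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(32, 10, requires_grad=True)  # scores over 10 categories

# Continuous relaxation of a categorical sample, differentiable in logits.
z_soft = F.gumbel_softmax(logits, tau=1.0, hard=False)

# hard=True returns one-hot samples in the forward pass but uses the soft
# relaxation's gradient in the backward pass (straight-through estimator).
z_hard = F.gumbel_softmax(logits, tau=1.0, hard=True)
z_hard.sum().backward()
print(logits.grad is not None)  # True: gradients flow despite discreteness
```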
By applying the reparameterization trick, the reconstruction term $\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]$ in the ELBO becomes differentiable with respect to the encoder parameters $\phi$. The KL divergence term $D_{KL}(q_\phi(z|x) \,\|\, p(z))$ is often analytically differentiable if $q_\phi(z|x)$ and $p(z)$ are chosen appropriately (e.g., both Gaussians). For example, if $p(z) = \mathcal{N}(0, I)$ and $q_\phi(z|x) = \mathcal{N}(\mu_\phi(x), \operatorname{diag}(\sigma^2_\phi(x)))$, the KL divergence is:
$$D_{KL}(q_\phi(z|x) \,\|\, p(z)) = \frac{1}{2} \sum_{j=1}^{J} \left( \mu_{\phi,j}(x)^2 + \sigma^2_{\phi,j}(x) - \log \sigma^2_{\phi,j}(x) - 1 \right)$$

This expression is clearly differentiable with respect to $\mu_{\phi,j}(x)$ and $\sigma^2_{\phi,j}(x)$. Thus, the entire ELBO can be optimized using gradient ascent (or gradient descent on the negative ELBO) with respect to both $\phi$ and $\theta$.
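Putting the two terms together, here is a sketch of the negative-ELBO loss under common modeling assumptions: a Bernoulli decoder (e.g., for binarized images) and a $\mathcal{N}(0, I)$ prior. The function name and signature are illustrative:

```python
import torch
import torch.nn.functional as F

def negative_elbo(x, x_recon_logits, mu, log_var):
    """Negative ELBO; minimizing this maximizes the ELBO."""
    # Reconstruction term: -E_q[log p(x|z)], estimated with the single
    # reparameterized sample that produced x_recon_logits.
    recon = F.binary_cross_entropy_with_logits(
        x_recon_logits, x, reduction="sum"
    )
    # Analytic KL(q_phi(z|x) || N(0, I)), summed over latent dims and batch.
    kl = 0.5 * torch.sum(mu.pow(2) + log_var.exp() - log_var - 1)
    return recon + kl
```

With $L = 1$ Monte Carlo sample per data point, this is the standard VAE training loss; averaging over the batch instead of summing only rescales the gradients.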
In summary, the reparameterization trick is a fundamental technique in training VAEs. It cleverly sidesteps the issue of differentiating through a sampling process by reformulating the generation of the latent variables $z$. This makes the ELBO objective amenable to standard gradient-based optimization, allowing gradients to flow from the reconstruction loss back to the parameters of the encoder network. This innovation was a significant step in making VAEs practical and effective.