In our exploration of Variational Autoencoders, we've established that the ELBO, $\mathcal{L}_{\text{ELBO}}$, serves as our objective function. This objective involves an expectation over the approximate posterior $q_\phi(z|x)$:
$$\mathcal{L}_{\text{ELBO}} = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}(q_\phi(z|x) \,\|\, p(z))$$

To optimize this objective using gradient-based methods like stochastic gradient descent (SGD), we need to compute gradients with respect to the parameters of both the encoder ($\phi$) and the decoder ($\theta$). The KL divergence term, $D_{KL}(q_\phi(z|x) \,\|\, p(z))$, can often be computed analytically (as we'll see for Gaussian distributions), and its gradients with respect to $\phi$ derived.
However, the reconstruction term, $\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]$, presents a challenge. Specifically, how do we backpropagate gradients through the sampling process $z \sim q_\phi(z|x)$ to update $\phi$? Sampling from $q_\phi(z|x)$ is a stochastic operation, and naively, it creates a non-differentiable point in our computation graph. This is where the reparameterization trick comes into play.
Imagine the encoder network outputs the parameters of the distribution $q_\phi(z|x)$, for instance, the mean $\mu_\phi(x)$ and standard deviation $\sigma_\phi(x)$ if $q_\phi(z|x)$ is Gaussian. Then, we sample $z$ from this distribution. The decoder takes this $z$ and tries to reconstruct $x$.
If we try to backpropagate the reconstruction loss, the gradient needs to pass from the decoder, through $z$, and then to the parameters $\mu_\phi(x)$ and $\sigma_\phi(x)$ of the encoder. The sampling step $z \sim q_\phi(z|x)$ is problematic because drawing a random sample is not directly differentiable with respect to the parameters of the distribution we are sampling from. The path for gradient flow is effectively cut.
Consider a simplified view: $x \rightarrow \text{Encoder}(\phi) \rightarrow \text{parameters of } q_\phi(z|x) \rightarrow \text{sample } z \rightarrow \text{Decoder}(\theta) \rightarrow \log p_\theta(x|z)$. The gradient $\nabla_\phi \log p_\theta(x|z)$ is what we need for the reconstruction term, but the sampling step blocks it.
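To make the blockage concrete, here is a minimal PyTorch sketch. The tensors `mu` and `sigma` are hypothetical stand-ins for encoder outputs, and the squared norm stands in for the reconstruction term:

```python
import torch

# Hypothetical encoder outputs for a single input x.
mu = torch.tensor([0.5, -0.3], requires_grad=True)
sigma = torch.tensor([1.0, 0.8], requires_grad=True)

# Naive sampling: .sample() draws z ~ N(mu, sigma) but detaches the result
# from the computation graph, so there is no path back to mu or sigma.
z = torch.distributions.Normal(mu, sigma).sample()
print(z.requires_grad)  # False: the stochastic node cut the graph

# A stand-in for the reconstruction term log p(x|z):
loss = z.pow(2).sum()
# loss.backward() would raise a RuntimeError here, because loss does not
# depend on any tensor that requires gradients.
```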
The reparameterization trick provides an elegant solution by reframing the sampling process. Instead of sampling $z$ directly from $q_\phi(z|x)$, we introduce an auxiliary noise variable $\epsilon$ sampled from a simple, fixed distribution $p(\epsilon)$ (e.g., a standard normal distribution $\mathcal{N}(0, I)$). Then, we express $z$ as a deterministic function $g_\phi(x, \epsilon)$ that transforms this noise $\epsilon$ using the parameters from the encoder.
So, $z = g_\phi(x, \epsilon)$, where $\epsilon \sim p(\epsilon)$. The key is that $g_\phi(x, \epsilon)$ is a deterministic function whose parameters (derived from $\phi$) are part of the function itself. Now, the stochasticity is isolated to $\epsilon$, which does not depend on $\phi$.
This allows us to rewrite the expectation:
$$\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] = \mathbb{E}_{p(\epsilon)}[\log p_\theta(x \mid g_\phi(x, \epsilon))]$$

When we take a Monte Carlo estimate of this expectation using a single sample $\epsilon^{(l)} \sim p(\epsilon)$, the term becomes $\log p_\theta(x \mid g_\phi(x, \epsilon^{(l)}))$. Now, we can compute the gradient $\nabla_\phi \log p_\theta(x \mid g_\phi(x, \epsilon^{(l)}))$ because $g_\phi(x, \epsilon^{(l)})$ is a deterministic function of $\phi$ (given $x$ and $\epsilon^{(l)}$). The gradient can flow through the function $g$ back to the parameters $\phi$.
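Continuing the earlier sketch, the same stand-in loss becomes differentiable once $z$ is constructed as a deterministic function of the parameters and independent noise (this is what PyTorch's `Normal.rsample()` does internally):

```python
import torch

# The same hypothetical encoder outputs as before.
mu = torch.tensor([0.5, -0.3], requires_grad=True)
sigma = torch.tensor([1.0, 0.8], requires_grad=True)

# Reparameterized sampling: the stochasticity lives only in eps.
eps = torch.randn_like(mu)   # eps ~ N(0, I), independent of phi
z = mu + sigma * eps         # z = g_phi(x, eps), deterministic given eps

loss = z.pow(2).sum()        # same stand-in for the reconstruction term
loss.backward()              # gradients now flow through z to mu and sigma
print(mu.grad, sigma.grad)   # both populated
```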
This trick is most commonly illustrated and used when $q_\phi(z|x)$ is a multivariate Gaussian distribution. Let's assume a diagonal covariance structure:
$$q_\phi(z|x) = \mathcal{N}\big(z \mid \mu_\phi(x), \operatorname{diag}(\sigma^2_{\phi,1}(x), \ldots, \sigma^2_{\phi,J}(x))\big)$$

Here, the encoder network, parameterized by $\phi$, outputs the mean vector $\mu_\phi(x)$ and the variance vector (or standard deviation vector) $\sigma^2_\phi(x)$ for each input $x$.
To reparameterize, we first sample a noise vector $\epsilon \sim \mathcal{N}(0, I)$, where $I$ is the identity matrix and $\epsilon$ has the same dimensionality as $z$. Then, we can generate $z$ as:
$$z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon$$

where $\sigma_\phi(x)$ is the vector of standard deviations (the element-wise square root of the variances) and $\odot$ denotes the element-wise product.
Now, $z$ is still a random variable distributed according to $q_\phi(z|x)$, but it's expressed as a deterministic transformation of $\mu_\phi(x)$, $\sigma_\phi(x)$, and the independent noise $\epsilon$. The gradients of the loss with respect to $\mu_\phi(x)$ and $\sigma_\phi(x)$ (and thus $\phi$) can be computed via standard backpropagation through this transformation.
In practice, encoders often output $\mu_\phi(x)$ and the log-variance $\log \sigma^2_\phi(x)$. This improves numerical stability and guarantees that the variance $\sigma^2_\phi(x)$ is always positive. If the network outputs $\log \sigma^2_\phi(x)$, then $\sigma_\phi(x) = \exp(0.5 \cdot \log \sigma^2_\phi(x))$.
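A minimal sketch of how this looks in a PyTorch encoder; the module name, layer sizes, and `reparameterize` helper are illustrative choices, not a fixed API:

```python
import torch
import torch.nn as nn

class GaussianEncoder(nn.Module):
    """Minimal encoder head: maps x to (mu, log_var) of q_phi(z|x)."""
    def __init__(self, x_dim: int, z_dim: int, hidden: int = 128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU())
        self.to_mu = nn.Linear(hidden, z_dim)
        self.to_log_var = nn.Linear(hidden, z_dim)  # unconstrained output

    def forward(self, x):
        h = self.body(x)
        return self.to_mu(h), self.to_log_var(h)

def reparameterize(mu, log_var):
    # sigma = exp(0.5 * log_var) guarantees positivity and is numerically
    # safer than having the network predict sigma or sigma^2 directly.
    sigma = torch.exp(0.5 * log_var)
    eps = torch.randn_like(sigma)   # eps ~ N(0, I), independent of phi
    return mu + sigma * eps         # z = mu + sigma ⊙ eps

# Usage with dummy data:
encoder = GaussianEncoder(x_dim=784, z_dim=20)
x = torch.randn(32, 784)            # a dummy mini-batch
mu, log_var = encoder(x)
z = reparameterize(mu, log_var)     # differentiable w.r.t. encoder params
```

Predicting the log-variance keeps the network's output unconstrained; a softplus head that outputs $\sigma$ directly is a common alternative.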
The following diagram illustrates how the reparameterization trick alters the computation graph to allow gradient flow.
The diagram shows the computational path. Before reparameterization, the sampling node for $z$ blocks gradient flow to the encoder parameters $\phi$. After reparameterization, $z$ is computed deterministically from the encoder outputs and an independent noise source $\epsilon$, enabling gradients to flow back to $\phi$.
The reparameterization trick is an instance of what's known in the literature as a pathwise derivative estimator. The core idea is to move the differentiation operator inside the expectation. We want to compute:
$$\nabla_\phi \, \mathbb{E}_{q_\phi(z|x)}[f(z)]$$

where $f(z) = \log p_\theta(x|z)$. If $z = g_\phi(x, \epsilon)$ with $\epsilon \sim p(\epsilon)$, this gradient becomes:
$$\nabla_\phi \, \mathbb{E}_{p(\epsilon)}[f(g_\phi(x, \epsilon))]$$

Since $p(\epsilon)$ does not depend on $\phi$, and assuming $f$ and $g_\phi$ are sufficiently well-behaved (differentiable), we can swap the gradient and expectation operators:
$$\mathbb{E}_{p(\epsilon)}[\nabla_\phi f(g_\phi(x, \epsilon))]$$

Now, the gradient is inside the expectation. We can approximate this expectation using Monte Carlo sampling: draw $L$ samples $\epsilon^{(1)}, \ldots, \epsilon^{(L)}$ from $p(\epsilon)$, and the gradient estimate is:
$$\frac{1}{L} \sum_{l=1}^{L} \nabla_\phi f(g_\phi(x, \epsilon^{(l)}))$$

In practice, for VAE training, we often use $L = 1$ sample of $\epsilon$ per data point $x$ in a mini-batch. The term $\nabla_\phi f(g_\phi(x, \epsilon^{(l)}))$ can be computed using the chain rule:
$$\nabla_\phi f(g_\phi(x, \epsilon^{(l)})) = \left. \frac{\partial f}{\partial z} \right|_{z = g_\phi(x, \epsilon^{(l)})} \cdot \frac{\partial g_\phi(x, \epsilon^{(l)})}{\partial \phi}$$

This is precisely what automatic differentiation libraries (like those in TensorFlow or PyTorch) compute during backpropagation.
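We can check this chain rule numerically. In the sketch below, $f(z) = z^2$ stands in for $\log p_\theta(x|z)$, so $\partial f / \partial z = 2z$, while $\partial z / \partial \mu = 1$ and $\partial z / \partial \sigma = \epsilon$:

```python
import torch

mu = torch.tensor(0.5, requires_grad=True)
sigma = torch.tensor(1.2, requires_grad=True)
eps = torch.randn(())               # one sample eps^(l) ~ N(0, 1), i.e. L = 1

z = mu + sigma * eps                # g_phi(x, eps)
f = z ** 2                          # stand-in for log p_theta(x|z)
f.backward()

# Chain rule by hand: df/dz = 2z, dz/dmu = 1, dz/dsigma = eps.
print(torch.allclose(mu.grad, 2 * z.detach()))           # True
print(torch.allclose(sigma.grad, 2 * z.detach() * eps))  # True
```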
The primary benefit of the reparameterization trick is that it allows end-to-end training of VAEs with standard backpropagation. It typically yields gradient estimators with much lower variance than alternative methods for handling stochastic nodes, such as the score function estimator (also known as REINFORCE or the log-derivative trick). Lower-variance gradients generally lead to more stable and faster training.
The reparameterization trick is applicable whenever we can express the random variable $z$ as a deterministic and differentiable transformation of its parameters and an independent noise source. This works for many continuous distributions: location-scale families such as the Gaussian, Laplace, and logistic, as well as distributions with a tractable inverse CDF, such as the exponential.
However, it's not universally applicable. For discrete latent variables, where $z$ takes values in a discrete set, this direct reparameterization is not possible: the mapping from a continuous $\epsilon$ to a discrete $z$ would involve non-differentiable operations (like rounding or argmax). For such cases, other techniques are employed, such as the Gumbel-Softmax trick (a continuous relaxation, sketched below) or score function estimators, though they come with their own challenges.
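For completeness, here is a brief sketch of the Gumbel-Softmax relaxation using PyTorch's built-in `torch.nn.functional.gumbel_softmax`; the logits are random placeholders:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(32, 10, requires_grad=True)  # scores over 10 categories

# Continuous relaxation of a categorical sample, differentiable in logits.
z_soft = F.gumbel_softmax(logits, tau=1.0, hard=False)

# hard=True returns one-hot samples in the forward pass but uses the soft
# relaxation's gradient in the backward pass (straight-through estimator).
z_hard = F.gumbel_softmax(logits, tau=1.0, hard=True)
z_hard.sum().backward()
print(logits.grad is not None)  # True: gradients flow despite discreteness
```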
By applying the reparameterization trick, the reconstruction term $\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]$ in the ELBO becomes differentiable with respect to the encoder parameters $\phi$. The KL divergence term $D_{KL}(q_\phi(z|x) \,\|\, p(z))$ is often analytically differentiable if $q_\phi(z|x)$ and $p(z)$ are chosen appropriately (e.g., both Gaussians). For example, if $p(z) = \mathcal{N}(0, I)$ and $q_\phi(z|x) = \mathcal{N}(\mu_\phi(x), \operatorname{diag}(\sigma^2_\phi(x)))$, the KL divergence is:
$$D_{KL}(q_\phi(z|x) \,\|\, p(z)) = \frac{1}{2} \sum_{j=1}^{J} \left( \mu_{\phi,j}(x)^2 + \sigma^2_{\phi,j}(x) - \log \sigma^2_{\phi,j}(x) - 1 \right)$$

This expression is clearly differentiable with respect to $\mu_{\phi,j}(x)$ and $\sigma^2_{\phi,j}(x)$. Thus, the entire ELBO can be optimized using gradient ascent (or gradient descent on the negative ELBO) with respect to both $\phi$ and $\theta$.
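Putting the two terms together, here is a sketch of the negative-ELBO loss under common modeling assumptions: a Bernoulli decoder (e.g., for binarized images) and a $\mathcal{N}(0, I)$ prior. The function name and signature are illustrative:

```python
import torch
import torch.nn.functional as F

def negative_elbo(x, x_recon_logits, mu, log_var):
    """Negative ELBO; minimizing this maximizes the ELBO."""
    # Reconstruction term: -E_q[log p(x|z)], estimated with the single
    # reparameterized sample that produced x_recon_logits.
    recon = F.binary_cross_entropy_with_logits(
        x_recon_logits, x, reduction="sum"
    )
    # Analytic KL(q_phi(z|x) || N(0, I)), summed over latent dims and batch.
    kl = 0.5 * torch.sum(mu.pow(2) + log_var.exp() - log_var - 1)
    return recon + kl
```

With $L = 1$ Monte Carlo sample per data point, this is the standard VAE training loss; averaging over the batch instead of summing only rescales the gradients.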
In summary, the reparameterization trick is a fundamental technique in training VAEs. It cleverly sidesteps the issue of differentiating through a sampling process by reformulating the generation of the latent variables $z$. This makes the ELBO objective amenable to standard gradient-based optimization, allowing gradients to flow from the reconstruction loss back to the parameters of the encoder network. This innovation was a significant step in making VAEs practical and effective.