Standard Variational Autoencoders, as we've seen, provide a powerful framework for learning latent representations and generating new data points by sampling from the prior distribution $p(z)$ and passing the sample through the decoder $p_\theta(x \mid z)$. However, this generation process is typically unconditional. We sample a $z$ and get an $x$, but we lack fine-grained control over what kind of $x$ is generated. Imagine training a VAE on images of handwritten digits (0-9). While it might generate realistic-looking digits, we cannot directly ask it to generate, say, only the digit '7'.
Conditional Variational Autoencoders (CVAEs) extend the VAE framework to address this limitation by incorporating conditional information, often denoted as y, into the modeling process. This variable y can represent labels, attributes, or any other side information relevant to the data x. By conditioning both the encoder and decoder on y, CVAEs allow us to control the generation process.
The core idea behind CVAEs is to make both the inference (encoding) and generative (decoding) processes dependent on the conditional variable y.
The objective function for a CVAE is derived similarly to the standard VAE, but incorporates the condition $y$. We aim to maximize the conditional log-likelihood $\log p_\theta(x \mid y)$. The corresponding Evidence Lower Bound (ELBO) becomes:
$$
\mathcal{L}_{\text{CVAE}}(x, y; \theta, \phi) = \mathbb{E}_{q_\phi(z \mid x, y)}\left[\log p_\theta(x \mid z, y)\right] - D_{\text{KL}}\left(q_\phi(z \mid x, y) \,\|\, p(z \mid y)\right)
$$

Let's break down this objective:

- The first term is the expected reconstruction log-likelihood: latent codes $z$ sampled from the encoder $q_\phi(z \mid x, y)$ should allow the decoder $p_\theta(x \mid z, y)$ to reconstruct $x$ accurately, given the condition $y$.
- The second term is a regularizer: the KL divergence keeps the approximate posterior $q_\phi(z \mid x, y)$ close to the prior over latent variables, which may itself depend on $y$.
A common simplification is to assume the prior distribution over the latent variables is independent of the condition $y$, meaning $p(z \mid y) = p(z)$. In many practical applications, $p(z)$ is chosen to be a standard multivariate Gaussian, $\mathcal{N}(0, I)$. Under this assumption, the KL divergence term becomes $D_{\text{KL}}(q_\phi(z \mid x, y) \,\|\, p(z))$, and the ELBO simplifies to:
$$
\mathcal{L}_{\text{CVAE}}(x, y; \theta, \phi) = \mathbb{E}_{q_\phi(z \mid x, y)}\left[\log p_\theta(x \mid z, y)\right] - D_{\text{KL}}\left(q_\phi(z \mid x, y) \,\|\, p(z)\right)
$$

Maximizing this ELBO trains the encoder and decoder networks (with parameters $\phi$ and $\theta$) to reconstruct inputs accurately while ensuring the conditional latent space structure aligns with the simple prior $p(z)$.
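To make the objective concrete, here is a minimal sketch of this simplified ELBO as a training loss in PyTorch. It assumes a Bernoulli decoder (outputs in [0, 1], as for binarized images) and a diagonal-Gaussian encoder returning a mean `mu` and log-variance `logvar`; the function name and these conventions are illustrative assumptions, not fixed by the derivation above.

```python
import torch
import torch.nn.functional as F

def cvae_loss(x_recon, x, mu, logvar):
    """Negative CVAE ELBO, to be minimized during training."""
    # Reconstruction term: -E_q[log p_theta(x | z, y)] for a Bernoulli
    # decoder, i.e., binary cross-entropy summed over dimensions and batch.
    recon = F.binary_cross_entropy(x_recon, x, reduction="sum")
    # KL term: D_KL(q_phi(z | x, y) || N(0, I)) in closed form for a
    # diagonal-Gaussian posterior with mean mu and log-variance logvar.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```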
Integrating the condition $y$ into the neural networks of the encoder and decoder is typically straightforward. If $y$ is categorical (like a digit label), it's often converted to a one-hot vector or an embedding vector. This vector representation of $y$ is then concatenated with the other inputs to the respective networks, as sketched in the code that follows:

- Encoder: $y$ is concatenated with the input $x$ (or with an intermediate feature representation of $x$), so the network parameterizes $q_\phi(z \mid x, y)$.
- Decoder: $y$ is concatenated with the latent sample $z$, so the network parameterizes $p_\theta(x \mid z, y)$.
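The following PyTorch sketch shows one common way to wire this up with fully connected layers. The class name `CVAE`, the layer sizes, and the MNIST-style dimensions (784-dimensional flattened images, 10 one-hot classes) are assumptions for illustration:

```python
import torch
import torch.nn as nn

class CVAE(nn.Module):
    """Minimal fully-connected CVAE; dimensions are illustrative."""

    def __init__(self, x_dim=784, y_dim=10, z_dim=20, h_dim=400):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(x_dim + y_dim, h_dim), nn.ReLU())
        self.fc_mu = nn.Linear(h_dim, z_dim)      # mean of q_phi(z | x, y)
        self.fc_logvar = nn.Linear(h_dim, z_dim)  # log-variance of q_phi(z | x, y)
        self.decoder = nn.Sequential(
            nn.Linear(z_dim + y_dim, h_dim), nn.ReLU(),
            nn.Linear(h_dim, x_dim), nn.Sigmoid(),  # Bernoulli means for p_theta(x | z, y)
        )

    def forward(self, x, y):
        # Encoder input [x; y]: the condition is concatenated with the data.
        h = self.encoder(torch.cat([x, y], dim=1))
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        # Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I).
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        # Decoder input [z; y]: the condition is concatenated with the latent code.
        x_recon = self.decoder(torch.cat([z, y], dim=1))
        return x_recon, mu, logvar
```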
The following diagram illustrates the data flow in a CVAE:
Data flow in a Conditional Variational Autoencoder. The condition $y$ is provided as input to both the encoder and the decoder networks, enabling controlled generation and representation learning.
Once the CVAE is trained, generating a sample $x$ corresponding to a specific condition $y$ is direct; the encoder is not needed at this stage:

- Choose the desired condition $y$ (e.g., the one-hot vector for a class label).
- Sample a latent vector $z$ from the prior $p(z)$, typically $\mathcal{N}(0, I)$.
- Pass $z$ and $y$ together through the decoder $p_\theta(x \mid z, y)$ to obtain the generated sample.
For instance, using a CVAE trained on MNIST, you could generate an image of the digit '3' by providing the one-hot vector for '3' as y along with a random sample z to the decoder. By varying z while keeping y fixed, you can generate different stylistic variations of the digit '3'.
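A sketch of this procedure, reusing the illustrative `CVAE` class from above (with its assumed 20-dimensional latent space and 10 one-hot classes):

```python
import torch

# In practice `model` would be a trained instance, not a fresh one.
model = CVAE()
model.eval()
with torch.no_grad():
    y = torch.zeros(8, 10)     # batch of 8 condition vectors
    y[:, 3] = 1.0              # one-hot condition: the digit '3'
    z = torch.randn(8, 20)     # z ~ p(z) = N(0, I), one per sample
    # Fixing y while varying z yields stylistic variations of the same digit.
    samples = model.decoder(torch.cat([z, y], dim=1))
```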
CVAEs open up possibilities for controlled generation in various domains, for example:

- Class-conditional image synthesis, such as generating digits or objects of a requested category.
- Attribute-guided generation, e.g., producing faces with a specified hair color or expression.
- Structured output prediction, where $y$ is an observed input (such as a partial image) and $x$ is the structured output to be generated.
By allowing external information to guide the generative process, CVAEs provide a significant enhancement over standard VAEs for tasks requiring targeted output synthesis. They represent a key step towards more controllable and versatile generative models.