Semi-supervised learning (SSL) addresses a common scenario in machine learning: an abundance of unlabeled data alongside a much smaller, often expensive-to-acquire, set of labeled data. The objective is to leverage both data types to build models that outperform those trained on labeled data alone. Variational Autoencoders, with their inherent ability to learn rich data representations from unlabeled inputs, offer a powerful framework for SSL.
The core idea behind using VAEs in SSL is that the latent space z learned by the VAE from all available data (both labeled XL and unlabeled XU) can provide a more structured and informative representation for a subsequent classification task than the raw input data x. This is particularly advantageous when labeled data is scarce.
A prevalent and effective strategy involves a VAE that learns a mapping from x to z and back to x, coupled with a classifier that operates on these learned latent representations.
VAE for Representation Learning: The VAE, consisting of an encoder qϕ(z∣x) and a decoder pθ(x∣z), is trained on the entire dataset (XL∪XU). Its objective is to maximize the Evidence Lower Bound (ELBO):
$$\mathcal{L}_{\text{VAE}}(x) = \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - D_{KL}\left(q_\phi(z \mid x) \,\|\, p(z)\right)$$

This process forces the encoder to capture salient features of the input data within the latent space z. The prior p(z) is typically a standard Gaussian N(0,I).
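As a concrete reference point, the sketch below shows a minimal Gaussian-latent VAE and its negative ELBO in PyTorch. It assumes flattened, binarized inputs (for example 784-dimensional MNIST digits); the layer sizes, dimensions, and names (`VAE`, `negative_elbo`) are illustrative choices, not a prescribed architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    """Minimal Gaussian-latent VAE: encoder q_phi(z|x) and decoder p_theta(x|z)."""
    def __init__(self, x_dim=784, h_dim=256, z_dim=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.enc_mu = nn.Linear(h_dim, z_dim)
        self.enc_logvar = nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(
            nn.Linear(z_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, x_dim)
        )

    def encode(self, x):
        h = self.enc(x)
        return self.enc_mu(h), self.enc_logvar(h)

    def reparameterize(self, mu, logvar):
        # z = mu + sigma * eps, with eps ~ N(0, I)
        return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        x_logits = self.dec(z)  # Bernoulli logits for p_theta(x|z)
        return x_logits, mu, logvar, z

def negative_elbo(x, x_logits, mu, logvar):
    """-L_VAE(x): reconstruction term plus KL(q_phi(z|x) || N(0, I))."""
    recon = F.binary_cross_entropy_with_logits(x_logits, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```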
Classifier on Latent Space: A separate classification network, fψ(y∣z), is trained to predict labels y using the latent codes zL derived from the labeled portion of the data, XL. The classification loss, LC, is commonly the cross-entropy between the true labels yL and the predicted labels y^L=fψ(zL).
$$\mathcal{L}_C(y_L, \hat{y}_L) = -\sum_i y_{L,i} \log \hat{y}_{L,i}$$

These two components are often trained jointly. The total objective function combines the VAE's ELBO (applied to all data) and the classification loss (applied only to labeled data), weighted by a hyperparameter γ:
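A latent-space classifier can be as simple as a small MLP trained with cross-entropy on encoded labeled examples. The sketch below reuses the VAE from the previous snippet; the class `LatentClassifier` and its layer sizes are hypothetical choices for illustration.

```python
class LatentClassifier(nn.Module):
    """f_psi(y|z): predicts class logits from a latent code z."""
    def __init__(self, z_dim=32, n_classes=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim, 64), nn.ReLU(), nn.Linear(64, n_classes)
        )

    def forward(self, z):
        return self.net(z)

def classification_loss(classifier, z_labeled, y_labeled):
    """L_C: cross-entropy between true labels and f_psi(z) on labeled latent codes."""
    logits = classifier(z_labeled)
    return F.cross_entropy(logits, y_labeled, reduction="sum")
```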
$$\mathcal{J}_{\text{SSL}} = \sum_{x \in X_L \cup X_U} \left( \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - D_{KL}\left(q_\phi(z \mid x) \,\|\, p(z)\right) \right) + \gamma \sum_{(x_l, y_l) \in D_L} \mathcal{L}_C\left(y_l, f_\psi(q_\phi(z \mid x_l))\right)$$

The hyperparameter γ balances the contribution of the generative (VAE) task and the discriminative (classification) task. A diagram illustrating this common architecture is shown below, followed by a sketch of a joint training step.
A common architecture for semi-supervised learning with VAEs. The VAE processes both labeled and unlabeled data to learn representations. A classifier then uses these representations from labeled data for supervised training.
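Putting the two pieces together, a joint training step might look like the following sketch. It assumes the `VAE`, `negative_elbo`, `LatentClassifier`, and `classification_loss` definitions above, uses the posterior mean as the representation fed to the classifier (a common, simple choice), and treats `gamma=10.0` as an illustrative default rather than a recommended value.

```python
def ssl_training_step(vae, classifier, optimizer,
                      x_unlabeled, x_labeled, y_labeled, gamma=10.0):
    """One update of J_SSL: negative ELBO on all data + gamma * classification loss on labeled data."""
    optimizer.zero_grad()

    # Generative term: negative ELBO on labeled and unlabeled inputs alike.
    x_all = torch.cat([x_labeled, x_unlabeled], dim=0)
    x_logits, mu, logvar, _ = vae(x_all)
    loss_vae = negative_elbo(x_all, x_logits, mu, logvar)

    # Discriminative term: classify latent codes of the labeled batch only.
    mu_l, _ = vae.encode(x_labeled)
    loss_cls = classification_loss(classifier, mu_l, y_labeled)

    loss = loss_vae + gamma * loss_cls
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this setup a single optimizer would cover both networks, e.g. `torch.optim.Adam(list(vae.parameters()) + list(classifier.parameters()), lr=1e-3)`, so that the supervised gradient also shapes the encoder.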
More deeply integrated approaches, such as the M2 model proposed by Kingma et al. (2014), treat y as a random variable within the generative model itself (their simpler M1 model corresponds to the feature-extraction approach described above). For instance, such a model can posit a generative process like p(y)→p(z∣y)→p(x∣z): the label y influences the generation of the latent variable z, which in turn generates x.
The objective function for such models becomes more involved. For labeled data (xl,yl), the ELBO is formulated to reflect logp(xl,yl). For unlabeled data xu, the model must marginalize over the unknown labels. This is typically achieved by introducing an auxiliary inference network qψ(y∣xu) that predicts a distribution over labels for unlabeled samples. The ELBO for unlabeled data then involves an expectation over these predicted label distributions:
$$\mathcal{L}_{\text{unsup}}(x_u) = \sum_k q_\psi(y=k \mid x_u) \left( \mathbb{E}_{q_\phi(z \mid x_u)}\left[\log p_\theta(x_u \mid z)\right] + \log p(y=k) - D_{KL}\left(q_\phi(z \mid x_u) \,\|\, p(z \mid y=k)\right) \right) + \mathcal{H}\left(q_\psi(y \mid x_u)\right)$$

The term p(z∣y=k) is a class-conditional prior for the latent variables, and H(qψ(y∣xu)) is an entropy term that arises from marginalizing over the unknown label and regularizes the classifier, discouraging overconfident predictions on unlabeled samples. The classifier qψ(y∣x) is also trained discriminatively using the labeled data, often via an additional term in the total loss function, such as:
$$\mathcal{L}_{\text{classifier-sup}} = -\sum_{(x_l, y_l) \in D_L} \log q_\psi(y_l \mid x_l)$$

These integrated models can, in theory, better capture the underlying joint distribution p(x,y), potentially leading to improved generative and classification performance. However, they also introduce greater model complexity and more intricate training dynamics.
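To make the marginalization over unknown labels concrete, the sketch below computes the negative unlabeled bound for an M2-style model, reusing the `VAE` from earlier. It assumes a classifier network `classifier_qy` that maps raw inputs x to class logits (distinct from the latent-space classifier above), class-conditional Gaussian priors p(z∣y=k) = N(m_k, I) parameterized by a `prior_means` tensor of shape (K, z_dim), and a uniform prior p(y); all of these names and modeling choices are illustrative assumptions.

```python
def m2_unlabeled_loss(vae, classifier_qy, prior_means, x_u):
    """Negative unlabeled bound: marginalize the unknown label with q_psi(y|x_u),
    using class-conditional priors p(z|y=k) = N(prior_means[k], I), plus entropy."""
    probs = F.softmax(classifier_qy(x_u), dim=-1)           # q_psi(y|x_u), shape (B, K)
    mu, logvar = vae.encode(x_u)
    z = vae.reparameterize(mu, logvar)
    x_logits = vae.dec(z)
    # Per-example reconstruction term: -E_q[log p_theta(x_u|z)]
    recon = F.binary_cross_entropy_with_logits(
        x_logits, x_u, reduction="none").sum(dim=-1)

    K = probs.size(1)
    log_py = -torch.log(torch.tensor(float(K)))              # uniform prior: log p(y=k) = -log K
    per_class = []
    for k in range(K):
        # KL(q_phi(z|x_u) || N(m_k, I)) for the class-conditional prior mean m_k
        m_k = prior_means[k]
        kl_k = 0.5 * torch.sum(
            logvar.exp() + (mu - m_k) ** 2 - 1.0 - logvar, dim=-1)
        # Negative ELBO under the hypothesis y = k
        per_class.append(recon + kl_k - log_py)
    neg_elbo_per_class = torch.stack(per_class, dim=1)        # (B, K)

    entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=-1)  # H(q_psi(y|x_u))
    loss_u = (probs * neg_elbo_per_class).sum(dim=-1) - entropy
    return loss_u.sum()
```

The supervised classifier term from the equation above would then simply be `F.cross_entropy(classifier_qy(x_l), y_l)` on the labeled batch, added to the total loss alongside the labeled-data ELBO.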
Despite these challenges, VAEs provide a versatile and principled framework for semi-supervised learning. By effectively combining the VAE's capacity for unsupervised representation learning with supervised signals from limited labeled data, these models can achieve strong performance on tasks where labeled data is a bottleneck. The choice between a simpler feature-extraction approach and a more integrated generative model often depends on the specific problem, the amount of available data, and computational resources.