Because they learn both a compressed representation and a generative model of the training data, Variational Autoencoders offer a powerful framework for identifying samples that deviate from the norm. This makes them well suited for anomaly detection and out-of-distribution (OOD) detection tasks. The core assumption is that a VAE trained on "normal" data will reconstruct familiar samples well while struggling with anomalous or OOD inputs.
The most straightforward approach to anomaly detection with VAEs is based on the reconstruction probability. A VAE is trained to minimize the reconstruction error for data points similar to its training distribution. The Evidence Lower Bound (ELBO), which VAEs maximize, includes a reconstruction term, typically a log-likelihood such as:
$$\mathcal{L}_{\text{rec}}(x, \hat{x}) = \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right]$$

where $x$ is the input and $\hat{x}$ is its reconstruction through the VAE (i.e., $z \sim q_\phi(z \mid x)$, $\hat{x} \sim p_\theta(x \mid z)$).
For an input $x_{\text{new}}$, its reconstruction error (e.g., mean squared error for continuous data, or the negative log-likelihood $-\log p_\theta(x_{\text{new}} \mid z)$ for $z$ sampled from $q_\phi(z \mid x_{\text{new}})$) can serve as an anomaly score.
$$\text{AnomalyScore}(x_{\text{new}}) = \lVert x_{\text{new}} - \hat{x}_{\text{new}} \rVert^2$$

or, more generally, $-\log p_\theta(x_{\text{new}} \mid \hat{z})$, where $\hat{z}$ is the mean of $q_\phi(z \mid x_{\text{new}})$. Samples that are "normal" or in-distribution should have low reconstruction errors, while anomalous or OOD samples are expected to yield higher reconstruction errors because the VAE's decoder $p_\theta(x \mid z)$ has not been optimized to generate them accurately from any latent code $z$ produced by the encoder $q_\phi(z \mid x)$.
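As a concrete sketch, the snippet below computes this reconstruction-based score in PyTorch. It assumes a hypothetical trained model exposing `encode` (returning the mean and log-variance of $q_\phi(z \mid x)$) and `decode` (returning the reconstruction mean); these method names and tensor shapes are illustrative assumptions, not a specific library's API.

```python
import torch

def reconstruction_anomaly_score(vae, x):
    """Squared reconstruction error per sample (higher means more anomalous).

    Assumed (hypothetical) interface:
      vae.encode(x) -> (mu, logvar) of q_phi(z|x)
      vae.decode(z) -> reconstruction mean of p_theta(x|z)
    x: tensor of shape (batch, features)
    """
    vae.eval()
    with torch.no_grad():
        mu, logvar = vae.encode(x)
        x_hat = vae.decode(mu)                 # use the posterior mean as z_hat
        return ((x - x_hat) ** 2).sum(dim=1)   # ||x - x_hat||^2 per sample
```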
Reconstruction error scores for normal data tend to be clustered at lower values, while anomalous data points typically exhibit higher scores, allowing for separation.
Beyond reconstruction, the latent space z learned by the VAE provides another avenue for identifying anomalies.
Distance from Prior: The encoder $q_\phi(z \mid x)$ maps inputs to distributions in the latent space. For in-distribution data, these encoded distributions are regularized by the KL divergence term $D_{\text{KL}}(q_\phi(z \mid x) \,\|\, p_\theta(z))$ to stay close to the prior $p_\theta(z)$ (often a standard Normal distribution $\mathcal{N}(0, I)$). Anomalous inputs may be mapped to regions of the latent space far from where $p_\theta(z)$ has high density. The KL divergence value itself, or the Mahalanobis distance of the mean of $q_\phi(z \mid x)$ from the prior's mean, can be used as an anomaly score (see the sketch after this list).
Density in Latent Space: Normal data points are expected to map to denser regions of the learned latent manifold. Anomalies might fall into sparser, less populated areas. One could estimate the density around $z = \mu_\phi(x)$ using techniques like Kernel Density Estimation (KDE) on latent codes from a validation set of normal data, though this can be computationally intensive and sensitive to dimensionality.
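The sketch below illustrates both latent-space scores: the closed-form KL divergence from a standard Normal prior, and a KDE (here via scikit-learn) fit on latent means of held-out normal data. It reuses the hypothetical `vae.encode` interface from the previous snippet; the function names and bandwidth value are assumptions.

```python
import torch
from sklearn.neighbors import KernelDensity

def kl_to_prior_score(vae, x):
    """Anomaly score: KL(q_phi(z|x) || N(0, I)) per sample, in closed form."""
    with torch.no_grad():
        mu, logvar = vae.encode(x)
        return 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(dim=1)

def fit_latent_density(vae, x_normal, bandwidth=0.5):
    """Fit a KDE on latent means of a held-out set of normal data."""
    with torch.no_grad():
        mu, _ = vae.encode(x_normal)
    return KernelDensity(kernel="gaussian", bandwidth=bandwidth).fit(mu.cpu().numpy())

def latent_density_score(vae, kde, x):
    """Anomaly score: negative log-density of the latent mean under the KDE."""
    with torch.no_grad():
        mu, _ = vae.encode(x)
    return -kde.score_samples(mu.cpu().numpy())  # higher => sparser region => more anomalous
```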
In a typical latent-space plot, normal data cluster within or near the high-density region of the latent prior, while anomalous points are encoded far from this central mass or into sparser areas of the latent space.
The ELBO itself, $\mathcal{L}(x; \theta, \phi)$, is a lower bound on (and a common approximation of) the log marginal likelihood $\log p_\theta(x)$.
$$\mathcal{L}(x; \theta, \phi) = \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - D_{\text{KL}}(q_\phi(z \mid x) \,\|\, p_\theta(z))$$

A lower ELBO value could indicate an OOD sample. However, relying solely on the ELBO can sometimes be misleading. Highly flexible decoders might assign high likelihood (and thus a high ELBO) to simple OOD samples that are easy to reconstruct, even if they are semantically different from the training data. Some research suggests that the typical set of $p(x)$ (where VAEs perform well) and the high-density regions of $p(x)$ might not align perfectly for complex data distributions, which can complicate OOD detection using raw likelihoods.
Combining metrics, such as a weighted sum of the reconstruction error (the negative expected log-likelihood) and the KL divergence, often yields more robust anomaly scores:
$$\text{AnomalyScore}(x) = \alpha \cdot \left(-\mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right]\right) + (1 - \alpha) \cdot D_{\text{KL}}(q_\phi(z \mid x) \,\|\, p_\theta(z))$$

The hyperparameter $\alpha$ balances the two terms and usually requires tuning.
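A minimal sketch of this combined score, reusing the hypothetical `vae` interface from the earlier snippets, is shown below. It assumes a Gaussian likelihood with unit variance, so the reconstruction negative log-likelihood reduces to a scaled squared error (constants dropped); with $\alpha = 0.5$ the score is proportional to the negative ELBO.

```python
def combined_anomaly_score(vae, x, alpha=0.5):
    """Weighted sum of reconstruction NLL and KL divergence, per sample."""
    with torch.no_grad():
        mu, logvar = vae.encode(x)
        x_hat = vae.decode(mu)
        # -log p_theta(x|z) under a unit-variance Gaussian, up to a constant
        recon_nll = 0.5 * ((x - x_hat) ** 2).sum(dim=1)
        # Closed-form KL(q_phi(z|x) || N(0, I))
        kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(dim=1)
        return alpha * recon_nll + (1.0 - alpha) * kl
```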
Once an anomaly scoring mechanism is in place (e.g., reconstruction error, latent KL divergence, or a combined score), a threshold must be determined; any sample with a score exceeding this threshold is flagged as an anomaly. Common strategies for threshold selection include taking a high percentile (for example, the 95th or 99th) of the scores on a held-out set of normal data, or, when a small number of labeled anomalies is available, choosing the threshold that maximizes a metric such as the F1 score or a desired precision-recall trade-off.
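As an illustration of the percentile strategy, the short sketch below (using NumPy; the function name and quantile value are assumptions) derives a threshold from scores computed on held-out normal data.

```python
import numpy as np

def pick_threshold(val_scores_normal, quantile=0.99):
    """Set the threshold so roughly (1 - quantile) of normal samples are flagged."""
    return np.quantile(val_scores_normal, quantile)

# Usage with any of the scores defined above (variable names assumed):
# threshold = pick_threshold(val_scores, quantile=0.99)
# is_anomaly = test_scores > threshold
```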
While VAEs offer a principled approach to anomaly detection, several factors can influence their performance: the capacity of the decoder (overly flexible decoders can reconstruct even anomalous inputs well), the dimensionality and regularization of the latent space, the choice of reconstruction likelihood, and contamination of the training set with unlabeled anomalies.
More advanced VAE-based anomaly detection might involve tighter likelihood estimates such as importance-weighted bounds, likelihood-ratio scores, or ensembles of models.
In summary, VAEs provide a versatile toolkit for anomaly and OOD detection by learning a model of normal data. The choice of anomaly score, whether based on reconstruction error, latent-space properties, or the ELBO itself, together with careful model design and threshold selection, largely determines how effective the approach is in practice. Understanding these nuances allows engineers and researchers to adapt VAEs to a wide range of detection scenarios.