Information theory provides a powerful quantitative framework for understanding what constitutes an effective representation. This branch of mathematics allows us to measure uncertainty and information content, offering precise tools to analyze and design representation learning algorithms, including the VAEs that are central to this course. Understanding these principles will help clarify why certain objective functions are used and how we can evaluate the quality of learned representations.
At the core of information theory are a few fundamental quantities that help us reason about data and models.
Entropy, denoted as $H(X)$ for a random variable $X$, measures the average amount of uncertainty or "surprise" associated with the outcomes of $X$. For a discrete random variable with probability mass function $p(x)$, its entropy is:

$$H(X) = -\sum_{x} p(x) \log p(x)$$
The logarithm is typically base 2, in which case entropy is measured in bits. A distribution that is sharply peaked (i.e., one outcome is very likely) has low entropy, while a uniform distribution (all outcomes equally likely) has maximum entropy for a given number of states. In representation learning, entropy can characterize the diversity or complexity of data features or latent variables.
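To make the definition concrete, here is a minimal NumPy sketch (the `entropy_bits` helper is illustrative, not a library function) that computes entropy in bits and contrasts a peaked distribution with a uniform one:

```python
import numpy as np

def entropy_bits(p, eps=1e-12):
    """Shannon entropy in bits of a discrete distribution p."""
    p = np.asarray(p, dtype=float)
    p = p[p > eps]              # zero-probability outcomes contribute nothing
    return -np.sum(p * np.log2(p))

# A sharply peaked distribution has low entropy ...
print(entropy_bits([0.97, 0.01, 0.01, 0.01]))   # ~0.24 bits
# ... while a uniform one attains the maximum log2(4) = 2 bits.
print(entropy_bits([0.25, 0.25, 0.25, 0.25]))   # 2.0 bits
```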
Mutual Information (MI) is a measure of the amount of information that one random variable contains about another. For two random variables $X$ and $Z$, their mutual information $I(X; Z)$ quantifies the reduction in uncertainty about $X$ that results from knowing $Z$, or vice versa. It's defined as:

$$I(X; Z) = H(X) - H(X \mid Z)$$
where $H(X \mid Z)$ is the conditional entropy of $X$ given $Z$. MI can also be expressed using KL divergence (discussed next):

$$I(X; Z) = D_{KL}\big(p(x, z) \,\|\, p(x)\,p(z)\big)$$
This shows that MI measures the dependency between $X$ and $Z$: it is the divergence of the joint distribution from the product of the marginals. If $X$ and $Z$ are independent, $I(X; Z) = 0$.
In representation learning, we are often interested in a latent representation $Z$ that captures significant information about the input $X$. Thus, a high $I(X; Z)$ is generally desirable. For instance, an encoder in an autoencoder aims to produce a $Z$ that retains as much information as possible about $X$ to allow for accurate reconstruction. MI is also a foundation for understanding and promoting disentanglement, where we might want different components of a latent vector to be informative about distinct, independent factors of variation in the data, implying low $I(Z_i; Z_j)$ for $i \neq j$.
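As a sketch of how $I(X; Z)$ behaves, the following toy computes MI in bits directly from a discrete joint probability table (the `mutual_information_bits` helper is illustrative):

```python
import numpy as np

def mutual_information_bits(joint, eps=1e-12):
    """I(X;Z) in bits from a joint probability table p(x, z)."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)   # marginal p(x)
    pz = joint.sum(axis=0, keepdims=True)   # marginal p(z)
    mask = joint > eps
    return np.sum(joint[mask] * np.log2(joint[mask] / (px * pz)[mask]))

# Perfectly correlated binary variables: I(X;Z) = H(X) = 1 bit.
print(mutual_information_bits([[0.5, 0.0], [0.0, 0.5]]))      # 1.0
# Independent variables: the joint factorizes, so I(X;Z) = 0.
print(mutual_information_bits([[0.25, 0.25], [0.25, 0.25]]))  # 0.0
```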
The Kullback-Leibler (KL) divergence, or relative entropy, measures how one probability distribution $P$ diverges from a second, reference probability distribution $Q$. For discrete distributions $P$ and $Q$ defined over the same probability space $\mathcal{X}$, it's given by:

$$D_{KL}(P \,\|\, Q) = \sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)}$$
For continuous distributions, the sum is replaced by an integral. Important properties of KL divergence include:

- Non-negativity: $D_{KL}(P \,\|\, Q) \geq 0$, with equality if and only if $P = Q$ (Gibbs' inequality).
- Asymmetry: in general, $D_{KL}(P \,\|\, Q) \neq D_{KL}(Q \,\|\, P)$, so KL divergence is not a true distance metric.
- Unboundedness: $D_{KL}(P \,\|\, Q)$ can be infinite if $Q$ assigns zero probability to an outcome that $P$ does not.
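A short sketch illustrates the first two properties numerically (`kl_divergence_bits` is an illustrative helper; `scipy.stats.entropy` offers an equivalent computation in nats):

```python
import numpy as np

def kl_divergence_bits(p, q, eps=1e-12):
    """D_KL(P || Q) in bits for discrete distributions on a shared support."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > eps              # terms with P(x) = 0 contribute nothing
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

p = [0.7, 0.2, 0.1]
q = [0.5, 0.4, 0.1]
print(kl_divergence_bits(p, p))  # 0.0: the divergence vanishes iff P = Q
print(kl_divergence_bits(p, q))  # ~0.14 bits
print(kl_divergence_bits(q, p))  # ~0.16 bits: asymmetric, not a true metric
```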
In VAEs, as we'll see in detail in Chapter 2, the KL divergence plays a significant role. It often appears in the VAE objective function as a regularization term, encouraging the learned distribution of latent variables (the approximate posterior) to be close to a chosen prior distribution (e.g., a standard normal distribution). This regularization is important for ensuring that the latent space has good properties for generation.
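For the common setup of a diagonal Gaussian posterior $\mathcal{N}(\mu, \operatorname{diag}(\sigma^2))$ against a standard normal prior, this KL term has a well-known closed form, $\frac{1}{2}\sum_j \big(\mu_j^2 + \sigma_j^2 - 1 - \log \sigma_j^2\big)$. A minimal NumPy sketch (the function name is illustrative):

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(N(mu, diag(exp(log_var))) || N(0, I)), in nats.
    This is the regularization term in the standard VAE objective."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

# A posterior that matches the prior incurs no penalty ...
print(kl_to_standard_normal(np.zeros(8), np.zeros(8)))       # 0.0
# ... while one shifted away from the prior is penalized.
print(kl_to_standard_normal(np.full(8, 2.0), np.zeros(8)))   # 16.0
```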
The Information Bottleneck (IB) principle provides a formal framework for learning representations that are both compressed and informative. Given an input variable $X$ and a target variable $Y$ (which could be class labels in a supervised task, or $X$ itself for reconstruction), the goal is to learn a stochastic mapping to a representation $Z$, $p(z \mid x)$, that acts as a "bottleneck." This $Z$ should be maximally informative about $Y$ while being minimally informative about $X$.
This trade-off is formalized by the objective:

$$\mathcal{L}_{IB} = I(Z; Y) - \beta \, I(X; Z)$$
We aim to maximize this Lagrangian, where $\beta$ is a Lagrange multiplier that controls the trade-off between the informativeness of $Z$ about $Y$ and the compression of $X$ into $Z$.
Figure: The Information Bottleneck framework. The representation $Z$ is learned to be a compressed version of input $X$ while retaining information relevant to a target $Y$.
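To make the trade-off concrete, here is a minimal discrete toy (the helpers `mi_bits` and `ib_objective` are illustrative) that evaluates the IB Lagrangian for a noisy binary encoder:

```python
import numpy as np

def mi_bits(joint, eps=1e-12):
    """MI in bits between the two axes of a joint probability table."""
    joint = np.asarray(joint, dtype=float)
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    mask = joint > eps
    return np.sum(joint[mask] * np.log2(joint[mask] / (pa * pb)[mask]))

def ib_objective(joint_zy, joint_xz, beta):
    """The IB Lagrangian L = I(Z;Y) - beta * I(X;Z) for discrete toys."""
    return mi_bits(joint_zy) - beta * mi_bits(joint_xz)

# A noisy binary encoder: Z copies X with probability 0.9, X ~ Uniform{0,1}.
p_xz = np.array([[0.45, 0.05],
                 [0.05, 0.45]])
# With Y = X (reconstruction), I(Z;Y) = I(X;Z), so we reuse the same table.
print(ib_objective(p_xz, p_xz, beta=0.5))  # (1 - 0.5) * I(X;Z) ≈ 0.27 bits
```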
The IB principle is highly relevant to VAEs. While not always explicitly framed this way, the VAE objective encourages learning a compressed latent representation (via the KL divergence term, which can be related to $I(X; Z)$ under certain conditions) that is sufficient for reconstructing $X$ (which relates to $I(Z; Y)$ where $Y = X$). Understanding IB helps motivate the structure of VAE objectives and the properties desired in learned latent spaces.
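The parallel can be sketched in code. The following hypothetical per-example loss (names are illustrative) is a $\beta$-weighted VAE objective in the style of $\beta$-VAE, where a larger $\beta$ pushes toward stronger compression, mirroring the IB trade-off:

```python
import numpy as np

def beta_vae_loss(x, x_recon, mu, log_var, beta=1.0):
    """Sketch of a beta-weighted VAE loss (to be minimized):
    reconstruction error plus beta times the KL regularizer."""
    # Squared error corresponds to a Gaussian likelihood, up to constants.
    recon = np.sum((x - x_recon) ** 2)
    # Same closed-form KL to a standard normal prior as above.
    kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)
    return recon + beta * kl

x = np.array([1.0, 0.0, 1.0])
print(beta_vae_loss(x, x * 0.9, np.zeros(4), np.zeros(4), beta=4.0))  # 0.02
```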
As hinted, information-theoretic quantities are not just analytical tools; they are deeply embedded in the mechanics of VAEs.
Beyond guiding the learning process, information theory provides tools for evaluating the quality of learned representations. For example, mutual information can be used to assess:

- Informativeness: how much information about the input $X$ (or a target $Y$) a representation $Z$ retains, via estimates of $I(X; Z)$.
- Disentanglement: whether individual latent dimensions correspond to distinct ground-truth factors of variation, for example by measuring the MI between each latent dimension and each factor.
In summary, information theory offers a precise language and a set of tools to analyze the flow of information in probabilistic models like VAEs. It helps us understand what it means for a representation to be "good" (e.g., informative, compressed, disentangled) and provides mechanisms to build these properties into our models. This foundation will be valuable as we move into the mathematical specifics of VAEs and their advanced variants.