As we continue to build our understanding of what constitutes an effective representation, information theory provides a powerful quantitative framework. This branch of mathematics allows us to measure uncertainty and information content, offering precise tools to analyze and design representation learning algorithms, including the VAEs that are central to this course. Understanding these principles will help clarify why certain objective functions are used and how we can evaluate the quality of learned representations.
At the heart of information theory are a few fundamental quantities that help us reason about data and models.
Entropy, denoted as H(X) for a random variable X, measures the average amount of uncertainty or "surprise" associated with the outcomes of X. For a discrete random variable X with probability mass function P(x), its entropy is:
$$H(X) = -\sum_{x \in \mathcal{X}} P(x) \log_2 P(x)$$

The logarithm is typically base 2, in which case entropy is measured in bits. A distribution that is sharply peaked (i.e., one outcome is very likely) has low entropy, while a uniform distribution (all outcomes equally likely) has maximum entropy for a given number of states. In representation learning, entropy can characterize the diversity or complexity of data features or latent variables.
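To make this concrete, here is a minimal sketch in Python (using NumPy; the function name and example distributions are chosen purely for illustration) that computes the entropy of a discrete distribution and contrasts a sharply peaked distribution with a uniform one.

```python
import numpy as np

def entropy(p, base=2):
    """Shannon entropy of a discrete distribution given as a probability vector."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # treat 0 * log 0 as 0
    return -np.sum(p * np.log(p)) / np.log(base)

peaked  = [0.97, 0.01, 0.01, 0.01]     # one outcome dominates -> low entropy
uniform = [0.25, 0.25, 0.25, 0.25]     # all outcomes equally likely -> maximum entropy

print(entropy(peaked))    # ~0.24 bits
print(entropy(uniform))   # 2.0 bits = log2(4), the maximum for 4 states
```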
Mutual Information (MI) is a measure of the amount of information that one random variable contains about another. For two random variables X and Z, their mutual information I(X;Z) quantifies the reduction in uncertainty about X that results from knowing Z, or vice versa. It's defined as:
$$I(X;Z) = H(X) - H(X \mid Z) = H(Z) - H(Z \mid X)$$

where H(X∣Z) is the conditional entropy of X given Z. MI can also be expressed using the KL divergence (discussed next):
$$I(X;Z) = D_{\mathrm{KL}}\big(P(x,z) \,\|\, P(x)P(z)\big)$$

This shows that MI measures the dependency between X and Z. If X and Z are independent, I(X;Z)=0.
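As a quick check of these two equivalent definitions, the following sketch (NumPy, with a hand-picked joint distribution that is purely illustrative) computes I(X;Z) both as H(X) − H(X∣Z) and as the KL divergence between the joint and the product of marginals, and confirms that an independent joint gives zero.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mi_from_entropies(joint):
    """I(X;Z) = H(X) - H(X|Z), in bits, for a joint probability table P(x, z)."""
    px, pz = joint.sum(axis=1), joint.sum(axis=0)
    h_x_given_z = sum(pz[j] * entropy(joint[:, j] / pz[j]) for j in range(len(pz)))
    return entropy(px) - h_x_given_z

def mi_from_kl(joint):
    """I(X;Z) = D_KL( P(x,z) || P(x)P(z) ), in bits."""
    prod = np.outer(joint.sum(axis=1), joint.sum(axis=0))
    mask = joint > 0
    return np.sum(joint[mask] * np.log2(joint[mask] / prod[mask]))

joint = np.array([[0.4, 0.1],          # X and Z tend to agree -> dependent
                  [0.1, 0.4]])
print(mi_from_entropies(joint), mi_from_kl(joint))   # both ~0.278 bits

independent = np.outer([0.5, 0.5], [0.8, 0.2])       # P(x,z) = P(x)P(z)
print(mi_from_kl(independent))                        # 0.0
```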
In representation learning, we are often interested in a latent representation Z that captures significant information about the input X. Thus, a high I(X;Z) is generally desirable. For instance, an encoder in an autoencoder aims to produce a Z that retains as much information as possible about X to allow for accurate reconstruction. MI is also a cornerstone for understanding and promoting disentanglement, where we might want different components of a latent vector Z=(Z1,...,Zd) to be informative about distinct, independent factors of variation in the data, implying low I(Zi;Zj) for i=j.
The Kullback-Leibler (KL) divergence, or relative entropy, measures how one probability distribution P diverges from a second, expected probability distribution Q. For discrete distributions P and Q defined over the same probability space X, it's given by:
$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)}$$

For continuous distributions, the sum is replaced by an integral. Key properties of the KL divergence include:

- Non-negativity: DKL(P∣∣Q) ≥ 0, with equality if and only if P = Q.
- Asymmetry: in general DKL(P∣∣Q) ≠ DKL(Q∣∣P), so the KL divergence is not a true distance metric.
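A small sketch, again assuming NumPy and illustrative distributions, makes these properties visible: the divergence is non-negative, vanishes only when the distributions match, and changes when the arguments are swapped.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q), in nats, assuming q > 0 wherever p > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([0.9, 0.05, 0.05])        # sharply peaked distribution
q = np.array([1/3, 1/3, 1/3])          # uniform distribution

print(kl_divergence(p, q))   # ~0.70 nats, non-negative
print(kl_divergence(q, p))   # ~0.93 nats: swapping arguments changes the value
print(kl_divergence(p, p))   # 0.0: the divergence vanishes only when P = Q
```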
In VAEs, as we'll see in detail in Chapter 2, the KL divergence plays a significant role. It often appears in the VAE objective function as a regularization term, encouraging the learned distribution of latent variables q(z∣x) (the approximate posterior) to be close to a chosen prior distribution p(z) (e.g., a standard normal distribution). This regularization helps ensure the latent space is smooth and well structured, so that samples drawn from the prior decode to plausible data.
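For the common case where q(z∣x) is a diagonal Gaussian and p(z) is a standard normal, this KL term has a well-known closed form. The sketch below (PyTorch; tensor shapes and names are illustrative) computes it from the mean and log-variance that a typical VAE encoder would output.

```python
import torch

def kl_to_standard_normal(mu, logvar):
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ),
    summed over latent dimensions and averaged over the batch."""
    kl_per_dim = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1.0)
    return kl_per_dim.sum(dim=1).mean()

# Hypothetical encoder outputs: batch of 8 inputs, 16-dimensional latent space
mu = 0.1 * torch.randn(8, 16)
logvar = torch.zeros(8, 16)            # i.e. sigma^2 = 1 in every dimension

print(kl_to_standard_normal(mu, logvar))   # small value: q(z|x) is close to the prior
```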
The Information Bottleneck (IB) principle provides a formal framework for learning representations that are both compressed and informative. Given an input variable X and a target variable Y (which could be class labels in a supervised task, or X itself for reconstruction), the goal is to learn a stochastic mapping to a representation Z, p(z∣x), that acts as a "bottleneck." This Z should be as informative as possible about Y while retaining as little information about X as possible, i.e., while being maximally compressed.
This trade-off is formalized by the objective:
$$\mathcal{L}_{IB} = I(Z;Y) - \beta I(X;Z)$$

We aim to maximize this Lagrangian, where β is a Lagrange multiplier that controls the trade-off between the informativeness of Z about Y and the compression of X into Z.
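The MI terms in this objective are generally intractable to compute exactly, so practical methods optimize surrogates. One hedged sketch, in the spirit of the variational information bottleneck, replaces I(Z;Y) with a cross-entropy term and bounds I(X;Z) with a KL term to a prior; all tensor names and shapes below are illustrative rather than taken from any specific library.

```python
import torch
import torch.nn.functional as F

def ib_style_loss(logits, labels, mu, logvar, beta):
    """Variational surrogate for the (negated) IB Lagrangian.

    - The cross-entropy term stands in for -I(Z;Y) (up to a constant):
      the lower it is, the more informative Z is about Y.
    - The KL term to a standard normal prior upper-bounds I(X;Z),
      penalising representations that keep too much information about X.
    """
    informativeness = F.cross_entropy(logits, labels)
    compression = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1.0).sum(dim=1).mean()
    return informativeness + beta * compression

# Illustrative shapes: batch of 8, 16-dim latent code Z, 10 classes for Y
mu, logvar = torch.randn(8, 16), torch.zeros(8, 16)
logits = torch.randn(8, 10)            # in a real model: classifier(sampled z)
labels = torch.randint(0, 10, (8,))

print(ib_style_loss(logits, labels, mu, logvar, beta=1e-3))
```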
The Information Bottleneck framework. The representation Z is learned to be a compressed version of input X while retaining information relevant to a target Y.
The IB principle is highly relevant to VAEs. While not always explicitly framed this way, the VAE objective encourages learning a compressed latent representation Z (via the KL divergence term, which under certain conditions upper-bounds I(X;Z)) that is still sufficient for reconstructing X (the reconstruction term playing the role of I(Z;Y) with Y = X). Understanding IB helps motivate the structure of VAE objectives and the properties desired in learned latent spaces.
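To see the connection concretely, here is a minimal sketch (PyTorch; names are illustrative, and a Bernoulli/binary cross-entropy reconstruction term is assumed) of a β-weighted VAE loss, where the reconstruction term plays the role of I(Z;Y) with Y = X and the KL term acts as the compression penalty.

```python
import torch
import torch.nn.functional as F

def beta_vae_style_loss(x, x_recon, mu, logvar, beta=1.0):
    """Reconstruction + beta * KL: the VAE analogue of the IB trade-off with Y = X."""
    recon = F.binary_cross_entropy(x_recon, x, reduction="sum") / x.size(0)
    kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1.0).sum(dim=1).mean()
    return recon + beta * kl

# Illustrative tensors: a batch of 8 flattened 28x28 inputs, 16-dim latent space
x = torch.rand(8, 784)                          # inputs scaled to [0, 1]
x_recon = torch.sigmoid(torch.randn(8, 784))    # stand-in for decoder(sampled z)
mu, logvar = torch.randn(8, 16), torch.zeros(8, 16)

print(beta_vae_style_loss(x, x_recon, mu, logvar, beta=4.0))
```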
As hinted, information-theoretic quantities are not just analytical tools; they are deeply embedded in the mechanics of VAEs.
Beyond guiding the learning process, information theory provides tools for evaluating the quality of learned representations. For example, mutual information can be used to assess:

- Informativeness: how much information a latent representation Z retains about the input X or about known factors of variation in the data.
- Disentanglement: whether individual latent dimensions Zi each capture a distinct factor of variation, with low mutual information between different dimensions.

A simple plug-in estimate of this kind is sketched below.
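As one possible sketch of such an evaluation, the snippet below uses scikit-learn's mutual_info_score on discretized latent codes against a known ground-truth factor; the data is synthetic and the setup purely illustrative.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)

# Synthetic ground-truth factor of variation with 4 discrete values
factor = rng.integers(0, 4, size=5000)

# Two hypothetical latent dimensions: z0 tracks the factor (plus noise), z1 is unrelated
z0 = factor + rng.normal(0.0, 0.3, size=5000)
z1 = rng.normal(0.0, 1.0, size=5000)

def discretize(z, bins=20):
    """Bin a continuous latent dimension so a plug-in MI estimate can be used."""
    return np.digitize(z, np.histogram_bin_edges(z, bins=bins))

print(mutual_info_score(factor, discretize(z0)))   # clearly > 0: informative dimension
print(mutual_info_score(factor, discretize(z1)))   # close to 0: uninformative dimension
```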
In summary, information theory offers a precise language and a set of tools to analyze the flow of information in probabilistic models like VAEs. It helps us understand what it means for a representation to be "good" (e.g., informative, compressed, disentangled) and provides mechanisms to build these properties into our models. This foundation will be valuable as we move into the mathematical specifics of VAEs and their advanced variants.