As we move towards building autoencoders, which are neural network architectures trained through optimization, a solid grasp of certain mathematical concepts becomes indispensable. These tools form the language we use to describe how autoencoders learn representations, measure reconstruction quality, and manage the information flow within the network. This section revisits key ideas from probability, information theory, and optimization, framing them specifically for their application in the chapters ahead.
Autoencoders, especially variational autoencoders (VAEs), often operate within a probabilistic framework. Understanding basic probability helps interpret their behavior and design choices.
Random Variables and Distributions: Data points can often be viewed as realizations of random variables. An input image x, for instance, can be considered a sample from a high-dimensional data distribution $P_{\text{data}}(x)$. Autoencoders learn functions that transform these variables. We frequently encounter continuous random variables described by Probability Density Functions (PDFs), like the Gaussian (Normal) distribution, characterized by its mean μ and variance σ². Its PDF is given by:
$$ p(x;\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) $$

Gaussian distributions are fundamental for VAEs, often serving as the prior distribution for latent variables and the form of the approximate posterior distribution learned by the encoder. Discrete random variables, described by Probability Mass Functions (PMFs), are also relevant, especially when dealing with binary data (Bernoulli distribution) or categorical outputs.
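As a concrete illustration, here is a small NumPy sketch that evaluates the Gaussian density above and draws samples from it. The `gaussian_pdf` helper is our own illustrative function, not a library routine.

```python
import numpy as np

def gaussian_pdf(x, mu=0.0, sigma2=1.0):
    """Evaluate the univariate Gaussian density p(x; mu, sigma^2)."""
    return np.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)

# Density of a standard normal at a few points
xs = np.array([-1.0, 0.0, 1.0])
print(gaussian_pdf(xs))            # approximately [0.242, 0.399, 0.242]

# Draw samples from N(mu, sigma^2); VAEs sample their latent variables in a similar way
mu, sigma2 = 0.0, 1.0
samples = mu + np.sqrt(sigma2) * np.random.randn(5)
print(samples)
```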
Expectation: The expected value, or mean, E[X] of a random variable X represents its average value, weighted by its probability. For a discrete variable X with PMF P(x), $E[X] = \sum_x x\,P(x)$. For a continuous variable with PDF p(x), $E[X] = \int x\,p(x)\,dx$. Expectation is used extensively, for example, in defining loss functions like Mean Squared Error or in the sampling process within VAEs.
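In practice, expectations that lack a closed form are often approximated by averaging over samples. A minimal Monte Carlo sketch, assuming X follows a standard normal distribution:

```python
import numpy as np

# Monte Carlo estimate of E[X^2] for X ~ N(0, 1); the true value is 1
samples = np.random.randn(100_000)
estimate = np.mean(samples ** 2)   # approaches 1 as the number of samples grows
print(estimate)
```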
Bayes' Theorem: This theorem describes the probability of an event based on prior knowledge of conditions related to the event. It states:
$$ P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} $$

In the context of probabilistic models like VAEs, we are often interested in inferring latent variables z given observed data x, i.e., calculating the posterior probability P(z∣x). Bayes' theorem provides the theoretical basis for this, relating it to the likelihood P(x∣z) (how likely the data is given the latent variable, often modeled by the decoder), the prior P(z) (our belief about the latent variable before seeing data), and the evidence P(x) (the overall probability of the data). Calculating P(x) is often intractable, leading to the variational inference techniques used in VAEs.
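A tiny numeric example of the theorem with made-up probabilities, computing the evidence via the law of total probability:

```python
# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B), with illustrative numbers
p_a = 0.01               # prior P(A)
p_b_given_a = 0.9        # likelihood P(B|A)
p_b_given_not_a = 0.05   # likelihood P(B|not A)

p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)   # evidence P(B)
p_a_given_b = p_b_given_a * p_a / p_b                   # posterior P(A|B)
print(p_a_given_b)       # about 0.15: a strong likelihood combined with a small prior still gives a modest posterior
```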
Information theory provides tools to quantify information and uncertainty, which are central to representation learning and loss function design.
Entropy: Shannon entropy H(X) measures the average level of "information", "surprise", or "uncertainty" inherent in a random variable's possible outcomes. For a discrete random variable X with PMF P(x), it's defined as:
$$ H(X) = -\sum_x P(x) \log_b P(x) $$

The base b of the logarithm determines the units (bits for b=2, nats for b=e). Higher entropy means more uncertainty. In representation learning, we sometimes aim to create representations that minimize entropy while preserving relevant information.
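The definition translates directly into code. A short NumPy sketch with an illustrative `entropy` helper:

```python
import numpy as np

def entropy(p, base=2.0):
    """Shannon entropy of a discrete distribution p (probabilities summing to 1)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                                # treat 0 * log 0 as 0
    return -np.sum(p * np.log(p)) / np.log(base)

print(entropy([0.5, 0.5]))   # 1.0 bit: maximal uncertainty for two outcomes
print(entropy([0.9, 0.1]))   # about 0.47 bits: a peaked distribution is less uncertain
```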
Cross-Entropy: While entropy measures the uncertainty of a single distribution, cross-entropy H(P,Q) measures the average number of bits/nats needed to identify an event drawn from distribution P when coded using a scheme optimized for distribution Q.
$$ H(P, Q) = -\sum_x P(x) \log Q(x) \quad \text{(discrete)} \qquad H(P, Q) = -\int P(x) \log Q(x)\,dx \quad \text{(continuous)} $$

In machine learning, P is often the true underlying distribution (or the empirical distribution of the training data), and Q is the distribution predicted by our model. Minimizing the cross-entropy between the model's output distribution and the target distribution is a standard way to train classification models and autoencoders, particularly when the output is interpreted probabilistically (e.g., Binary Cross-Entropy for pixel reconstruction in images with values between 0 and 1).
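A minimal sketch of the binary cross-entropy case mentioned above, written as a plain NumPy function (deep learning frameworks provide their own implementations):

```python
import numpy as np

def binary_cross_entropy(target, prediction, eps=1e-12):
    """Average binary cross-entropy between targets in [0, 1] and predicted probabilities."""
    prediction = np.clip(prediction, eps, 1.0 - eps)   # avoid log(0)
    return -np.mean(target * np.log(prediction)
                    + (1.0 - target) * np.log(1.0 - prediction))

pixels = np.array([1.0, 0.0, 1.0, 1.0])          # target pixel values
reconstruction = np.array([0.9, 0.2, 0.8, 0.6])  # model's predicted probabilities
print(binary_cross_entropy(pixels, reconstruction))  # lower values indicate better reconstructions
```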
KL Divergence (Kullback-Leibler Divergence): KL divergence measures how one probability distribution P differs from a reference probability distribution Q. It's defined as the expectation of the logarithmic difference between the probabilities P and Q, where the expectation is taken using P.
$$ D_{KL}(P \,\|\, Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)} \quad \text{(discrete)} \qquad D_{KL}(P \,\|\, Q) = \int P(x) \log \frac{P(x)}{Q(x)}\,dx \quad \text{(continuous)} $$

Importantly, KL divergence is non-negative ($D_{KL}(P \,\|\, Q) \ge 0$) and equals zero if and only if P and Q are identical. It is not symmetric, meaning $D_{KL}(P \,\|\, Q) \ne D_{KL}(Q \,\|\, P)$ in general. In VAEs (Chapter 4), the KL divergence plays a crucial role as a regularization term in the loss function. It typically measures the difference between the distribution learned by the encoder for the latent variables, Q(z∣x), and a chosen prior distribution, P(z) (often a standard Gaussian). Minimizing this term encourages the encoded distributions to resemble the prior, leading to a more structured and regular latent space.
In short: $D_{KL}(P \,\|\, Q)$ quantifies how much the distribution P diverges from the reference distribution Q, and driving it toward zero makes the two distributions match. In the VAE loss, the term $D_{KL}(Q(z \mid x) \,\|\, P(z))$ is minimized with respect to the encoder, pulling Q(z∣x) toward the prior P(z).
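The sketch below computes the discrete KL divergence directly from its definition, and also the closed-form KL between a diagonal Gaussian and a standard normal, which is the form commonly used as the VAE regularization term. The helper names are ours for illustration.

```python
import numpy as np

def kl_discrete(p, q):
    """D_KL(P || Q) for discrete distributions given as probability arrays."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0                                  # terms with P(x) = 0 contribute nothing
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

print(kl_discrete([0.5, 0.5], [0.9, 0.1]))        # > 0
print(kl_discrete([0.9, 0.1], [0.5, 0.5]))        # a different value: KL is not symmetric

def kl_diag_gaussian_to_standard_normal(mu, log_var):
    """Closed-form D_KL(N(mu, sigma^2) || N(0, I)) for a diagonal Gaussian."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

print(kl_diag_gaussian_to_standard_normal(np.zeros(2), np.zeros(2)))  # 0: identical distributions
```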
Training neural networks, including autoencoders, involves finding the parameters (weights and biases) that minimize an objective function, often called a loss function.
Objective Functions (Loss Functions): The loss function quantifies the difference between the model's output and the desired target. For standard autoencoders, the primary goal is reconstruction. Common reconstruction losses include Mean Squared Error (MSE) for continuous-valued data and Binary Cross-Entropy (BCE) for data with values between 0 and 1, such as normalized pixel intensities.
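A minimal sketch of the MSE reconstruction loss (BCE was shown in the cross-entropy example above); the `mse` helper is illustrative:

```python
import numpy as np

def mse(x, x_hat):
    """Mean squared error between an input x and its reconstruction x_hat."""
    return np.mean((x - x_hat) ** 2)

x = np.array([0.2, 0.7, 0.1])          # original input
x_hat = np.array([0.25, 0.6, 0.15])    # autoencoder reconstruction
print(mse(x, x_hat))                   # small values indicate faithful reconstructions
```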
Gradient Descent: This is the workhorse optimization algorithm for deep learning. It iteratively adjusts the model parameters θ to minimize the loss L(θ). In each step, parameters are updated in the direction opposite to the gradient of the loss function with respect to the parameters:
$$ \theta_{t+1} = \theta_t - \eta \nabla_\theta L(\theta_t) $$

Here, η is the learning rate, a hyperparameter controlling the step size. $\nabla_\theta L(\theta_t)$ is the gradient (vector of partial derivatives) of the loss with respect to each parameter in θ.
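To make the update rule concrete, here is a hand-rolled gradient descent loop on a one-parameter quadratic loss; real networks have millions of parameters, but the update is the same in spirit:

```python
# Gradient descent on L(theta) = (theta - 3)^2, whose minimizer is theta = 3
def loss(theta):
    return (theta - 3.0) ** 2

def grad(theta):
    return 2.0 * (theta - 3.0)          # dL/dtheta

theta = 0.0                             # initial parameter value
eta = 0.1                               # learning rate
for step in range(50):
    theta = theta - eta * grad(theta)   # theta_{t+1} = theta_t - eta * grad L(theta_t)

print(theta, loss(theta))               # theta approaches 3.0 and the loss approaches 0
```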
Stochastic Gradient Descent (SGD) and Mini-batches: Calculating the gradient over the entire dataset can be computationally expensive. SGD approximates the gradient using only a single data point or, more commonly, a small subset called a mini-batch. This makes updates faster and introduces noise that can help escape poor local minima. Most modern training uses mini-batch gradient descent.
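A short sketch of how mini-batches are typically drawn each epoch; the array shapes are illustrative:

```python
import numpy as np

data = np.random.randn(10_000, 784)     # e.g. 10,000 flattened 28x28 inputs (placeholder data)
batch_size = 128

indices = np.random.permutation(len(data))          # shuffle once per epoch
for start in range(0, len(data), batch_size):
    batch = data[indices[start:start + batch_size]]
    # compute the loss and its gradient on `batch` only, then update the parameters
```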
Advanced Optimizers: While basic SGD works, more sophisticated algorithms often converge faster and more reliably. Optimizers like Adam, RMSprop, and Adagrad adapt the learning rate for each parameter based on the history of gradients. They are standard choices for training deep networks, including autoencoders. We will use these in practical implementations.
Backpropagation: This algorithm is the standard method for efficiently computing the gradient ∇θL in neural networks. It uses the chain rule of calculus to propagate the error gradient backward from the output layer through the network layers, calculating the gradient for each parameter. Deep learning frameworks like TensorFlow and PyTorch automate this process.
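The following minimal PyTorch sketch ties the last two ideas together: autograd computes the gradient via backpropagation when `loss.backward()` is called, and the Adam optimizer applies the parameter update. The toy one-parameter model is purely illustrative.

```python
import torch

# Fit a single parameter w so that w * x approximates y
x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([2.0, 4.0, 6.0])              # the underlying relationship is y = 2x

w = torch.tensor(0.0, requires_grad=True)      # parameter to learn
optimizer = torch.optim.Adam([w], lr=0.1)

for step in range(200):
    optimizer.zero_grad()                      # clear gradients from the previous step
    loss = torch.mean((w * x - y) ** 2)        # mean squared error
    loss.backward()                            # backpropagation fills w.grad
    optimizer.step()                           # Adam update using w.grad

print(w.item())                                # approaches 2.0
```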
A firm understanding of these mathematical pillars is essential as we proceed to design, implement, and analyze various autoencoder architectures: probability for modeling uncertainty and structure, information theory for quantifying information and defining divergences, and optimization for enabling learning. Together they provide the foundation upon which the concepts of encoding, decoding, latent spaces, and representation learning are built.