Coordinate Ascent Variational Inference (CAVI) is a technique for optimizing the Evidence Lower Bound (ELBO) in probabilistic models. A limitation of CAVI, however, is its requirement to iterate through the entire dataset to update each model parameter. For modern datasets, which can contain millions or even billions of data points, this full-dataset pass becomes a significant computational bottleneck. Imagine needing to process terabytes of data just to make a single adjustment to your model's parameters; the process would be prohibitively slow.
Stochastic Variational Inference (SVI) addresses this scalability challenge by borrowing ideas from stochastic optimization, particularly Stochastic Gradient Descent (SGD). Instead of calculating the true gradient of the ELBO using the full dataset, SVI uses noisy but computationally cheap gradients estimated from small, randomly selected subsets of the data, often called minibatches.
From Batch Gradients to Stochastic Gradients
Recall the ELBO:
$$\mathcal{L}(q) = \mathbb{E}_{q(z)}[\log p(x, z)] - \mathbb{E}_{q(z)}[\log q(z)]$$
Assuming the data points $x = \{x_1, \dots, x_N\}$ are conditionally independent given the latent variables $z$, the joint probability often decomposes: $\log p(x, z) = \log p(z) + \sum_{n=1}^{N} \log p(x_n \mid z_n)$. Let's simplify and consider models where $z$ includes global parameters $\beta$ shared across all data points, and local latent variables $z_n$ specific to each data point $x_n$. (In latent Dirichlet allocation, for example, $\beta$ contains the topics shared across the corpus while $z_n$ holds the topic assignments within document $n$.) The variational distribution $q$ also factors accordingly, often as $q(\beta, z_{1:N}) = q(\beta; \lambda) \prod_{n=1}^{N} q(z_n; \phi_n)$, where $\lambda$ and $\phi_n$ are the variational parameters for the global and local variables, respectively.
The ELBO can then be written as a sum over data points:
$$\mathcal{L}(\lambda, \phi_{1:N}) = \mathbb{E}_q[\log p(\beta)] + \sum_{n=1}^{N} \mathbb{E}_q[\log p(x_n \mid z_n, \beta)] - \mathbb{E}_q[\log q(\beta; \lambda)] - \sum_{n=1}^{N} \mathbb{E}_q[\log q(z_n; \phi_n)]$$
CAVI requires optimizing this full objective. SVI, however, focuses its stochastic updates on the global parameters $\lambda$. To update $\lambda$, we need the gradient $\nabla_\lambda \mathcal{L}$, but the summation term $\sum_{n=1}^{N} \mathbb{E}_q[\log p(x_n \mid z_n, \beta)]$ couples the gradient to every one of the $N$ data points, so computing it exactly requires a full pass over the dataset.
SVI's core idea is to approximate the full gradient using just a minibatch. Sampling a data point $x_n$ uniformly at random yields an unbiased stochastic estimate of the summation term. More generally, by sampling a minibatch $M$ of indices from $\{1, \dots, N\}$, we can form the noisy gradient estimate:
$$\nabla_\lambda \hat{\mathcal{L}}_M = \nabla_\lambda \mathbb{E}_q[\log p(\beta)] + \frac{N}{|M|} \sum_{n \in M} \nabla_\lambda \mathbb{E}_q[\log p(x_n \mid z_n, \beta)] - \nabla_\lambda \mathbb{E}_q[\log q(\beta; \lambda)]$$
The scaling factor $N/|M|$ corrects for summing over the minibatch $M$ rather than the full dataset of $N$ points: in expectation over the random choice of minibatch, the stochastic gradient equals the true gradient $\nabla_\lambda \mathcal{L}$, so the estimator is unbiased.
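To make the rescaling concrete, here is a minimal NumPy sketch. The array `per_point_grads` is a stand-in for the per-data-point gradient terms, which in a real model would come from expectations under $q$:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000
# Stand-in for the per-point terms grad_lambda E_q[log p(x_n | z_n, beta)];
# in a real model these come from the variational expectations.
per_point_grads = rng.normal(size=N)

full_sum = per_point_grads.sum()  # the expensive full-data summation

batch = rng.choice(N, size=100, replace=False)               # sample a minibatch M
noisy_sum = (N / len(batch)) * per_point_grads[batch].sum()  # rescale by N/|M|

# Averaged over many random minibatches, noisy_sum matches full_sum
# (unbiasedness); any single draw is a cheap but noisy estimate.
print(full_sum, noisy_sum)
```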
The SVI Update Algorithm
SVI proceeds iteratively, updating global and local variational parameters using these stochastic gradients. At each step $t$:
- Sample Minibatch: Randomly select a small subset (minibatch) $M_t$ of data points from the full dataset.
- Update Local Parameters: For each data point $x_n$ in the minibatch $M_t$, update its corresponding local variational parameters $\phi_n$ to optimize its contribution to the ELBO, holding the current global parameters $\lambda^{(t)}$ fixed. This might involve one or more CAVI-like updates specific to that data point:
$$\phi_n^{(t+1)} \leftarrow \arg\max_{\phi_n} \; \mathbb{E}_q[\log p(x_n \mid z_n, \beta)] - \mathbb{E}_q[\log q(z_n; \phi_n)]$$
(where expectations are taken holding $\lambda^{(t)}$ and the other $\phi_{m \neq n}$ fixed).
- Compute Stochastic Gradient: Calculate the noisy gradient $\nabla_\lambda \hat{\mathcal{L}}_{M_t}$ of the ELBO with respect to the global parameters $\lambda$, using the updated local parameters $\phi_n^{(t+1)}$ for $n \in M_t$.
- Update Global Parameters: Update the global variational parameters $\lambda$ using a gradient ascent step with a learning rate $\rho_t$:
$$\lambda^{(t+1)} \leftarrow \lambda^{(t)} + \rho_t \, \nabla_\lambda \hat{\mathcal{L}}_{M_t}$$
This process repeats, cycling through minibatches of data. The learning rate $\rho_t$ typically decreases over iterations to ensure convergence.
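As a minimal sketch of the loop structure, not a definitive implementation: `update_local` and `noisy_grad` below are hypothetical, model-specific callables supplied by the user (they are not part of any particular library), and the step size uses the polynomial decay schedule discussed in the next subsection.

```python
import numpy as np

def svi(data, lam0, update_local, noisy_grad, n_steps=1000,
        batch_size=32, tau0=1.0, kappa=0.9, seed=0):
    """Schematic SVI loop. `update_local(x_n, lam)` returns optimized local
    parameters phi_n for one data point; `noisy_grad(batch, phis, lam, N)`
    returns the rescaled stochastic ELBO gradient w.r.t. the globals."""
    rng = np.random.default_rng(seed)
    lam = np.asarray(lam0, dtype=float)
    N = len(data)
    for t in range(1, n_steps + 1):
        idx = rng.choice(N, size=batch_size, replace=False)  # 1. sample minibatch
        batch = [data[n] for n in idx]
        phis = [update_local(x_n, lam) for x_n in batch]     # 2. local updates
        grad = noisy_grad(batch, phis, lam, N)               # 3. noisy ELBO gradient
        rho = (tau0 + t) ** (-kappa)                         # Robbins-Monro step size
        lam = lam + rho * grad                               # 4. global ascent step
    return lam
```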
Figure: A view of an SVI update step. A small minibatch is sampled from the large dataset; this minibatch, along with the current global parameters $\lambda$, is used to calculate a noisy gradient, which then updates $\lambda$.
Learning Rates and Convergence
The choice of the learning rate schedule $\rho_t$ is important for SVI's performance. For the stochastic updates to converge properly, the learning rates must satisfy the Robbins-Monro conditions:
$$\sum_{t=1}^{\infty} \rho_t = \infty \qquad \text{and} \qquad \sum_{t=1}^{\infty} \rho_t^2 < \infty$$
Intuitively, the first condition lets the iterates travel arbitrarily far from a poor initialization, while the second forces the accumulated gradient noise to die out so the iterates can settle.
A common choice is a polynomial decay schedule:
$$\rho_t = (\tau_0 + t)^{-\kappa}$$
where $\kappa \in (0.5, 1]$ controls the decay rate, and $\tau_0 \geq 0$ down-weights early iterations. Tuning $\kappa$ and $\tau_0$, along with the minibatch size $|M|$, often requires experimentation: too large a learning rate can lead to instability, while too small a rate can result in slow convergence.
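A quick sketch of how the schedule behaves; the specific values of $\tau_0$ and $\kappa$ here are arbitrary illustrations, not recommendations:

```python
def rho(t, tau0=64.0, kappa=0.7):
    """Polynomial decay schedule rho_t = (tau0 + t)^(-kappa)."""
    return (tau0 + t) ** (-kappa)

for t in (1, 10, 100, 1_000, 10_000):
    print(f"t={t:>6}  rho={rho(t):.5f}")

# kappa in (0.5, 1] is exactly the admissible range: kappa <= 0.5 makes
# sum(rho_t^2) diverge, while kappa > 1 makes sum(rho_t) converge,
# violating one Robbins-Monro condition in each case.
```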
Natural Gradients for Faster Convergence
Standard gradient ascent updates the parameters $\lambda$ in the direction of steepest ascent under the Euclidean geometry of the parameter space. But variational parameters define probability distributions, and Euclidean distance between parameter vectors is a poor proxy for the distance between the distributions they index. The space of distributions has its own geometry, captured by the Fisher information matrix $F(\lambda)$.
Natural gradients modify the update direction by pre-multiplying the standard gradient with the inverse of the Fisher information matrix:
$$\tilde{\nabla}_\lambda \mathcal{L} = F(\lambda)^{-1} \nabla_\lambda \mathcal{L}$$
The SVI update rule becomes:
$$\lambda^{(t+1)} \leftarrow \lambda^{(t)} + \rho_t \, \tilde{\nabla}_\lambda \hat{\mathcal{L}}_{M_t}$$
For variational distributions belonging to the exponential family (a common choice), the natural gradient often has a simpler closed form than the standard gradient, sidestepping any explicit inversion of $F(\lambda)$, and it typically converges significantly faster because it accounts for the information geometry of the variational parameter space.
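As a toy illustration of the pre-multiplication, consider a single Gaussian variational factor $q(\beta) = \mathcal{N}(m, v)$ parameterized by its mean and variance. Its Fisher information is diagonal, $F = \mathrm{diag}(1/v,\, 1/(2v^2))$, so the natural gradient just rescales each coordinate; the helper below is purely illustrative, not a general-purpose routine:

```python
import numpy as np

def natural_grad_gaussian(grad_m, grad_v, v):
    """Natural gradient for q(beta) = N(m, v) in the (mean, variance)
    parameterization: F(m, v) = diag(1/v, 1/(2 v^2)), so
    F^{-1} grad = (v * grad_m, 2 v^2 * grad_v)."""
    F_inv = np.diag([v, 2.0 * v**2])
    return F_inv @ np.array([grad_m, grad_v])

# When v is small, a given Euclidean step in the mean moves the distribution
# a lot in KL terms; the factor v in F^{-1} shrinks the step accordingly,
# while a large v permits correspondingly bigger steps.
```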
Advantages and Considerations
Advantages:
- Scalability: SVI's primary advantage is its ability to handle massive datasets that do not fit into memory, as it only processes small minibatches at each step.
- Speed: It often converges much faster in terms of wall-clock time compared to batch VI (CAVI) or MCMC methods, especially early in the optimization process.
- Online Learning: SVI can naturally incorporate new data points as they arrive without needing to retrain on the entire dataset.
Considerations:
- Tuning: Requires careful tuning of the learning rate schedule and minibatch size.
- Noisy Gradients: The stochastic nature of the gradients introduces noise, which can slow down convergence in later stages compared to batch methods or lead to oscillations around the optimum.
- Approximation Quality: Like all VI methods, SVI finds an approximation within the chosen variational family, which might not perfectly capture the true posterior. The quality depends on the flexibility of the family q.
SVI provides a powerful tool for applying Bayesian inference to large-scale problems where traditional methods become computationally infeasible. It forms the backbone of many modern probabilistic modeling applications, particularly in areas like topic modeling (e.g., Latent Dirichlet Allocation on large text corpora) and is a precursor to techniques used in Bayesian deep learning. While MCMC methods might offer asymptotically exact samples, and CAVI provides deterministic updates, SVI strikes a practical balance, enabling approximate Bayesian inference at scale by leveraging the power of stochastic optimization.