While Coordinate Ascent Variational Inference (CAVI) provides a systematic way to optimize the Evidence Lower Bound (ELBO), it requires iterating through the entire dataset to update each parameter. For modern datasets, which can contain millions or even billions of data points, this full-dataset pass becomes a significant computational bottleneck. Imagine needing to process terabytes of data just to make a single adjustment to your model's parameters; the process would be prohibitively slow.
Stochastic Variational Inference (SVI) addresses this scalability challenge by borrowing ideas from stochastic optimization, particularly Stochastic Gradient Descent (SGD). Instead of calculating the true gradient of the ELBO using the full dataset, SVI uses noisy but computationally cheap gradients estimated from small, randomly selected subsets of the data, often called minibatches.
Recall the ELBO:
$$\mathcal{L}(q) = \mathbb{E}_{q(z)}[\log p(x, z)] - \mathbb{E}_{q(z)}[\log q(z)]$$
Assuming the data points $x = \{x_1, \dots, x_N\}$ are conditionally independent given the latent variables $z$, the joint log probability often decomposes as
$$\log p(x, z) = \log p(z) + \sum_{n=1}^{N} \log p(x_n \mid z_n).$$
Let's simplify and consider models where $z$ consists of global parameters $\beta$ shared across all data points and local latent variables $z_n$ specific to each data point $x_n$. In Latent Dirichlet Allocation, for example, $\beta$ would be the topic-word distributions shared across the corpus, while $z_n$ would be the topic assignments for document $n$. The variational distribution $q$ factors accordingly, often as
$$q(\beta, z_{1:N}) = q(\beta; \lambda) \prod_{n=1}^{N} q(z_n; \phi_n),$$
where $\lambda$ and $\phi_n$ are the variational parameters for the global and local variables, respectively.
The ELBO can then be written as a sum over data points:
$$\mathcal{L}(\lambda, \phi_{1:N}) = \mathbb{E}_q[\log p(\beta)] + \sum_{n=1}^{N} \mathbb{E}_q[\log p(x_n, z_n \mid \beta)] - \mathbb{E}_q[\log q(\beta; \lambda)] - \sum_{n=1}^{N} \mathbb{E}_q[\log q(z_n; \phi_n)].$$
CAVI requires optimizing this full objective. SVI, however, focuses on the global parameters $\lambda$. To update $\lambda$, we need the gradient $\nabla_\lambda \mathcal{L}$, and the summation term $\sum_{n=1}^{N} \mathbb{E}_q[\log p(x_n, z_n \mid \beta)]$ cannot be computed without touching every data point.
SVI's core idea is to approximate the full gradient using just a minibatch. Sampling even a single data point $x_n$ uniformly at random already gives an unbiased stochastic estimate of the summation term. In practice, we sample a minibatch $M$ of indices from $\{1, \dots, N\}$ and form the noisy gradient estimate
$$\nabla_\lambda \hat{\mathcal{L}}_M = \nabla_\lambda \mathbb{E}_q[\log p(\beta)] + \frac{N}{|M|} \sum_{n \in M} \nabla_\lambda \mathbb{E}_q[\log p(x_n, z_n \mid \beta)] - \nabla_\lambda \mathbb{E}_q[\log q(\beta; \lambda)].$$
The scaling factor $N/|M|$ corrects for summing over the smaller set $M$ instead of all $N$ data points: in expectation over the random choice of minibatch, the stochastic gradient equals the true gradient $\nabla_\lambda \mathcal{L}$, so it is an unbiased estimator.
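The role of the $N/|M|$ scaling factor can be checked numerically. The sketch below uses random numbers as stand-ins for the per-data-point gradient contributions (nothing model-specific is assumed): averaged over many minibatches, the scaled minibatch sum matches the full-data sum.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the per-data-point gradient terms
# grad_n = ∇_λ E_q[log p(x_n, z_n | β)]; here just random numbers for illustration.
N = 10_000
per_point_grads = rng.normal(size=N)
full_sum = per_point_grads.sum()          # the term that requires a full pass

batch_size = 100
estimates = []
for _ in range(2_000):
    idx = rng.choice(N, size=batch_size, replace=False)   # sample minibatch M
    estimates.append((N / batch_size) * per_point_grads[idx].sum())

print(f"full-data sum:            {full_sum:.2f}")
print(f"mean of scaled estimates: {np.mean(estimates):.2f}")  # close to full_sum
print(f"std of a single estimate: {np.std(estimates):.2f}")   # noise in one step
```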
SVI proceeds iteratively, updating global and local variational parameters using these stochastic gradients. At each step $t$:

1. Sample a minibatch $M_t$ of data points uniformly at random.
2. Optimize the local variational parameters $\phi_n$ for each $n \in M_t$, holding the current global parameters $\lambda^{(t)}$ fixed.
3. Compute the noisy gradient $\nabla_\lambda \hat{\mathcal{L}}_{M_t}$ from the minibatch.
4. Update the global parameters, $\lambda^{(t+1)} \leftarrow \lambda^{(t)} + \rho_t \nabla_\lambda \hat{\mathcal{L}}_{M_t}$, where $\rho_t$ is the learning rate.
This process repeats, cycling through minibatches of data. The learning rate $\rho_t$ typically decreases over iterations to ensure convergence.
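One way to organize this loop in code is sketched below. The callables `local_step` and `global_grad` are placeholders for the model-specific computations (the optimization of $\phi_n$ on the minibatch and the noisy gradient estimate from the previous section); they are not part of any particular library.

```python
import numpy as np

def svi(data, lam_init, local_step, global_grad,
        n_steps=1000, batch_size=64, tau0=1.0, kappa=0.7, seed=0):
    """Generic SVI loop.

    local_step(minibatch, lam)          -> local variational params phi for the batch
    global_grad(minibatch, phi, lam, s) -> noisy ELBO gradient w.r.t. lam,
                                           with s = N / |M| the minibatch scaling
    Both callables are model-specific and supplied by the user.
    """
    rng = np.random.default_rng(seed)
    N = len(data)
    lam = np.asarray(lam_init, dtype=float)

    for t in range(1, n_steps + 1):
        rho_t = (tau0 + t) ** (-kappa)                        # decaying learning rate
        idx = rng.choice(N, size=batch_size, replace=False)   # sample minibatch M_t
        minibatch = data[idx]

        phi = local_step(minibatch, lam)                      # optimize local params
        grad = global_grad(minibatch, phi, lam, N / batch_size)

        lam = lam + rho_t * grad                              # noisy gradient ascent step
    return lam
```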
Figure: A conceptual view of an SVI update step. A small minibatch is sampled from the large dataset and, together with the current global parameters $\lambda$, is used to compute a noisy gradient, which then updates $\lambda$.
The choice of the learning rate schedule $\rho_t$ is important for SVI's performance. For the stochastic updates to converge properly, the learning rates must satisfy the Robbins-Monro conditions:
$$\sum_{t=1}^{\infty} \rho_t = \infty \quad \text{and} \quad \sum_{t=1}^{\infty} \rho_t^2 < \infty.$$
A common choice is a polynomial decay schedule:
$$\rho_t = (\tau_0 + t)^{-\kappa},$$
where $\kappa \in (0.5, 1]$ controls the decay rate and $\tau_0 \ge 0$ down-weights early iterations. Tuning $\kappa$ and $\tau_0$, along with the minibatch size $|M|$, often requires experimentation. Too large a learning rate can lead to instability, while too small a rate results in slow convergence.
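As a quick illustration of the decay schedule, with illustrative values $\tau_0 = 1$ and $\kappa = 0.7$ (any $\kappa$ in $(0.5, 1]$ satisfies the conditions above):

```python
def learning_rate(t, tau0=1.0, kappa=0.7):
    """Polynomial decay rho_t = (tau0 + t)**(-kappa); for 0.5 < kappa <= 1
    this satisfies the Robbins-Monro conditions."""
    return (tau0 + t) ** (-kappa)

for t in [1, 10, 100, 1_000, 10_000]:
    print(t, round(learning_rate(t), 4))
# rho_t shrinks slowly enough that its sum diverges, yet fast enough
# that the sum of rho_t**2 converges.
```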
Standard gradient ascent updates the parameters $\lambda$ in the direction of steepest ascent under the Euclidean geometry of the parameter space. However, variational parameters define probability distributions, and small Euclidean changes in $\lambda$ can correspond to large changes in the distribution $q(\beta; \lambda)$, or vice versa. The space of distributions has its own geometry, captured by the Fisher information matrix $F(\lambda)$.
Natural gradients modify the update direction by pre-multiplying the standard gradient with the inverse of the Fisher information matrix:
$$\tilde{\nabla}_\lambda \mathcal{L} = F(\lambda)^{-1} \nabla_\lambda \mathcal{L}.$$
The SVI update rule becomes:
$$\lambda^{(t+1)} \leftarrow \lambda^{(t)} + \rho_t \tilde{\nabla}_\lambda \hat{\mathcal{L}}_{M_t}.$$
For variational distributions belonging to the exponential family (a common choice), calculating the natural gradient can be simpler than the standard gradient and often leads to significantly faster convergence, because it accounts for the information geometry of the variational parameter space.
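For conditionally conjugate exponential-family models, the natural-gradient step takes a particularly convenient form: a convex combination of the current global parameters and an "intermediate" estimate built from the rescaled minibatch statistics. The minimal sketch below illustrates this for a deliberately simple model with no local latent variables, a Beta-Bernoulli model with $q(\theta) = \text{Beta}(\lambda_1, \lambda_2)$, where the SVI result can be compared against the exact posterior. The model and hyperparameter values are chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: x_n ~ Bernoulli(0.3), prior theta ~ Beta(a0, b0).
N = 50_000
data = (rng.random(N) < 0.3).astype(float)
a0, b0 = 1.0, 1.0

# Global variational posterior q(theta) = Beta(lam[0], lam[1]).
lam = np.array([a0, b0])
batch_size, tau0, kappa = 100, 1.0, 0.7

for t in range(1, 2001):
    rho_t = (tau0 + t) ** (-kappa)
    x = data[rng.choice(N, size=batch_size, replace=False)]

    # Intermediate global parameters: prior plus rescaled minibatch statistics.
    lam_hat = np.array([a0 + (N / batch_size) * x.sum(),
                        b0 + (N / batch_size) * (1 - x).sum()])

    # Natural-gradient step: lam + rho_t * (lam_hat - lam).
    lam = (1 - rho_t) * lam + rho_t * lam_hat

exact = np.array([a0 + data.sum(), b0 + N - data.sum()])
print("SVI estimate of (lam1, lam2):", lam.round(1))
print("Exact posterior parameters:  ", exact.round(1))
print("Posterior mean, SVI vs exact:",
      round(lam[0] / lam.sum(), 4), round(exact[0] / exact.sum(), 4))
```

Because batches are sampled from the dataset itself, the fixed point of these updates is the exact posterior, and the decaying learning rate drives the iterates toward it without ever requiring a full pass over the data in any single step.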
Advantages:

- Scalability: each update touches only a minibatch, so the per-iteration cost does not grow with $N$, making inference feasible on datasets with millions or billions of data points.
- Memory efficiency: only the current minibatch and the global variational parameters need to be held in memory.
- Early progress: the global parameters can improve substantially before the algorithm has seen the entire dataset even once.

Considerations:

- Hyperparameter tuning: the learning rate schedule ($\kappa$, $\tau_0$) and the minibatch size $|M|$ require tuning; poor choices cause instability or slow convergence.
- Noisy updates: stochastic gradients introduce variance, so the ELBO does not increase monotonically as it does under CAVI's deterministic updates.
- Local optima: like CAVI, SVI optimizes a non-convex objective and can converge to a local optimum, so results depend on initialization.
SVI provides a powerful tool for applying Bayesian inference to large-scale problems where traditional methods become computationally infeasible. It forms the backbone of many modern probabilistic modeling applications, particularly in areas like topic modeling (e.g., Latent Dirichlet Allocation on large text corpora) and is a precursor to techniques used in Bayesian deep learning. While MCMC methods might offer asymptotically exact samples, and CAVI provides deterministic updates, SVI strikes a practical balance, enabling approximate Bayesian inference at scale by leveraging the power of stochastic optimization.