While Coordinate Ascent Variational Inference (CAVI) provides analytical updates for the variational parameters λ when using conjugate models and the mean-field approximation, and Stochastic Variational Inference (SVI) scales this using stochastic gradients, both methods rely on deriving model-specific, analytical expressions for the Evidence Lower Bound (ELBO) or its gradients. What happens when these derivations are difficult or even impossible? This situation arises frequently with:
- Non-conjugate models: When the prior and likelihood do not form a conjugate pair, the updates for the variational parameters in CAVI might not have a closed-form solution.
- Complex model structures: Models involving intricate dependencies, simulators, or functions within the generative process often make the joint probability p(x,z) difficult to differentiate analytically with respect to the variational parameters λ.
Black Box Variational Inference (BBVI) provides a general approach to VI that overcomes this limitation. It allows us to perform variational inference even when we cannot compute the gradient of the ELBO, ∇_λ L(λ), analytically. The core idea is to approximate this gradient using Monte Carlo estimation. BBVI treats the model itself as a "black box": it requires only the ability to evaluate the log joint probability log p(x, z) and the log variational density log q(z∣λ), plus, depending on the estimator chosen, certain of their gradients.
Estimating the ELBO Gradient
Recall the ELBO:
L(λ) = E_{q(z∣λ)}[log p(x, z) − log q(z∣λ)]
We need to compute ∇_λ L(λ). The challenge lies in the fact that the expectation is taken with respect to q(z∣λ), which itself depends on λ.
The Score Function Gradient Estimator
One way to proceed uses the log-derivative trick, also known as the score function identity: ∇_λ q(z∣λ) = q(z∣λ) ∇_λ log q(z∣λ). Differentiating the ELBO and applying the product rule gives:
∇_λ L(λ) = ∇_λ ∫ q(z∣λ) [log p(x, z) − log q(z∣λ)] dz
= ∫ ∇_λ q(z∣λ) [log p(x, z) − log q(z∣λ)] dz + ∫ q(z∣λ) ∇_λ [log p(x, z) − log q(z∣λ)] dz
Since the model's joint probability p(x, z) does not depend on the variational parameters λ, the gradient of log p(x, z) with respect to λ is zero. The remaining −∇_λ log q(z∣λ) term in the second integral also vanishes in expectation, because E_{q(z∣λ)}[∇_λ log q(z∣λ)] = ∫ ∇_λ q(z∣λ) dz = ∇_λ 1 = 0. Applying the log-derivative trick to the first integral then gives:
∇_λ L(λ) = E_{q(z∣λ)}[(∇_λ log q(z∣λ)) (log p(x, z) − log q(z∣λ))]
This expectation can be approximated using Monte Carlo samples: draw S samples z^{(s)} ∼ q(z∣λ) and compute:
∇_λ L(λ) ≈ (1/S) ∑_{s=1}^{S} (∇_λ log q(z^{(s)}∣λ)) (log p(x, z^{(s)}) − log q(z^{(s)}∣λ))
This is the score function gradient estimator, sometimes called the REINFORCE estimator in reinforcement learning contexts. To use it, we only need to sample from q(z∣λ), evaluate log p(x, z) and log q(z∣λ), and compute the gradient of log q(z∣λ) with respect to λ.
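To make this concrete, here is a minimal NumPy sketch of the score function estimator for a deliberately simple, hypothetical setup: a standard normal prior on a scalar z, observations x_i ∼ N(z, 1), and a Gaussian q(z∣λ) with λ = (μ, log σ). The model and parameterization are illustrative choices, not part of BBVI itself; the score ∇_λ log q is written out analytically for this family.

```python
import numpy as np

# Toy model (illustrative): prior z ~ N(0, 1), likelihood x_i ~ N(z, 1).
# BBVI only ever evaluates this as a black box.
def log_joint(x, z):
    log_prior = -0.5 * z**2 - 0.5 * np.log(2 * np.pi)
    log_lik = np.sum(-0.5 * (x - z) ** 2 - 0.5 * np.log(2 * np.pi))
    return log_prior + log_lik

def score_function_gradient(x, mu, log_sigma, S=100, rng=None):
    """Score function (REINFORCE) estimate of the ELBO gradient for q = N(mu, sigma^2)."""
    if rng is None:
        rng = np.random.default_rng(0)
    sigma = np.exp(log_sigma)
    z = rng.normal(mu, sigma, size=S)  # z^(s) ~ q(z | lam)
    log_q = -0.5 * ((z - mu) / sigma) ** 2 - log_sigma - 0.5 * np.log(2 * np.pi)
    f = np.array([log_joint(x, zs) for zs in z]) - log_q  # log p(x, z^(s)) - log q(z^(s) | lam)
    # Score of the Gaussian w.r.t. lam = (mu, log_sigma), derived by hand:
    score_mu = (z - mu) / sigma**2
    score_log_sigma = ((z - mu) / sigma) ** 2 - 1.0
    return np.array([np.mean(score_mu * f), np.mean(score_log_sigma * f)])
```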
A significant drawback of the score function estimator is that it often exhibits high variance, requiring many samples S for a stable estimate, which can slow down convergence.
The Reparameterization Trick (Pathwise Derivative Estimator)
An alternative, often lower-variance gradient estimator can be used when the variational distribution q(z∣λ) is amenable to the reparameterization trick. This involves expressing the random variable z as a deterministic and differentiable transformation g(⋅,λ) of an auxiliary random variable ϵ with a fixed distribution p(ϵ) that does not depend on λ.
z = g(ϵ, λ), where ϵ ∼ p(ϵ)
For example, if q(z∣λ) is a Gaussian distribution N(μ, σ²) with λ = {μ, σ}, we can reparameterize z as z = μ + σϵ, where ϵ ∼ N(0, 1).
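A quick numerical check of this equivalence, continuing the NumPy session above (the values of μ and σ are arbitrary):

```python
rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.7
z_direct = rng.normal(mu, sigma, size=100_000)     # sample q(z | lam) directly
z_reparam = mu + sigma * rng.normal(size=100_000)  # z = g(eps, lam), eps ~ N(0, 1)
print(z_direct.mean(), z_reparam.mean())  # both approximately 1.5
print(z_direct.std(), z_reparam.std())    # both approximately 0.7
```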
With this reparameterization, we can rewrite the ELBO expectation:
L(λ) = E_{p(ϵ)}[log p(x, g(ϵ, λ)) − log q(g(ϵ, λ)∣λ)]
Now, the expectation is with respect to p(ϵ), which is independent of λ. We can move the gradient inside the expectation:
∇_λ L(λ) = E_{p(ϵ)}[∇_λ (log p(x, g(ϵ, λ)) − log q(g(ϵ, λ)∣λ))]
The gradient can be approximated using Monte Carlo samples: draw S samples ϵ^{(s)} ∼ p(ϵ), compute z^{(s)} = g(ϵ^{(s)}, λ), and estimate:
∇_λ L(λ) ≈ (1/S) ∑_{s=1}^{S} ∇_λ (log p(x, z^{(s)}) − log q(z^{(s)}∣λ))
Writing f(z) = log p(x, z) − log q(z∣λ) for the bracketed term, the chain rule gives ∇_λ f(z^{(s)}) = (∇_z f(z^{(s)}))ᵀ ∇_λ g(ϵ^{(s)}, λ), plus the direct dependence of log q(z∣λ) on λ. This requires gradients of the log joint and log variational density with respect to the latent variables z, and the gradient of the transformation g with respect to λ. Modern automatic differentiation libraries readily handle these computations.
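Continuing the toy Gaussian example, here is the pathwise estimator with the chain rule written out by hand; in practice an autodiff library would compute these derivatives. The +1 term is the direct gradient of −log q with respect to log σ (the entropy contribution).

```python
def reparam_gradient(x, mu, log_sigma, S=100, rng=None):
    """Pathwise (reparameterization) estimate of the ELBO gradient for q = N(mu, sigma^2)."""
    if rng is None:
        rng = np.random.default_rng(0)
    sigma = np.exp(log_sigma)
    eps = rng.normal(size=S)  # eps^(s) ~ N(0, 1)
    z = mu + sigma * eps      # z^(s) = g(eps^(s), lam)
    # d/dz log p(x, z) for the toy model: -z from the prior, sum(x - z) from the likelihood.
    dlogp_dz = -z + np.sum(x) - len(x) * z
    # Chain rule: dz/dmu = 1 and dz/dlog_sigma = sigma * eps.
    grad_mu = np.mean(dlogp_dz)
    # -log q(g(eps, lam) | lam) = 0.5 * eps**2 + log_sigma + const, so its gradient
    # is 0 w.r.t. mu and exactly 1 w.r.t. log_sigma.
    grad_log_sigma = np.mean(dlogp_dz * sigma * eps) + 1.0
    return np.array([grad_mu, grad_log_sigma])
```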
The reparameterization trick typically yields gradient estimates with much lower variance compared to the score function method, often leading to faster and more stable convergence. However, it's not universally applicable; it requires finding a suitable function g(ϵ,λ).
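The variance gap is easy to observe empirically by evaluating both estimators repeatedly at the same λ on the same synthetic data (continuing the sketches above; the data-generating values are arbitrary):

```python
x = np.random.default_rng(42).normal(2.0, 1.0, size=20)  # synthetic observations
rng = np.random.default_rng(3)
sf = np.stack([score_function_gradient(x, 0.5, 0.0, S=10, rng=rng) for _ in range(500)])
rp = np.stack([reparam_gradient(x, 0.5, 0.0, S=10, rng=rng) for _ in range(500)])
print(sf.std(axis=0))  # spread of the score function estimates per parameter
print(rp.std(axis=0))  # typically far smaller for the pathwise estimates
```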
Figure: flow of the reparameterization trick within BBVI. Samples from a base distribution p(ϵ) are transformed using the variational parameters λ to obtain samples z; these are used to evaluate log probabilities, whose gradients with respect to λ form the ELBO gradient estimate, which is fed into an optimizer to update λ.
The BBVI Algorithm
BBVI leverages these gradient estimators within a stochastic optimization framework, similar to SVI.
- Initialize variational parameters λ.
- Repeat until convergence:
a. Sample: Draw S Monte Carlo samples.
* If using the score function: Sample z^{(s)} ∼ q(z∣λ) for s = 1, ..., S.
* If using reparameterization: Sample ϵ^{(s)} ∼ p(ϵ) and compute z^{(s)} = g(ϵ^{(s)}, λ) for s = 1, ..., S.
b. Estimate Gradient: Compute the Monte Carlo estimate ĝ of ∇_λ L(λ) using either the score function or reparameterization formula. If performing SVI, use the gradient estimate computed on the current data mini-batch.
c. Update Parameters: Update λ using a stochastic optimization algorithm (e.g., Adam, RMSprop) with the estimated gradient ĝ:
λ_{t+1} ← OptimizerUpdate(λ_t, ĝ)
When combined with data subsampling (as in SVI), the gradient-estimation step (b) also incorporates stochasticity from the data mini-batch.
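Putting the pieces together, here is a minimal BBVI loop for the running toy example, with a hand-rolled Adam-style update (the hyperparameters are illustrative, and x and reparam_gradient come from the earlier sketches). Note the gradient ascent: the step is added because the ELBO is maximized.

```python
def bbvi(x, grad_fn, lam_init, steps=2000, lr=0.05):
    lam = np.array(lam_init, dtype=float)
    m, v = np.zeros_like(lam), np.zeros_like(lam)  # Adam moment accumulators
    b1, b2, eps = 0.9, 0.999, 1e-8
    rng = np.random.default_rng(1)
    for t in range(1, steps + 1):
        g = grad_fn(x, lam[0], lam[1], S=10, rng=rng)  # steps (a) + (b): sample, estimate
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g**2
        lam += lr * (m / (1 - b1**t)) / (np.sqrt(v / (1 - b2**t)) + eps)  # step (c): ascend
    return lam

mu, log_sigma = bbvi(x, reparam_gradient, lam_init=[0.0, 0.0])
# This toy model is conjugate, so the exact posterior N(sum(x)/(n+1), 1/(n+1))
# (mean, variance) is available to check mu and exp(log_sigma) against.
print(mu, np.exp(log_sigma))
```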
Advantages and Considerations
- Generality: BBVI's primary advantage is its applicability to a vast range of models, including those with non-conjugate structures or intractable analytic gradients. You only need to evaluate the model's log probabilities (and their gradients for reparameterization).
- Leverages Auto-Diff: Modern probabilistic programming languages (PPLs) like Pyro, NumPyro, TensorFlow Probability, and Stan heavily rely on automatic differentiation, making the implementation of BBVI (especially with the reparameterization trick) relatively straightforward for users. The PPL handles the gradient computations.
- Variance: The choice of gradient estimator is important. The score function estimator can suffer from high variance, potentially hindering convergence. The reparameterization trick, when applicable, is generally preferred for its lower variance. Advanced variance reduction techniques (e.g., control variates) can sometimes further improve the stability of both estimators; a simple example is sketched after this list.
- Tuning: Like other stochastic optimization methods, BBVI might require careful tuning of learning rates and optimizer parameters. The number of Monte Carlo samples S per gradient step also influences the trade-off between gradient accuracy and computational cost.
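As an illustration of the variance-reduction point above, here is the score function estimator with the simplest control variate, a constant baseline: subtracting a z-independent constant b from f leaves the expectation unchanged because E_{q(z∣λ)}[∇_λ log q(z∣λ)] = 0. Using the batch mean of f as the baseline, as below, introduces a tiny correlation-induced bias that a leave-one-out baseline would remove; the sketch reuses log_joint from the first snippet.

```python
def score_function_gradient_cv(x, mu, log_sigma, S=100, rng=None):
    """Score function estimator with a constant baseline b = mean(f) as a control variate."""
    if rng is None:
        rng = np.random.default_rng(0)
    sigma = np.exp(log_sigma)
    z = rng.normal(mu, sigma, size=S)
    log_q = -0.5 * ((z - mu) / sigma) ** 2 - log_sigma - 0.5 * np.log(2 * np.pi)
    f = np.array([log_joint(x, zs) for zs in z]) - log_q
    f = f - f.mean()  # baseline subtraction: variance drops, expectation (essentially) unchanged
    score = np.stack([(z - mu) / sigma**2, ((z - mu) / sigma) ** 2 - 1.0])
    return score @ f / S  # shape (2,): gradient estimates w.r.t. (mu, log_sigma)
```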
BBVI significantly broadens the scope of models amenable to variational inference. By replacing analytical gradient derivations with Monte Carlo estimation, it provides a powerful and flexible tool for approximate Bayesian inference, particularly when combined with the scalability of stochastic optimization and the convenience of automatic differentiation.