Alright, let's put the theory of Bayesian Neural Networks into practice. In this section, we'll build, train, and evaluate a simple BNN using Variational Inference (VI). We'll focus on a regression task, which allows for intuitive visualization of the model's predictions and its associated uncertainty. We will use TensorFlow Probability (TFP), a library that integrates probabilistic reasoning and statistical analysis with TensorFlow.
You should have TensorFlow and TensorFlow Probability installed. If not, you can typically install them using pip:
pip install tensorflow tensorflow-probability numpy matplotlib plotly
First, let's import the necessary libraries and generate some synthetic data for our regression problem. We'll create data where the relationship between the input x and output y is non-linear, with some added noise. This noise represents the aleatoric uncertainty.
import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp
import matplotlib.pyplot as plt
import plotly.graph_objects as go
# For reproducibility
np.random.seed(42)
tf.random.set_seed(42)
tfd = tfp.distributions
tfk = tf.keras
tfkl = tf.keras.layers
tfpl = tfp.layers
# Generate synthetic data
def generate_data(n_samples=100, noise_std=0.1):
X = np.linspace(-3, 3, n_samples).astype(np.float32).reshape(-1, 1)
# Non-linear function with noise
y = X * np.sin(X * 2) + np.random.normal(0, noise_std, size=(n_samples, 1)).astype(np.float32)
return X, y
X_train, y_train = generate_data(n_samples=150, noise_std=0.2)
X_test = np.linspace(-4, 4, 200).astype(np.float32).reshape(-1, 1)
# Visualize the training data
fig = go.Figure()
fig.add_trace(go.Scatter(x=X_train.flatten(), y=y_train.flatten(), mode='markers', name='Training Data', marker=dict(color='#1f77b4', size=6)))
fig.update_layout(
title='Synthetic Regression Data',
xaxis_title='Input (x)',
yaxis_title='Output (y)',
template='plotly_white',
legend_title_text='Data'
)
# fig.show() # Use this in a Python environment to display
The training data follows the pattern $y \approx x \sin(2x)$ with added Gaussian noise.
Now, we'll define our BNN using the Keras functional API and TFP layers. Specifically, we use tfp.layers.DenseVariational. This layer represents a densely-connected neural network layer where the weights and biases are distributions (our approximate posterior q(w)) rather than point estimates.
During training, this layer adds a KL divergence term to the model's loss. This term measures the difference between the learned approximate posterior q(w) and the prior p(w). The layer automatically handles the sampling needed for the forward pass and the calculation of this KL term as part of the VI objective (ELBO maximization, or equivalently, negative ELBO minimization).
We need to specify two ingredients for each DenseVariational layer: a function that builds the prior distribution p(w) over that layer's weights and biases, and a function that builds the form of the approximate posterior q(w). Both are defined below.
# Define the prior distribution for weights and biases:
# a fixed (non-trainable) standard Normal, N(0, 1), over every parameter
def prior_fn(kernel_size, bias_size, dtype=None):
    n = kernel_size + bias_size
    prior_model = tfk.Sequential([
        # DistributionLambda ignores its input and always returns the same prior
        tfpl.DistributionLambda(lambda t: tfd.Independent(
            tfd.Normal(loc=tf.zeros(n, dtype=dtype), scale=1.0),
            reinterpreted_batch_ndims=1))
    ])
    return prior_model
# Define the posterior approximation q(w): a trainable mean-field Gaussian,
# i.e. an independent Normal (one mean and one scale parameter) per weight
def posterior_fn(kernel_size, bias_size, dtype=None):
    n = kernel_size + bias_size
    posterior_model = tfk.Sequential([
        # params_size(n) = 2n trainable values: n locations and n raw scales
        tfpl.VariableLayer(tfpl.IndependentNormal.params_size(n), dtype=dtype),
        # Sample a concrete weight vector on every forward pass
        tfpl.IndependentNormal(n, convert_to_tensor_fn=tfd.Distribution.sample)
    ])
    return posterior_model
# Build the BNN model
def create_bnn_model(train_size):
inputs = tfkl.Input(shape=(1,))
hidden = tfpl.DenseVariational(
units=32,
make_prior_fn=prior_fn,
make_posterior_fn=posterior_fn,
kl_weight=1/train_size, # Scale KL divergence by dataset size
activation='relu'
)(inputs)
hidden = tfpl.DenseVariational(
units=16,
make_prior_fn=prior_fn,
make_posterior_fn=posterior_fn,
kl_weight=1/train_size,
activation='relu'
)(hidden)
# Output layer: Predicting mean of a Normal distribution
# We model the output y as y ~ Normal(loc=f(x), scale=sigma)
# Here, f(x) is the output of the DenseVariational layer
# We'll use a fixed standard deviation (sigma) for simplicity,
# effectively using Mean Squared Error as the negative log-likelihood.
# Alternatively, another output head could predict sigma (aleatoric uncertainty).
output_mean = tfpl.DenseVariational(
units=1, # Predicting the mean parameter
make_prior_fn=prior_fn,
make_posterior_fn=posterior_fn,
kl_weight=1/train_size
# No activation for regression output mean
)(hidden)
# For simplicity, we use MSE loss, corresponding to a fixed Gaussian likelihood std dev.
# A more complete BNN might also predict the std dev (scale).
# Example: output_scale = tfpl.DenseVariational(...) -> tf.exp(output_scale_raw)
# Then use tfp.layers.IndependentNormal(1) as the final layer.
model = tfk.Model(inputs=inputs, outputs=output_mean)
return model
bnn_model = create_bnn_model(train_size=len(X_train))
bnn_model.summary()
We scale the KL divergence term by 1 / train_size. This is common practice in VI for BNNs: Keras averages the likelihood term over the examples in each batch, so dividing the total KL divergence by the dataset size keeps the data-fit term and the regularization term on the same per-example scale in the objective function.
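If you want to see how this term enters the objective, Keras exposes it: DenseVariational registers its (already kl_weight-scaled) KL penalty through the layer's add_loss mechanism, so after a forward pass it appears in bnn_model.losses. A quick check, assuming the model defined above:

# One forward pass so each DenseVariational layer records its KL penalty,
# then inspect the regularization terms Keras will add to the loss.
_ = bnn_model(X_train[:8])
for i, kl_term in enumerate(bnn_model.losses):
    # Each entry is one layer's KL[q(w) || p(w)], already scaled by kl_weight
    print(f"Layer {i}: scaled KL term = {float(kl_term):.4f}")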
For VI, the objective is to maximize the Evidence Lower Bound (ELBO), which is equivalent to minimizing the negative ELBO. The negative ELBO can be written as:
$$-\mathrm{ELBO} = -\mathbb{E}_{q(w)}\big[\log p(\mathcal{D} \mid w)\big] + \mathrm{KL}\big[q(w) \,\|\, p(w)\big]$$

The first term is the expected negative log-likelihood of the data given parameters sampled from the approximate posterior. The second term is the KL divergence between the approximate posterior and the prior.
When using Keras with DenseVariational, the KL divergence term is automatically added to the model's loss. We only need to specify the negative log-likelihood term as our main loss function. For regression with assumed Gaussian noise of constant variance, the negative log-likelihood is proportional to the Mean Squared Error (MSE).
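To see why MSE is a valid substitute, write out the Gaussian negative log-likelihood for a single observation, where $f_w(x)$ is the network's predicted mean and $\sigma$ is the fixed noise scale:

$$-\log p(y \mid x, w) = \frac{\big(y - f_w(x)\big)^2}{2\sigma^2} + \log \sigma + \tfrac{1}{2}\log 2\pi$$

With $\sigma$ held constant, the last two terms do not depend on the weights, so minimizing this expression over $w$ is equivalent to minimizing the squared error.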
# Negative log-likelihood loss. With a fixed Gaussian noise scale, the NLL is
# proportional to the MSE (see the derivation above), so we use MSE directly.
def nll_loss(y_true, y_pred_mean):
    # Here the model outputs only the predicted mean, so y_pred_mean is a plain tensor.
    # If the output layer were a distribution (e.g. tfp.layers.IndependentNormal),
    # we would instead return -y_pred_distribution.log_prob(y_true).
    return tf.reduce_mean(tf.square(y_true - y_pred_mean))
# Compile the model
optimizer = tfk.optimizers.Adam(learning_rate=0.01)
bnn_model.compile(optimizer=optimizer, loss=nll_loss) # Keras adds KL divergence automatically
# Train the model
print("Starting training...")
history = bnn_model.fit(X_train, y_train, epochs=500, batch_size=32, verbose=0)
print("Training finished.")
# You can plot the loss curve (total loss = NLL + KL divergence)
# plt.plot(history.history['loss'])
# plt.title('Model Loss During Training')
# plt.xlabel('Epoch')
# plt.ylabel('Total Loss (-ELBO)')
# plt.show()
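The comments in the loss function point to a richer alternative: let the final layer output a full Normal distribution, so that its scale, which captures aleatoric uncertainty, is learned from data rather than fixed. The sketch below shows one possible wiring, reusing prior_fn and posterior_fn; the create_bnn_with_aleatoric name and the reduced architecture are illustrative choices, not part of the original example.

# Sketch: the final layer outputs a Normal distribution whose scale (aleatoric
# uncertainty) is learned instead of being treated as a fixed constant.
def create_bnn_with_aleatoric(train_size):
    inputs = tfkl.Input(shape=(1,))
    hidden = tfpl.DenseVariational(
        units=32,
        make_prior_fn=prior_fn,
        make_posterior_fn=posterior_fn,
        kl_weight=1/train_size,
        activation='relu'
    )(inputs)
    # Produce the parameters (loc and raw scale) of a 1D Normal distribution
    params = tfpl.DenseVariational(
        units=tfpl.IndependentNormal.params_size(1),
        make_prior_fn=prior_fn,
        make_posterior_fn=posterior_fn,
        kl_weight=1/train_size
    )(hidden)
    # Wrap the parameters into a distribution object as the model output
    outputs = tfpl.IndependentNormal(1)(params)
    return tfk.Model(inputs=inputs, outputs=outputs)

# With a distribution output, the loss is the exact negative log-likelihood
negloglik = lambda y_true, y_pred_dist: -y_pred_dist.log_prob(y_true)
# bnn_aleatoric = create_bnn_with_aleatoric(len(X_train))
# bnn_aleatoric.compile(optimizer=tfk.optimizers.Adam(learning_rate=0.01), loss=negloglik)
# bnn_aleatoric.fit(X_train, y_train, epochs=500, batch_size=32, verbose=0)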
A key advantage of BNNs is their ability to quantify uncertainty. With VI, we approximate the posterior p(w∣D) with q(w). To get predictive uncertainty, we perform multiple forward passes through the network, each time sampling a different set of weights wi∼q(w). The variation in the outputs reflects the model's epistemic uncertainty (uncertainty about the model parameters).
# Make predictions by sampling multiple times
n_samples = 100
predictions_mc = np.stack([bnn_model(X_test).numpy() for _ in range(n_samples)], axis=0)
# Squeeze unnecessary dimensions
predictions_mc = np.squeeze(predictions_mc) # Shape: (n_samples, n_test_points)
# Calculate predictive mean and standard deviation
pred_mean = np.mean(predictions_mc, axis=0)
pred_std = np.std(predictions_mc, axis=0)
# Visualize the results: mean prediction and uncertainty bounds
fig = go.Figure()
# Uncertainty bounds (e.g., +/- 2 standard deviations)
fig.add_trace(go.Scatter(
x=np.concatenate([X_test.flatten(), X_test.flatten()[::-1]]),
y=np.concatenate([pred_mean - 2 * pred_std, (pred_mean + 2 * pred_std)[::-1]]),
fill='toself',
fillcolor='rgba(250, 82, 82, 0.2)', # Faint red color #fa5252
line=dict(color='rgba(255,255,255,0)'),
hoverinfo="skip",
showlegend=False,
name='Epistemic Uncertainty (±2 std)'
))
# Mean prediction
fig.add_trace(go.Scatter(
x=X_test.flatten(), y=pred_mean,
mode='lines', name='Predictive Mean',
line=dict(color='#f03e3e') # Red color #f03e3e
))
# Original training data
fig.add_trace(go.Scatter(
x=X_train.flatten(), y=y_train.flatten(),
mode='markers', name='Training Data',
marker=dict(color='#1c7ed6', size=6) # Blue color #1c7ed6
))
fig.update_layout(
title='BNN Regression with Uncertainty',
xaxis_title='Input (x)',
yaxis_title='Output (y)',
template='plotly_white',
legend_title_text='Components'
)
# fig.show() # Use this in a Python environment to display
The BNN's predictive mean (red line) captures the underlying trend, while the shaded area (±2 standard deviations from the mean) represents epistemic uncertainty. Notice that the uncertainty increases in regions with no training data (e.g., $x < -3$ or $x > 3$) and also where the function changes rapidly.
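The shaded band reflects epistemic uncertainty only. Since we know the noise level used to generate the data (noise_std=0.2), we can sketch a total predictive band by adding the aleatoric variance to the epistemic variance; with a learned noise model (as sketched earlier) you would use the predicted scale instead.

# Combine epistemic (weight) uncertainty with the known aleatoric noise level.
# Variances of independent uncertainty sources add.
aleatoric_std = 0.2  # the noise_std used when generating the training data
total_std = np.sqrt(pred_std**2 + aleatoric_std**2)
lower, upper = pred_mean - 2 * total_std, pred_mean + 2 * total_std
# These bounds could be plotted in place of the epistemic-only band above.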
As discussed previously, Monte Carlo (MC) Dropout offers a simpler way to approximate Bayesian inference in existing standard NNs. It involves:
- Training a standard network that contains dropout layers, as usual.
- Keeping dropout active at prediction time instead of disabling it.
- Running multiple stochastic forward passes and treating the spread of the outputs as an uncertainty estimate.
While computationally cheaper and easier to implement in standard frameworks, MC Dropout is an approximation to a specific type of BNN (related to Gaussian Processes). The VI approach we implemented is often considered a more principled way to construct BNNs with explicit priors and posteriors.
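For comparison, here is a minimal sketch of MC Dropout in the same Keras setup; the architecture, dropout rate, and the create_mc_dropout_model name are illustrative. The key detail is passing training=True to the dropout layers so they remain stochastic at prediction time.

# MC Dropout sketch: a standard network with dropout kept active at test time.
def create_mc_dropout_model(dropout_rate=0.1):
    inputs = tfkl.Input(shape=(1,))
    x = tfkl.Dense(32, activation='relu')(inputs)
    # training=True keeps dropout stochastic even during inference
    x = tfkl.Dropout(dropout_rate)(x, training=True)
    x = tfkl.Dense(16, activation='relu')(x)
    x = tfkl.Dropout(dropout_rate)(x, training=True)
    outputs = tfkl.Dense(1)(x)
    return tfk.Model(inputs=inputs, outputs=outputs)

mc_model = create_mc_dropout_model()
mc_model.compile(optimizer=tfk.optimizers.Adam(learning_rate=0.01), loss='mse')
# mc_model.fit(X_train, y_train, epochs=500, batch_size=32, verbose=0)
# Multiple stochastic passes then give an approximate predictive distribution:
# mc_preds = np.stack([mc_model(X_test).numpy() for _ in range(100)], axis=0)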
In this practical section, we constructed a Bayesian Neural Network using TensorFlow Probability's DenseVariational layers. We trained it using Variational Inference, where the objective balanced fitting the data (via the negative log-likelihood/MSE) and staying close to prior beliefs (via the KL divergence). By sampling from the learned approximate posterior distribution over weights, we generated predictions along with quantifiable epistemic uncertainty estimates.
This example provides a foundation for applying BNNs. You could extend this by:
- Predicting the scale of the output distribution in addition to its mean, so the model also captures aleatoric uncertainty (as sketched earlier).
- Experimenting with different priors, layer widths, or KL weighting.
- Comparing the results and computational cost against the MC Dropout approximation.
Building BNNs provides a powerful framework for creating deep learning models that not only predict but also understand their own confidence.