Normalizing flows are mathematically formulated to operate on continuous probability distributions. However, many standard datasets used in generative modeling consist of discrete values. Digital images represent colors using integers between 0 and 255. Audio waveforms are often quantized into discrete bins. When we train a continuous flow model on strictly discrete data, we encounter a serious optimization issue.
If a continuous model assigns probability density to a discrete set of points, it can achieve an arbitrarily high likelihood by collapsing the density into Dirac delta functions at those exact points. The density approaches infinity at the data locations, causing the model to completely overfit. The network learns nothing about the actual distribution of the data and simply memorizes the discrete inputs. We must convert discrete data into a continuous format before passing it through the flow. This process is called data dequantization.
Figure: Data transformation pipeline mapping discrete inputs to continuous signals for generative modeling.
The simplest and most common method is uniform dequantization. We treat the discrete data points as the left edges of unit intervals. For a discrete data point $x$, we add continuous noise $u$ sampled from a uniform distribution on $[0, 1)$.
By adding this noise, we spread the probability mass of the discrete point uniformly across the interval $[x, x+1)$. The new variable $y = x + u$ is strictly continuous. If we want to scale the image data to stabilize neural network training, we usually divide by 256 so that the continuous values lie in the range $[0, 1)$.
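Written out for $D$-dimensional 8-bit image data, the full transformation is:

$$y = \frac{x + u}{256}, \qquad u \sim \mathcal{U}[0, 1)^D$$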
When evaluating the model, maximizing the log-likelihood of the continuous variable $y$ corresponds to maximizing a lower bound on the log-likelihood of the discrete data $x$. Because the noise is uniform, the probability of the discrete data point is exactly the integral of the continuous density $p(x + u)$ over the unit hypercube of noise values, which has volume one.
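Jensen's inequality makes the bound explicit: sampling the noise and averaging the log-density never exceeds the true discrete log-likelihood.

$$\log P(x) = \log \int_{[0,1)^D} p(x + u)\, du \;\geq\; \mathbb{E}_{u \sim \mathcal{U}[0,1)^D}\big[\log p(x + u)\big]$$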
Implementing uniform dequantization in PyTorch requires only a few lines of code. We sample noise from a uniform distribution with the same shape as our input tensor and apply the transformation.
import torch

def uniform_dequantize(x, num_bins=256):
    """
    Applies uniform dequantization to discrete data x.
    x: Float tensor of discrete values (e.g., image pixels).
    num_bins: Number of discrete bins.
    """
    # Sample uniform noise in [0, 1) with the same shape as x
    noise = torch.rand_like(x)
    # Add noise and normalize to the [0, 1) continuous range
    y = (x + noise) / num_bins
    return y

# Example usage with a dummy 8-bit image tensor
discrete_image = torch.randint(0, 256, (1, 3, 32, 32), dtype=torch.float32)
continuous_image = uniform_dequantize(discrete_image)
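Feeding the dequantized tensor into a flow and reading off the bound is then straightforward. The sketch below assumes a hypothetical `flow` object exposing a `log_prob` method that returns per-example log-densities; note the constant correction for the division by `num_bins`, which follows from the change of variables for the rescaling.

import math
import torch

def dequantized_log_likelihood(flow, x, num_bins=256):
    """
    Single-sample Monte Carlo estimate of the lower bound on log P(x)
    for discrete data x under uniform dequantization.
    `flow.log_prob(y)` is assumed to return per-example log p(y).
    """
    y = uniform_dequantize(x, num_bins)    # continuous values in [0, 1)
    log_p_y = flow.log_prob(y)             # log-density under the main flow
    # Dividing (x + noise) by num_bins rescales every dimension, so the
    # change of variables subtracts D * log(num_bins) from log p(x + u).
    num_dims = x[0].numel()
    return log_p_y - num_dims * math.log(num_bins)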
While uniform dequantization prevents infinite density spikes, it introduces its own challenges. The uniform noise distribution has sharp edges at the boundaries of the intervals. Normalizing flows, which map from smooth base distributions like isotropic Gaussians, struggle to model these sharp, non-differentiable boundaries efficiently. The flow network might waste capacity trying to model the artificial edges of the uniform noise rather than focusing on the underlying data features.
To improve training stability and model performance, researchers introduced variational dequantization. Instead of adding rigid uniform noise, we introduce a learned conditional distribution $q(u \mid x)$ over the noise. This noise distribution is itself typically parameterized by a small auxiliary normalizing flow conditioned on the data point.
The objective is to learn a noise distribution that makes the continuous data as smooth and easy to model as possible for the primary normalizing flow. We optimize a variational lower bound on the log-likelihood. The training objective becomes:

$$\log P(x) \;\geq\; \mathbb{E}_{u \sim q(u \mid x)}\left[\log \frac{p(x + u)}{q(u \mid x)}\right]$$
In this formulation, $p(x + u)$ is the density assigned by the main normalizing flow, and $q(u \mid x)$ is the density of the noise under the auxiliary dequantization flow. The network jointly trains the dequantization flow and the main generative flow. The dequantization flow actively learns to add noise that the main flow finds easy to assign high probability to. This smooths out the sharp boundaries of the uniform intervals and significantly improves density estimation performance on complex datasets.
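As a rough illustration of how the two pieces fit together, here is a minimal sketch of a variational dequantizer. Instead of the full conditional flow described above, it uses a simpler stand-in for $q(u \mid x)$: an $x$-conditioned Gaussian squashed through a sigmoid so the noise lands in $(0, 1)$, with both change-of-variables terms accounted for. All class and method names here (`VariationalDequantizer`, `flow.log_prob`) are illustrative placeholders, not the API of a specific library.

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class VariationalDequantizer(nn.Module):
    """Parameterizes q(u | x) as a sigmoid-squashed, x-conditioned Gaussian (illustrative)."""
    def __init__(self, num_channels):
        super().__init__()
        # Small conditioning network predicting per-pixel mean and log-std of the noise
        self.net = nn.Sequential(
            nn.Conv2d(num_channels, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 2 * num_channels, 3, padding=1),
        )

    def forward(self, x, num_bins=256):
        # Condition on the discrete data, rescaled to roughly [-1, 1]
        cond = x / (num_bins / 2.0) - 1.0
        mean, log_std = self.net(cond).chunk(2, dim=1)
        eps = torch.randn_like(x)
        z = mean + eps * log_std.exp()                 # Gaussian sample
        u = torch.sigmoid(z)                           # noise in (0, 1)

        # log q(u | x) = log N(z; mean, std) - log |d sigmoid(z) / dz|
        log_q_z = -0.5 * eps.pow(2) - log_std - 0.5 * math.log(2 * math.pi)
        log_det_sigmoid = F.logsigmoid(z) + F.logsigmoid(-z)   # = log u(1 - u)
        log_q_u = (log_q_z - log_det_sigmoid).flatten(1).sum(dim=1)
        return u, log_q_u

def variational_dequant_objective(flow, dequantizer, x, num_bins=256):
    """Per-example lower bound: E_q[ log p(x + u) - log q(u | x) ]."""
    u, log_q_u = dequantizer(x, num_bins)
    y = (x + u) / num_bins                    # continuous input for the main flow
    log_p_y = flow.log_prob(y)                # assumed per-example log-density
    num_dims = x[0].numel()
    log_p_x_plus_u = log_p_y - num_dims * math.log(num_bins)
    return log_p_x_plus_u - log_q_u           # maximize this (or minimize its negative)

A richer auxiliary flow, as described above, would simply replace the Gaussian-plus-sigmoid block while leaving the structure of the objective unchanged.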
Using either uniform or variational dequantization ensures your data strictly adheres to the continuous requirements of the change of variables theorem. With the discrete structures converted to smooth signals, the network can safely optimize the exact maximum likelihood objective without overfitting to specific coordinates.