The Glow architecture introduced two structural improvements to coupling networks: Activation Normalization, commonly called ActNorm, and invertible 1×1 convolutions. These components replace standard batch normalization and fixed channel permutations, improving training stability and performance in flow models.
Normalizing flows for high-resolution images require significant memory. This often forces the use of very small batch sizes, sometimes as small as a single image per batch. Standard batch normalization performs poorly in this regime because the batch statistics become highly noisy. ActNorm addresses this limitation by applying an affine transformation whose scale and bias parameters are initialized from the first batch of data.
For an input tensor $x$ with spatial dimensions $H \times W$ and $C$ channels, ActNorm applies a channel-wise scale $s$ and bias $b$:

$$y_{i,j,c} = s_c \cdot x_{i,j,c} + b_c$$
Here, $i$ and $j$ denote spatial coordinates, and $c$ denotes the channel index. During the very first forward pass, $s$ and $b$ are initialized such that the output activations have a mean of zero and a variance of one across the given batch. After this initial step, $s$ and $b$ are treated as regular trainable parameters independent of batch statistics.
Since this is an affine transformation, the log determinant of the Jacobian is straightforward to compute. It is simply the sum of the log absolute values of the scale parameters multiplied by the spatial dimensions:

$$\log\left|\det \frac{\partial y}{\partial x}\right| = H \cdot W \cdot \sum_{c} \log|s_c|$$

where $H$ and $W$ are the height and width of the input tensor.
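The following PyTorch sketch shows one way to implement ActNorm with data-dependent initialization and the log-determinant above. The module structure, parameter names, and the `initialized` buffer are illustrative assumptions rather than a reference implementation.

```python
import torch
import torch.nn as nn

class ActNorm(nn.Module):
    """Channel-wise affine transform with data-dependent initialization (sketch)."""
    def __init__(self, num_channels):
        super().__init__()
        # Store log_scale instead of scale so the scale stays positive
        self.log_scale = nn.Parameter(torch.zeros(1, num_channels, 1, 1))
        self.bias = nn.Parameter(torch.zeros(1, num_channels, 1, 1))
        self.register_buffer("initialized", torch.tensor(False))

    @torch.no_grad()
    def _initialize(self, x):
        # Use the first batch so outputs start with zero mean, unit variance per channel
        mean = x.mean(dim=(0, 2, 3), keepdim=True)
        std = x.std(dim=(0, 2, 3), keepdim=True) + 1e-6
        self.bias.copy_(-mean / std)
        self.log_scale.copy_(torch.log(1.0 / std))
        self.initialized.fill_(True)

    def forward(self, x):
        if not self.initialized:
            self._initialize(x)
        y = x * torch.exp(self.log_scale) + self.bias
        # log|det J| = H * W * sum_c log|s_c|
        h, w = x.shape[2], x.shape[3]
        log_det = h * w * self.log_scale.sum()
        return y, log_det

    def inverse(self, y):
        return (y - self.bias) * torch.exp(-self.log_scale)
```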
In RealNVP, information is mixed between the two halves of the partitioned data using fixed operations like reversing the channel order. While effective, fixed permutations limit the flexibility of the model. A 1×1 convolution is a linear transformation applied across the channel dimension. By replacing the fixed permutation with a learned 1×1 convolution, the model can discover the most effective way to blend channels.
Let $x$ be a tensor of shape $H \times W \times C$ and let $\mathbf{W}$ be a $C \times C$ weight matrix. The 1×1 convolution at each spatial location $(i, j)$ is:

$$y_{i,j} = \mathbf{W}\, x_{i,j}$$

To satisfy the requirements of a normalizing flow, $\mathbf{W}$ must be invertible. The log determinant of the Jacobian for this operation across the entire spatial grid is:

$$\log\left|\det \frac{\partial y}{\partial x}\right| = H \cdot W \cdot \log\left|\det \mathbf{W}\right|$$
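A direct implementation of the learned 1×1 convolution might look like the sketch below. The log determinant is computed explicitly with `torch.slogdet`, which is exactly the cubic-cost step that the LU parameterization discussed next avoids. Class and attribute names are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Invertible1x1Conv(nn.Module):
    """Learned C x C channel mixing applied at every spatial location (sketch)."""
    def __init__(self, num_channels):
        super().__init__()
        # Initialize with a random rotation so W starts invertible (|det W| = 1)
        w_init = torch.linalg.qr(torch.randn(num_channels, num_channels))[0]
        self.weight = nn.Parameter(w_init)

    def forward(self, x):
        _, c, h, w = x.shape
        # Apply W at every spatial location via a 1x1 convolution kernel
        y = F.conv2d(x, self.weight.view(c, c, 1, 1))
        # log|det J| = H * W * log|det W|, computed directly (cubic in C)
        log_det = h * w * torch.slogdet(self.weight)[1]
        return y, log_det

    def inverse(self, y):
        _, c, _, _ = y.shape
        w_inv = torch.inverse(self.weight)
        return F.conv2d(y, w_inv.view(c, c, 1, 1))
```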
Computing the determinant of a $C \times C$ matrix normally requires cubic time complexity. When evaluating this term at every step of training, the computation becomes expensive. To compute it efficiently, the weight matrix is parameterized using its LU decomposition:

$$\mathbf{W} = P\, L\, (U + \mathrm{diag}(s))$$
Here, $P$ is a fixed permutation matrix, $L$ is a lower triangular matrix with ones on the diagonal, $U$ is an upper triangular matrix with zeros on the diagonal, and $s$ is a vector. Because the determinant of a triangular matrix is the product of its diagonal elements, the log determinant of $\mathbf{W}$ simplifies to the sum of the log absolute values of $s$:

$$\log\left|\det \mathbf{W}\right| = \sum_{c} \log|s_c|$$
This reduces the cost of the log-determinant term from cubic to linear in the number of channels, and the triangular structure also makes recovering $\mathbf{W}^{-1}$ for the inverse pass cheaper than a full matrix inversion.
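A sketch of the LU-parameterized version follows. The permutation $P$ is kept fixed while $L$, the strictly upper triangular part of $U$, and $\log|s|$ are trained; masks re-impose the triangular structure after each gradient update. Names and initialization details are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InvertibleConvLU(nn.Module):
    """1x1 convolution with W parameterized as P L (U + diag(s)) (sketch)."""
    def __init__(self, num_channels):
        super().__init__()
        # Start from a random rotation and factor it once
        w_init = torch.linalg.qr(torch.randn(num_channels, num_channels))[0]
        p, l, u = torch.linalg.lu(w_init)
        s = torch.diagonal(u)
        self.register_buffer("p", p)                      # fixed permutation
        self.l = nn.Parameter(l)                          # lower triangular, unit diagonal
        self.u = nn.Parameter(torch.triu(u, diagonal=1))  # strictly upper triangular
        self.log_s = nn.Parameter(torch.log(torch.abs(s)))
        self.register_buffer("sign_s", torch.sign(s))
        # Masks keep the triangular structure after gradient updates
        self.register_buffer("l_mask", torch.tril(torch.ones(num_channels, num_channels), -1))
        self.register_buffer("eye", torch.eye(num_channels))

    def _weight(self):
        l = self.l * self.l_mask + self.eye
        u = self.u * self.l_mask.T + torch.diag(self.sign_s * torch.exp(self.log_s))
        return self.p @ l @ u

    def forward(self, x):
        _, c, h, w = x.shape
        y = F.conv2d(x, self._weight().view(c, c, 1, 1))
        # log|det W| = sum_c log|s_c|, so no determinant is ever computed
        log_det = h * w * self.log_s.sum()
        return y, log_det
```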
A standard step in modern coupling architectures combines these three components in sequence. First, ActNorm normalizes the activations. Next, the invertible 1×1 convolution mixes the channels. Finally, an affine coupling layer applies the non-linear transformation. A sketch of this composition appears after the figure below.
Figure: Sequence of operations in a single step of the Glow architecture.
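Putting the pieces together, one step of flow can be written as a simple composition of the three modules that accumulates their log determinants. This sketch reuses the classes defined above; the `AffineCoupling` layer is assumed to exist elsewhere and to follow the same `(output, log_det)` interface.

```python
import torch.nn as nn

class FlowStep(nn.Module):
    """One step of flow: ActNorm -> invertible 1x1 conv -> affine coupling (sketch)."""
    def __init__(self, num_channels):
        super().__init__()
        self.actnorm = ActNorm(num_channels)
        self.conv = InvertibleConvLU(num_channels)
        self.coupling = AffineCoupling(num_channels)  # assumed to be defined elsewhere

    def forward(self, x):
        total_log_det = 0.0
        for layer in (self.actnorm, self.conv, self.coupling):
            x, log_det = layer(x)
            total_log_det = total_log_det + log_det
        return x, total_log_det
```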