When quantizing a model, we map high-precision floating-point numbers to lower-precision integers using a scale factor s and a zero-point z. A fundamental question arises: should we use the same s and z for an entire tensor (like all the weights in a layer), or should we use different parameters for different parts of the tensor? This choice determines the quantization granularity. The level of granularity significantly impacts the trade-off between model compression, inference speed, and potential accuracy loss. Let's examine the common options.
The simplest approach is per-tensor quantization. Here, a single scale factor s and a single zero-point z are calculated and applied to all the values within an entire tensor. For instance, if we consider the weight matrix of a linear layer, we would find the minimum and maximum values across all weights in that matrix to determine one pair of (s,z) values.
quantized_tensor = clip(round(original_tensor / s) + z, Q_min, Q_max)

where s and z are computed from the global minimum and maximum values of original_tensor.
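As a concrete illustration, here is a minimal NumPy sketch of this per-tensor computation. The function names, the unsigned 8-bit range, and the sample values are assumptions for illustration, not a specific library's API.

```python
import numpy as np

def quantize_per_tensor(x, num_bits=8):
    """Quantize an entire tensor with a single (scale, zero_point) pair."""
    qmin, qmax = 0, 2 ** num_bits - 1            # e.g. 0..255 for unsigned 8-bit
    x_min = min(float(x.min()), 0.0)             # make sure the range includes zero
    x_max = max(float(x.max()), 0.0)
    scale = (x_max - x_min) / (qmax - qmin)
    scale = scale if scale > 0 else 1e-8         # guard against an all-zero tensor
    zero_point = int(round(qmin - x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Map quantized integers back to approximate floating-point values."""
    return (q.astype(np.float32) - zero_point) * scale

# Example: one large outlier (8.1) stretches the range shared by every other value.
w = np.array([0.05, -0.12, 0.30, 8.1, -0.07, 0.22], dtype=np.float32)
q, s, z = quantize_per_tensor(w)
print(dequantize(q, s, z))   # the small values lose most of their precision
```

Running the example shows the effect discussed below: because the single (s, z) pair must cover the outlier, the small weights are reconstructed only coarsely.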
Advantages:
- Minimal overhead: only one scale and one zero-point are stored per tensor.
- Simple and fast: kernels apply a single (s, z) pair, which virtually all inference runtimes and hardware support.
Disadvantages:
- A single outlier stretches the quantization range for the whole tensor, wasting integer levels and reducing precision for the majority of values.
- Often causes noticeable accuracy loss for weight tensors whose channels have very different value ranges.
A single set of quantization parameters (s, z) is derived from the minimum and maximum values across the entire tensor. The large value 8.1 significantly widens the range.
To address the limitations of per-tensor quantization, we can use per-channel quantization, which is especially useful for convolutional and linear layers. For a weight tensor in a linear layer (shape [output_features, input_features]) or a convolutional layer (shape [output_channels, input_channels, height, width]), this typically means calculating a separate pair of (s, z) values for each output channel.
Imagine slicing the weight tensor along the output channel dimension. For each slice (representing the weights connecting to one output neuron or channel), we independently find the minimum and maximum values and compute its specific s and z.
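A minimal sketch of this per-output-channel computation for a 2-D weight matrix is shown below, again in illustrative NumPy with asymmetric 8-bit quantization; the helper name and layout are assumptions rather than a particular library's interface.

```python
import numpy as np

def quantize_per_channel(w, num_bits=8):
    """Quantize a [output_features, input_features] weight matrix with one
    (scale, zero_point) pair per output channel, i.e. per row."""
    qmin, qmax = 0, 2 ** num_bits - 1
    w_min = np.minimum(w.min(axis=1), 0.0)        # per-row minimum, shape [out_features]
    w_max = np.maximum(w.max(axis=1), 0.0)        # per-row maximum, shape [out_features]
    scales = (w_max - w_min) / (qmax - qmin)
    scales = np.where(scales > 0, scales, 1e-8)   # guard against all-zero rows
    zero_points = np.round(qmin - w_min / scales).astype(np.int32)
    q = np.clip(np.round(w / scales[:, None]) + zero_points[:, None], qmin, qmax)
    return q.astype(np.uint8), scales, zero_points
```

Because each row now owns its own (s, z) pair, an outlier such as 8.1 in one channel no longer inflates the range used by the other channels.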
Advantages:
- Each channel's range is fitted to its own values, preserving far more precision when channel magnitudes differ.
- Typically recovers most of the accuracy lost by per-tensor quantization, with only a small storage cost (one (s, z) pair per output channel).
Disadvantages:
- Slightly more metadata to store and somewhat more complex kernels than per-tensor quantization.
- An outlier within a channel still widens the range for that entire channel.
Separate quantization parameters (s, z) are calculated for each channel (row in this example). The range for Channel 1 is still wide due to 8.1, but other channels (e.g., Channel 2, 3, 4) use ranges tailored to their specific values, preserving more precision.
Pushing granularity even further, per-group quantization subdivides the elements within each channel (or row/column) into smaller groups and calculates separate (s, z) parameters for each group. For a weight matrix of shape [output_features, input_features], grouping is often applied along the input_features dimension. Common group sizes are 32, 64, or 128.
For example, if input_features is 1024 and the group size is 128, each output channel's weight vector would be split into 1024/128 = 8 groups, and each group would have its own (s, z).
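Following the same pattern as the earlier sketches, a per-group version can be illustrated by reshaping each row into groups along the input dimension and computing one (s, z) pair per group (hypothetical NumPy helper, not a specific library's API):

```python
import numpy as np

def quantize_per_group(w, group_size=128, num_bits=8):
    """Quantize a [output_features, input_features] weight matrix with one
    (scale, zero_point) pair per group of `group_size` input weights."""
    out_f, in_f = w.shape
    assert in_f % group_size == 0, "input_features must be divisible by group_size"
    qmin, qmax = 0, 2 ** num_bits - 1
    groups = w.reshape(out_f, in_f // group_size, group_size)   # [out, n_groups, group_size]
    g_min = np.minimum(groups.min(axis=-1, keepdims=True), 0.0)
    g_max = np.maximum(groups.max(axis=-1, keepdims=True), 0.0)
    scales = (g_max - g_min) / (qmax - qmin)
    scales = np.where(scales > 0, scales, 1e-8)                 # guard against all-zero groups
    zero_points = np.round(qmin - g_min / scales)
    q = np.clip(np.round(groups / scales) + zero_points, qmin, qmax)
    # Quantized weights return to the original shape; (s, z) are stored per group.
    return q.reshape(out_f, in_f).astype(np.uint8), scales.squeeze(-1), zero_points.squeeze(-1)
```

With group_size=128 and input_features=1024, this stores the 8 (s, z) pairs per output channel mentioned in the example above.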
Advantages:
- Even finer adaptation to local value ranges; an outlier only affects the parameters of its own group.
- Often necessary to maintain accuracy at very low bit-widths, such as 4-bit weights.
Disadvantages:
- More (s, z) pairs to store, so metadata overhead grows as the group size shrinks.
- Requires more complex quantization and matrix-multiplication kernels, and not every inference runtime supports every group size.
Within a single channel (row), weights are divided into groups (here, size 2 for illustration). Each group gets its own parameters (s, z), allowing even finer adaptation to value ranges. The outlier 8.1 now only affects the parameters for Group 2.
The choice of quantization granularity depends on several factors:
- Accuracy requirements: finer granularity generally preserves more accuracy, particularly when value ranges vary widely across channels or groups.
- Target bit-width: lower bit-widths usually demand finer granularity to remain usable.
- Overhead: every additional (s, z) pair adds metadata to store and apply at inference time.
- Hardware and runtime support: the chosen granularity must be supported by the kernels you intend to run.
In practice, per-channel quantization is a common default for weights, while per-group quantization is employed when pushing precision lower or when per-channel proves insufficient. Per-tensor is simplest but often reserved for activations or situations where accuracy impact is minimal. Understanding these options allows you to make informed decisions when applying quantization to your LLMs.