When quantizing a model, we map high-precision floating-point numbers to lower-precision integers using a scale factor $s$ and a zero-point $z$. A fundamental question arises: should we use the same $s$ and $z$ for an entire tensor (like all the weights in a layer), or should we use different parameters for different parts of the tensor? This choice determines the quantization granularity. The level of granularity significantly impacts the trade-off between model compression, inference speed, and potential accuracy loss.

## Per-Tensor Quantization

The simplest approach is per-tensor quantization. Here, a single scale factor $s$ and a single zero-point $z$ are calculated and applied to all the values within an entire tensor. For instance, for the weight matrix of a linear layer, we would find the minimum and maximum values across all weights in that matrix to determine one pair of $(s, z)$ values:

$$ \text{quantized\_tensor} = \text{clip}\left(\text{round}\left(\frac{\text{original\_tensor}}{s}\right) + z,\ Q_{\min},\ Q_{\max}\right) $$

where $s$ and $z$ are computed from the global minimum and maximum values of `original_tensor`.

**Advantages:**

- **Simplicity:** Easy to implement and manage.
- **Minimal Overhead:** Requires storing only one $s$ and one $z$ per tensor, adding negligible storage cost.

**Disadvantages:**

- **Sensitivity to Range:** If a tensor contains a few large outlier values, the overall range `[min, max]` becomes very wide. This forces the scale factor $s$ to be large, mapping the majority of smaller, more frequent values onto just a few integer levels, which loses precision and can harm accuracy significantly.

```dot
digraph G {
  rankdir=LR;
  node [shape=plaintext, fontsize=10];
  subgraph cluster_tensor {
    label = "Weight Tensor (e.g., 4x4)"; style=dashed; bgcolor="#e9ecef";
    T [label=<
      <TABLE BORDER="1" CELLBORDER="1" CELLSPACING="0">
        <TR><TD>-1.2</TD><TD>0.5</TD><TD>8.1</TD><TD>-0.3</TD></TR>
        <TR><TD>0.8</TD><TD>-0.1</TD><TD>0.9</TD><TD>1.5</TD></TR>
        <TR><TD>-0.5</TD><TD>1.1</TD><TD>-0.8</TD><TD>0.2</TD></TR>
        <TR><TD>1.3</TD><TD>-0.9</TD><TD>0.4</TD><TD>-1.1</TD></TR>
      </TABLE> >];
  }
  subgraph cluster_params {
    label = "Per-Tensor Parameters"; style=dashed; bgcolor="#fff3bf";
    P [label=<
      <TABLE BORDER="0" CELLBORDER="1" CELLSPACING="0">
        <TR><TD BGCOLOR="#ffe066">Global Min: -1.2</TD></TR>
        <TR><TD BGCOLOR="#ffe066">Global Max: 8.1</TD></TR>
        <TR><TD> → Single (s, z) </TD></TR>
      </TABLE> >];
  }
  T -> P [style=invis];
}
```

*A single set of quantization parameters (s, z) is derived from the minimum and maximum values across the entire tensor. The large value 8.1 significantly widens the range.*
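To make the formula concrete, here is a minimal PyTorch sketch of asymmetric per-tensor quantization applied to the 4x4 matrix from the figure. The helper names (`per_tensor_quantize`, `dequantize`) and the choice of an unsigned 8-bit range are illustrative assumptions, not a reference implementation.

```python
import torch

def per_tensor_quantize(x: torch.Tensor, num_bits: int = 8):
    """Asymmetric per-tensor quantization: one (s, z) pair for the whole tensor."""
    qmin, qmax = 0, 2 ** num_bits - 1                # e.g. [0, 255] for 8 bits
    x_min = min(x.min().item(), 0.0)                 # include 0 so it is exactly representable
    x_max = max(x.max().item(), 0.0)
    scale = (x_max - x_min) / (qmax - qmin)          # one scale for every element
    zero_point = int(round(qmin - x_min / scale))    # one zero-point for every element
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax).to(torch.uint8)
    return q, scale, zero_point

def dequantize(q: torch.Tensor, scale: float, zero_point: int) -> torch.Tensor:
    """Approximate reconstruction: x_hat = s * (q - z)."""
    return scale * (q.float() - zero_point)

# The outlier 8.1 stretches the global range to [-1.2, 8.1], so the smaller,
# more frequent weights are mapped onto relatively few integer levels.
w = torch.tensor([[-1.2,  0.5,  8.1, -0.3],
                  [ 0.8, -0.1,  0.9,  1.5],
                  [-0.5,  1.1, -0.8,  0.2],
                  [ 1.3, -0.9,  0.4, -1.1]])
q, s, z = per_tensor_quantize(w)
print(q)
print(dequantize(q, s, z))   # compare against w to see the rounding error
```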
## Per-Channel Quantization

To address the limitations of per-tensor quantization, especially in convolutional or linear layers, we can use per-channel quantization. For a weight tensor in a linear layer (shape `[output_features, input_features]`) or a convolutional layer (shape `[output_channels, input_channels, height, width]`), this typically means calculating a separate pair of $(s, z)$ values for each output channel.

Imagine slicing the weight tensor along the output channel dimension. For each slice (representing the weights connecting to one output neuron or channel), we independently find the minimum and maximum values and compute its specific $s$ and $z$.

**Advantages:**

- **Improved Accuracy:** By adapting to the specific range of values within each channel, this method often preserves accuracy much better than per-tensor quantization, because outliers in one channel do not affect the quantization of others.
- **Widely Applicable:** It is a common and effective technique for quantizing weights in many standard network architectures.

**Disadvantages:**

- **Increased Overhead:** We now need to store one $(s, z)$ pair per output channel, increasing the storage overhead compared to per-tensor (though it typically remains small relative to the original weight size).
- **Slightly More Complex:** Requires channel-wise computation of the parameters and their channel-wise application during quantization and dequantization.

```dot
digraph G {
  rankdir=LR;
  node [shape=plaintext, fontsize=10];
  subgraph cluster_tensor {
    label = "Weight Tensor (e.g., 4x4)"; style=dashed; bgcolor="#e9ecef";
    T [label=<
      <TABLE BORDER="1" CELLBORDER="1" CELLSPACING="0">
        <TR><TD BGCOLOR="#a5d8ff">-1.2</TD><TD BGCOLOR="#a5d8ff">0.5</TD><TD BGCOLOR="#a5d8ff">8.1</TD><TD BGCOLOR="#a5d8ff">-0.3</TD></TR>
        <TR><TD BGCOLOR="#96f2d7">0.8</TD><TD BGCOLOR="#96f2d7">-0.1</TD><TD BGCOLOR="#96f2d7">0.9</TD><TD BGCOLOR="#96f2d7">1.5</TD></TR>
        <TR><TD BGCOLOR="#ffc9c9">-0.5</TD><TD BGCOLOR="#ffc9c9">1.1</TD><TD BGCOLOR="#ffc9c9">-0.8</TD><TD BGCOLOR="#ffc9c9">0.2</TD></TR>
        <TR><TD BGCOLOR="#ffd8a8">1.3</TD><TD BGCOLOR="#ffd8a8">-0.9</TD><TD BGCOLOR="#ffd8a8">0.4</TD><TD BGCOLOR="#ffd8a8">-1.1</TD></TR>
      </TABLE> >];
  }
  subgraph cluster_params {
    label = "Per-Channel Parameters"; style=dashed; bgcolor="#d0bfff";
    P [label=<
      <TABLE BORDER="0" CELLBORDER="1" CELLSPACING="0">
        <TR><TD BGCOLOR="#a5d8ff">Channel 1: Min=-1.2, Max=8.1 → (s1, z1)</TD></TR>
        <TR><TD BGCOLOR="#96f2d7">Channel 2: Min=-0.1, Max=1.5 → (s2, z2)</TD></TR>
        <TR><TD BGCOLOR="#ffc9c9">Channel 3: Min=-0.8, Max=1.1 → (s3, z3)</TD></TR>
        <TR><TD BGCOLOR="#ffd8a8">Channel 4: Min=-1.1, Max=1.3 → (s4, z4)</TD></TR>
      </TABLE> >];
  }
  T -> P [style=invis];
}
```

*Separate quantization parameters (s, z) are calculated for each channel (row in this example). The range for Channel 1 is still wide due to 8.1, but the other channels (Channels 2, 3, and 4) use ranges tailored to their specific values, preserving more precision.*
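The sketch below extends the per-tensor example, under the same assumptions, to one $(s, z)$ pair per output channel (row) of a 2-D weight matrix; the function name `per_channel_quantize` is again illustrative.

```python
import torch

def per_channel_quantize(w: torch.Tensor, num_bits: int = 8):
    """Asymmetric per-channel quantization of a weight matrix shaped
    [output_features, input_features]: one (s, z) pair per output channel (row)."""
    qmin, qmax = 0, 2 ** num_bits - 1
    w_min = torch.minimum(w.amin(dim=1, keepdim=True), torch.zeros(1))  # per-row min (and 0)
    w_max = torch.maximum(w.amax(dim=1, keepdim=True), torch.zeros(1))  # per-row max (and 0)
    scale = (w_max - w_min) / (qmax - qmin)                             # shape [output_features, 1]
    zero_point = torch.round(qmin - w_min / scale)                      # shape [output_features, 1]
    q = torch.clamp(torch.round(w / scale) + zero_point, qmin, qmax).to(torch.uint8)
    return q, scale, zero_point

# Same 4x4 example as the figure: only row 1 contains the outlier 8.1,
# so only that row pays for the wide range.
w = torch.tensor([[-1.2,  0.5,  8.1, -0.3],
                  [ 0.8, -0.1,  0.9,  1.5],
                  [-0.5,  1.1, -0.8,  0.2],
                  [ 1.3, -0.9,  0.4, -1.1]])
q, s, z = per_channel_quantize(w)
w_hat = s * (q.float() - z)                    # dequantize with broadcasting
print((w - w_hat).abs().max(dim=1).values)     # per-row error: largest for row 1
```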
## Per-Group Quantization

Pushing granularity even further, per-group quantization subdivides the elements within each channel (or row/column) into smaller groups and calculates separate $(s, z)$ parameters for each group. For a weight matrix of shape `[output_features, input_features]`, grouping is often applied along the `input_features` dimension. Common group sizes are 32, 64, or 128.

For example, if `input_features` is 1024 and the group size is 128, each output channel's weight vector would be split into $1024 / 128 = 8$ groups, and each group would have its own $(s, z)$.

**Advantages:**

- **Highest Accuracy Potential:** Offers the most fine-grained adaptation to local value distributions. This is particularly beneficial for very low-precision quantization (e.g., INT4, INT3), where capturing local variations is important for minimizing accuracy loss. Techniques like GPTQ often rely on group-wise quantization.
- **Handles Local Variations:** Can effectively manage tensors where the range of values varies significantly even within a single channel.

**Disadvantages:**

- **Maximum Overhead:** Requires storing the largest number of $(s, z)$ parameters, significantly increasing the metadata overhead compared to per-channel or per-tensor.
- **Increased Complexity:** The computation during quantization and inference becomes more complex due to managing parameters at the group level.

```dot
digraph G {
  rankdir=LR;
  node [shape=plaintext, fontsize=10];
  subgraph cluster_tensor {
    label = "Weight Tensor (Focus on one Channel)"; style=dashed; bgcolor="#e9ecef";
    T [label=<
      <TABLE BORDER="1" CELLBORDER="1" CELLSPACING="0">
        <TR><TD BGCOLOR="#a5d8ff" width="30">-1.2</TD><TD BGCOLOR="#a5d8ff" width="30">0.5</TD><TD BGCOLOR="#74c0fc" width="30">8.1</TD><TD BGCOLOR="#74c0fc" width="30">-0.3</TD><TD BGCOLOR="#4dabf7" width="30">0.8</TD><TD BGCOLOR="#4dabf7" width="30">-0.1</TD><TD BGCOLOR="#339af0" width="30">0.9</TD><TD BGCOLOR="#339af0" width="30">1.5</TD></TR>
        <TR><TD>...</TD><TD>...</TD><TD>...</TD><TD>...</TD><TD>...</TD><TD>...</TD><TD>...</TD><TD>...</TD></TR>
      </TABLE> >];
  }
  subgraph cluster_params {
    label = "Per-Group Parameters (Group size = 2)"; style=dashed; bgcolor="#c0eb75";
    P [label=<
      <TABLE BORDER="0" CELLBORDER="1" CELLSPACING="0">
        <TR><TD BGCOLOR="#a5d8ff">Channel 1, Group 1: Min=-1.2, Max=0.5 → (s11, z11)</TD></TR>
        <TR><TD BGCOLOR="#74c0fc">Channel 1, Group 2: Min=-0.3, Max=8.1 → (s12, z12)</TD></TR>
        <TR><TD BGCOLOR="#4dabf7">Channel 1, Group 3: Min=-0.1, Max=0.8 → (s13, z13)</TD></TR>
        <TR><TD BGCOLOR="#339af0">Channel 1, Group 4: Min=0.9, Max=1.5 → (s14, z14)</TD></TR>
        <TR><TD>... (Parameters for other channels)</TD></TR>
      </TABLE> >];
  }
  T -> P [style=invis];
}
```

*Within a single channel (row), weights are divided into groups (here, size 2 for illustration). Each group gets its own parameters (s, z), allowing even finer adaptation to value ranges. The outlier 8.1 now only affects the parameters for Group 2.*
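Below is a sketch of group-wise quantization for a 2-D weight matrix, mirroring the figure with a group size of 2 and a 4-bit range on a single row; the function `per_group_quantize` is an illustrative assumption, not the exact scheme of any particular library such as GPTQ.

```python
import torch

def per_group_quantize(w: torch.Tensor, group_size: int = 128, num_bits: int = 4):
    """Asymmetric per-group quantization of a weight matrix shaped
    [output_features, input_features]: each row is split into groups of
    `group_size` consecutive weights, and every group gets its own (s, z)."""
    out_features, in_features = w.shape
    assert in_features % group_size == 0, "input_features must be divisible by group_size"
    qmin, qmax = 0, 2 ** num_bits - 1
    groups = w.reshape(out_features, in_features // group_size, group_size)
    g_min = torch.minimum(groups.amin(dim=-1, keepdim=True), torch.zeros(1))
    g_max = torch.maximum(groups.amax(dim=-1, keepdim=True), torch.zeros(1))
    scale = (g_max - g_min) / (qmax - qmin)              # shape [out, n_groups, 1]
    zero_point = torch.round(qmin - g_min / scale)       # shape [out, n_groups, 1]
    q = torch.clamp(torch.round(groups / scale) + zero_point, qmin, qmax)
    return q.reshape(out_features, in_features).to(torch.uint8), scale, zero_point

# One row with eight weights, as in the figure; the outlier 8.1 only widens
# the range of its own group of two.
w = torch.tensor([[-1.2, 0.5, 8.1, -0.3, 0.8, -0.1, 0.9, 1.5]])
q, s, z = per_group_quantize(w, group_size=2, num_bits=4)
print(q.reshape(1, 4, 2))    # integer codes, grouped in pairs
print(s.squeeze(-1))         # four scales for this single row
```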
## Selecting the Granularity

The choice of quantization granularity depends on several factors:

- **Target Precision:** For less aggressive quantization (e.g., INT8), per-channel often provides a good balance. For very low precision (INT4, INT3), per-group quantization is frequently necessary to maintain acceptable accuracy.
- **Tensor Type:** Weights often benefit from finer granularity (per-channel or per-group) because their distributions can vary significantly across channels or even within channels. Activations are sometimes quantized per-tensor, although dynamic quantization (per-token, often per-tensor within that token) is also common.
- **Hardware/Software Support:** The target inference engine or hardware might have optimized routines only for specific granularities.
- **Overhead Tolerance:** Finer granularity increases the storage needed for scale factors and zero-points and can add computational overhead during inference setup or dequantization; a rough parameter count is sketched at the end of this section.
- **Accuracy Requirements:** The most critical factor. If simpler schemes like per-tensor cause an unacceptable drop in model performance on downstream tasks, finer granularities become necessary.

In practice, per-channel quantization is a common default for weights, while per-group quantization is employed when pushing precision lower or when per-channel proves insufficient. Per-tensor is the simplest scheme but is often reserved for activations or situations where the accuracy impact is minimal. Understanding these options allows you to make informed decisions when applying quantization to your LLMs.
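As a rough illustration of the overhead factor above, the following sketch counts the $(s, z)$ pairs each granularity needs for a hypothetical 4096 x 4096 weight matrix quantized to INT4, assuming FP16 scales and zero-points and a group size of 128; actual storage formats vary between tools.

```python
# Metadata cost of each granularity for a hypothetical 4096 x 4096 weight matrix,
# assuming every scale and zero-point is stored as FP16 (2 bytes each).
out_features, in_features = 4096, 4096
group_size = 128
bytes_per_pair = 2 + 2                           # one FP16 scale + one FP16 zero-point

pairs = {
    "per-tensor": 1,
    "per-channel": out_features,                                      # one pair per output channel
    f"per-group ({group_size})": out_features * (in_features // group_size),
}

weight_bytes = out_features * in_features // 2   # the INT4 weights themselves (~8 MiB)
for name, n in pairs.items():
    meta_bytes = n * bytes_per_pair
    print(f"{name:>16}: {n:>7} (s, z) pairs -> {meta_bytes:>9,} bytes "
          f"({100 * meta_bytes / weight_bytes:.2f}% of the INT4 weights)")
```

Even at per-group granularity the metadata remains a small fraction of the INT4 weights in this example, but it grows linearly as the group size shrinks.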