We've established the general principles of Post-Training Quantization, including calibration using representative data to determine appropriate scaling factors and zero-points. Now, let's focus on how these principles are applied specifically to the types of layers that dominate Large Language Model architectures. The most computationally intensive and memory-heavy layers in typical Transformer-based LLMs are the Linear (or Fully Connected) layers and the Embedding layers.
Linear layers perform matrix multiplication, forming the backbone of the computations inside the Feed-Forward Networks (FFN) and the attention mechanisms of Transformers. A Linear layer computes $Y = XW^T + b$, where $X$ is the input activation tensor, $W$ is the weight matrix, and $b$ is an optional bias vector.
Weight Quantization: The weight matrix $W$ is typically the primary target for quantization in Linear layers due to its size. Using the weights themselves (and, for the more advanced PTQ methods covered later, the calibration data), we determine the range of values within $W$. Based on this range, the chosen quantization scheme (symmetric or asymmetric), and the target bit depth (e.g., INT8, INT4), we calculate the scale ($s_W$) and zero-point ($z_W$) needed to map the FP32 weights to the lower-precision integer format.
$$W_{quant} = \mathrm{round}(W / s_W) + z_W$$

A significant consideration here is granularity: the scale and zero-point can be computed per-tensor (one pair for the whole matrix), per-channel (one pair per output channel, i.e., per row of $W$), or per-group (one pair per block of weights within a channel). Finer granularity generally preserves more accuracy at the cost of storing and applying more quantization parameters.
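As a concrete illustration, the sketch below (PyTorch; the function and variable names are illustrative, not from any particular library) applies symmetric INT8 quantization to a weight matrix at either per-tensor or per-channel granularity.

```python
import torch

def quantize_weights_symmetric(W: torch.Tensor, per_channel: bool = True):
    """Map FP32 weights to INT8 via W_quant = round(W / s_W); symmetric, so z_W = 0."""
    qmax = 127  # symmetric INT8 range [-127, 127]
    if per_channel:
        # One scale per output channel (row of W): finer granularity, lower error.
        max_abs = W.abs().amax(dim=1, keepdim=True)   # shape (out_features, 1)
    else:
        # A single scale for the entire tensor: coarsest, cheapest granularity.
        max_abs = W.abs().amax()
    s_W = max_abs.clamp(min=1e-8) / qmax
    W_q = torch.clamp(torch.round(W / s_W), -qmax, qmax).to(torch.int8)
    return W_q, s_W

W = torch.randn(4096, 4096)                 # e.g. one FFN projection matrix
W_q, s_W = quantize_weights_symmetric(W)    # per-channel by default
```

Note that per-channel scales add only one FP32 value per output channel, which is negligible next to the INT8 weight storage itself.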
Bias terms ($b$) are usually kept in FP32 or quantized to a higher-precision integer format (like INT32) using a scale derived from the weight and activation scales (typically $s_X s_W$), as they represent a much smaller portion of the total parameters.
Activation Quantization (for Static PTQ): In static PTQ, the input activations $X$ are also quantized. This requires observing the range of activation values during the calibration phase. We pass the calibration dataset through the model and record the minimum and maximum values observed for the input tensor $X$ of each Linear layer.
From these observed ranges, we calculate the activation scale ($s_X$) and zero-point ($z_X$).
$$X_{quant} = \mathrm{round}(X / s_X) + z_X$$

Activations are often quantized per-tensor, mainly for performance reasons, since calculating per-channel scales at runtime (or storing per-channel statistics) adds complexity. However, as discussed previously, activation outliers can pose a significant challenge, potentially leading to large quantization errors if not handled carefully (e.g., using clipping or the more advanced PTQ techniques covered in Chapter 3).
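A minimal sketch of this calibration step is shown below, assuming a simple min/max observer (the class and its methods are illustrative, not a specific framework API): it tracks the per-tensor activation range over calibration batches and derives an asymmetric scale and zero-point for unsigned INT8.

```python
import torch

class ActivationObserver:
    """Track per-tensor min/max of a layer's input activations during calibration."""
    def __init__(self):
        self.min_val = float("inf")
        self.max_val = float("-inf")

    def observe(self, X: torch.Tensor):
        # Update the running min/max over all calibration batches.
        self.min_val = min(self.min_val, X.min().item())
        self.max_val = max(self.max_val, X.max().item())

    def compute_qparams(self, qmin: int = 0, qmax: int = 255):
        # Asymmetric quantization to unsigned INT8; the range must include zero.
        lo, hi = min(self.min_val, 0.0), max(self.max_val, 0.0)
        s_X = max((hi - lo) / (qmax - qmin), 1e-8)
        z_X = int(round(qmin - lo / s_X))
        return s_X, z_X

observer = ActivationObserver()
for batch in [torch.randn(8, 128, 4096) for _ in range(4)]:  # stand-in calibration data
    observer.observe(batch)
s_X, z_X = observer.compute_qparams()

X = torch.randn(1, 128, 4096)
X_q = torch.clamp(torch.round(X / s_X) + z_X, 0, 255).to(torch.uint8)
```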
The quantized computation then approximates the original matrix multiplication using integer arithmetic, often accumulating results in a higher precision integer format (e.g., INT32) before dequantizing back to FP32 (or an intermediate format):
$$Y \approx (X_{quant} - z_X)\, s_X \cdot \big((W_{quant} - z_W)\, s_W\big)^T + b$$

(Note: the exact implementation details vary depending on the hardware and libraries used, but they all center on efficient integer matrix multiplication.)
Static quantization workflow for a Linear layer ($Y = XW^T + b$). Activations ($X$) and weights ($W$) are quantized using scales and zero-points derived during calibration. Computation occurs in integer arithmetic, followed by dequantization.
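Continuing the illustration, the helper below reuses `W_q`, `s_W`, `X_q`, `s_X`, and `z_X` from the earlier snippets and emulates the integer product followed by dequantization. This is a sketch of the arithmetic only; real INT8 kernels fuse these steps and accumulate in INT32 in hardware.

```python
import torch

def quantized_linear(X_q, W_q, bias, s_X, z_X, s_W):
    """Emulate Y ≈ (X_q - z_X) s_X · (W_q s_W)^T + b (symmetric weights, so z_W = 0)."""
    # Subtract the activation zero-point and multiply in integer space. Real INT8
    # kernels accumulate in INT32; int64 is used here only to keep the emulation portable.
    acc = torch.matmul(X_q.to(torch.int64) - z_X, W_q.to(torch.int64).t())
    # Dequantize: apply s_X and the per-channel weight scales, then add the FP32 bias.
    # s_W has shape (out_features, 1), so s_W.t() broadcasts over the output features.
    return s_X * acc.to(torch.float32) * s_W.t() + bias

bias = torch.zeros(4096)                      # kept in FP32, as discussed above
Y_approx = quantized_linear(X_q, W_q, bias, s_X, z_X, s_W)
```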
Embedding layers map discrete input tokens (represented by integer IDs) to dense floating-point vectors. They are essentially lookup tables where the "weights" form the embedding matrix. For an input sequence of token IDs, the layer retrieves the corresponding vectors.
Weight Quantization: The embedding matrix itself can be substantial, especially for models with large vocabularies (e.g., 50,000 tokens or more) and high embedding dimensions (e.g., 4096). Quantizing this matrix significantly reduces the model's memory footprint.
Similar to Linear layers, we apply quantization to the embedding table weights. The process involves determining the range of values across the embedding vectors and calculating the scale(s) and zero-point(s), either per-tensor (a single pair for the whole table) or per-row (one pair per embedding vector).
The choice depends on the desired trade-off between compression, accuracy, and implementation complexity. For memory savings, even per-tensor quantization of embeddings to INT8 provides a 4x reduction compared to FP32.
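For example, the sketch below (sizes and names are illustrative assumptions) quantizes a 50,000 x 4096 embedding table to INT8 with one scale per row and compares the storage cost before and after: roughly a 4x reduction, plus a small overhead for the scales.

```python
import torch

vocab_size, embed_dim = 50_000, 4096
E = torch.randn(vocab_size, embed_dim)                 # FP32 embedding table

max_abs = E.abs().amax(dim=1, keepdim=True)            # one range per embedding vector
s_E = max_abs.clamp(min=1e-8) / 127                    # per-row scales
E_q = torch.clamp(torch.round(E / s_E), -127, 127).to(torch.int8)

fp32_bytes = E.numel() * 4
int8_bytes = E_q.numel() * 1 + s_E.numel() * 4         # quantized table + FP32 scales
print(f"FP32: {fp32_bytes / 2**20:.0f} MiB, INT8: {int8_bytes / 2**20:.0f} MiB")
```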
Activation Quantization: The direct input to an Embedding layer is a sequence of integer token IDs. These IDs are not typically "quantized" in the same sense as continuous activation values. They serve as indices for the lookup operation.
The quantization impact relates to the output of the embedding lookup. In a static PTQ scenario where subsequent layers expect quantized inputs, the fetched embedding vectors (which are originally FP32 if only weights are quantized, or dequantized from INT8/INT4) would need to be quantized before being fed into the next layer (e.g., the first attention layer). This quantization step would use scales and zero-points derived from observing the output embedding vector ranges during calibration.
However, it's common to focus PTQ primarily on the embedding weights for memory reduction. The subsequent layers then handle the quantization of their inputs (the looked-up embeddings) as needed.
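A short sketch of this weight-only path follows, reusing `E_q` and `s_E` from the previous snippet: rows are fetched by token ID and dequantized to FP32, leaving any further quantization of these vectors to the next layer's own input quantization step.

```python
import torch

token_ids = torch.tensor([[101, 2009, 2003, 102]])   # illustrative token IDs
rows_q = E_q[token_ids]                               # (batch, seq, embed_dim), int8
rows_s = s_E[token_ids]                               # matching per-row scales, (batch, seq, 1)
embeddings = rows_q.to(torch.float32) * rows_s        # dequantized FP32 embedding vectors
```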
Applying PTQ isn't always uniform across all layers: components that are especially sensitive to quantization error (such as the embedding table or the final output projection) are sometimes kept at higher precision while the remaining Linear layers are quantized more aggressively.
By applying PTQ to the weight matrices of Linear and Embedding layers, we achieve substantial reductions in model size. When static PTQ is used, quantizing the activations further enables integer-based computations, potentially speeding up inference, although careful calibration is needed to manage the accuracy impact.