Activation-aware Weight Quantization (AWQ) stands out because it selectively preserves weight precision based on the magnitudes of the corresponding activations, aiming for better accuracy at very low bit-depths such as 4-bit (INT4). As discussed in Chapter 3, this involves calculating scaling factors for weight groups based not just on the weights themselves, but on how important they are deemed to be, judged from activation patterns observed on a calibration dataset.

This activation-aware approach requires storing more than just the packed low-bit weights. The carefully calculated scaling factors are essential for correctly dequantizing the quantized weights, or computing with them directly, during inference. The "format" of an AWQ model therefore revolves around how these quantized weights and their associated scaling factors are saved and structured for later use.

## Structure of AWQ Quantized Models

Unlike GGUF, there is no single, standardized file extension or monolithic container defining the "AWQ format." Instead, AWQ implementations commonly adapt existing model storage conventions, such as those used by Hugging Face Transformers, by saving specific artifacts alongside the standard model components.

When you quantize a model using a library like autoawq, the output directory typically contains:

- **Quantized Weights:** The core of the quantized model. These are the low-precision weights (e.g., INT4, INT3). Because standard tensor formats don't natively support types like INT4, these weights are usually "packed" into a standard integer type (for example INT8, where two INT4 values fit into one byte). The packed weights are saved in common tensor serialization formats such as `.safetensors` or `.bin` files (PyTorch's `pytorch_model.bin`).
- **Scaling Factors:** A distinguishing feature of AWQ models. These floating-point values, calculated during the AWQ process from activation statistics, are needed to scale the quantized weights correctly during inference. They are typically saved per-channel or per-group, matching the granularity used during quantization (a group size of 128 is common). The scales may be stored in the same tensor file as the weights or in a separate, dedicated file.
- **Zero Points:** AWQ often uses symmetric quantization (zero-point of zero), but if asymmetric quantization is used, the corresponding integer zero-points must also be saved alongside the scales, usually at the same granularity (per-channel or per-group).
- **Updated Configuration File:** The model's standard configuration file (e.g., `config.json`) is modified or augmented. It needs to contain metadata indicating that the model is AWQ-quantized and specifying the parameters used (an illustrative example follows this list):
  - the quantization method identifier (e.g., `"quant_method": "awq"`),
  - the bit depth (e.g., `"bits": 4`),
  - the group size (e.g., `"group_size": 128`),
  - whether zero-points are used and saved,
  - information mapping the quantized layers back to their corresponding scales and zero-points,
  - and sometimes architectural hints or custom model classes (such as `"auto_map": { "AutoModelForCausalLM": "awq.models.auto.AutoAWQForCausalLM" }`) to guide loading libraries.
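For concreteness, the AWQ-related metadata in `config.json` often looks roughly like the block below. Treat this as an illustrative sketch built from the fields listed above: exact field names and nesting differ between autoawq and `transformers` versions, some versions write a separate quantization config file instead, and additional fields (such as a `version` key selecting the kernel/packing variant) may appear.

```json
{
  "model_type": "llama",
  "quantization_config": {
    "quant_method": "awq",
    "bits": 4,
    "group_size": 128,
    "zero_point": true,
    "version": "gemm"
  }
}
```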
*Figure: Typical components found in a directory containing an AWQ-quantized model, derived from the original FP32 model. The main additions are the packed low-bit weights, the activation-aware scales (and optional zero-points), and the updated configuration file specifying the AWQ parameters.*

## Loading and Using AWQ Models

Loading an AWQ-quantized model requires logic that understands this structure. You generally cannot load an AWQ model with a plain `transformers` `from_pretrained` call unless the appropriate integration or custom code is in place. Libraries designed for AWQ handle the process as follows:

1. **Detection:** The loading code first inspects `config.json` to identify the model as AWQ-quantized and retrieve parameters such as the bit depth and group size.
2. **Weight and Parameter Loading:** It loads the packed low-bit weights, the scaling factors, and the zero-points (if any).
3. **Model Instantiation:** An instance of the base model architecture is created.
4. **Layer Replacement:** Standard layers (such as `torch.nn.Linear`) are replaced with specialized AWQ layers. These custom layers perform computations with the quantized weights and apply the scaling factors (and zero-points) efficiently during the forward pass, often by unpacking the weights on the fly or calling custom computation kernels. A simplified sketch of such a layer follows this list.
5. **Kernel Setup (optional):** For optimal performance, especially on GPUs, AWQ libraries often rely on custom CUDA kernels for the low-bit matrix multiplications that incorporate the scaling. The loading process ensures these kernels are available and configured.
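To make the layer-replacement step concrete, here is a heavily simplified sketch in PyTorch. It is not autoawq's actual implementation: the names `SimplifiedAWQLinear` and `unpack_int4` are invented for illustration, the packing layout is assumed (two 4-bit values per byte, low nibble first), and real AWQ layers keep the weights packed and fuse the dequantization into the matrix multiply.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def unpack_int4(packed: torch.Tensor) -> torch.Tensor:
    """Unpack two 4-bit values from each uint8 byte (low nibble first).
    Actual packing layouts differ between AWQ implementations."""
    low = packed & 0x0F
    high = (packed >> 4) & 0x0F
    return torch.stack((low, high), dim=-1).flatten(-2)


class SimplifiedAWQLinear(nn.Module):
    """Illustrative stand-in for the specialized layers that replace nn.Linear.
    Real implementations keep the weights packed and call fused low-bit GPU
    kernels; here everything is dequantized explicitly to show how the scales
    and zero-points are applied."""

    def __init__(self, packed_qweight, scales, qzeros, bias=None, group_size=128):
        super().__init__()
        self.group_size = group_size
        # packed_qweight: uint8, shape [out_features, in_features // 2]
        # scales/qzeros:  shape [out_features, in_features // group_size]
        self.register_buffer("qweight", packed_qweight)
        self.register_buffer("scales", scales)
        self.register_buffer("qzeros", qzeros)
        self.bias = bias

    def forward(self, x):
        # Unpack on the fly, then dequantize per group: w = (q - zero) * scale
        q = unpack_int4(self.qweight).float()
        zeros = torch.repeat_interleave(self.qzeros.float(), self.group_size, dim=1)
        scales = torch.repeat_interleave(self.scales, self.group_size, dim=1)
        return F.linear(x, (q - zeros) * scales, self.bias)


# Toy usage: 2 output features, 256 input features, group size 128
packed = torch.randint(0, 256, (2, 128), dtype=torch.uint8)
scales = torch.rand(2, 2) * 0.01
zeros = torch.full((2, 2), 8)
layer = SimplifiedAWQLinear(packed, scales, zeros)
print(layer(torch.randn(1, 256)).shape)  # torch.Size([1, 2])
```

The unpack-and-dequantize work spelled out here is exactly what the custom CUDA kernels avoid by operating on the packed representation directly.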
A typical loading pattern using a library like autoawq hides most of this from the user:

```python
# Example using the autoawq library
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Path to the directory containing config.json, the quantized weights, etc.
quantized_model_dir = "path/to/your/awq_model"

# Load the quantized model; the library handles reading the config,
# loading the weights/scales, and setting up the specialized layers.
model = AutoAWQForCausalLM.from_quantized(quantized_model_dir, device_map="auto")

# The tokenizer is usually loaded separately in the standard way.
tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir)

# The model is now ready for inference.
```

## Practical Notes

- **Library dependence:** Since there is no single file standard, AWQ models are often tightly coupled to the library that created them (a model quantized with autoawq is best loaded with autoawq). Make sure you use compatible library versions for quantization and inference.
- **Kernel requirements:** High performance usually relies on custom CUDA kernels, so a compatible GPU, CUDA toolkit version, and correctly installed library components are required. Inference may fall back to slower implementations if the kernels are unavailable.
- **Format variations:** While the principles are the same, minor variations in how scales and weights are stored or named exist between AWQ implementations and library versions. Always consult the documentation of the specific tool used for quantization.

Understanding the components saved for an AWQ model helps you manage these models and appreciate the role of specialized libraries in loading and executing them efficiently. The main takeaway is the need to store and correctly apply the activation-aware scaling factors, which distinguishes AWQ's storage requirements from simpler quantization methods.
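As a closing practical check, you can list the tensors stored in an AWQ checkpoint to see these components for yourself. The snippet below is a sketch that assumes a `.safetensors` weight file and autoawq-style tensor name suffixes (`qweight`, `qzeros`, `scales`); both the file name and the tensor names vary between implementations.

```python
from safetensors import safe_open

# Hypothetical path to the quantized weight file inside the model directory.
weight_file = "path/to/your/awq_model/model.safetensors"

with safe_open(weight_file, framework="pt") as f:
    for name in f.keys():
        # Quantized layers typically contribute a packed weight tensor plus
        # its per-group scales and zero-points.
        if name.endswith(("qweight", "qzeros", "scales")):
            tensor = f.get_tensor(name)
            print(f"{name}: dtype={tensor.dtype}, shape={tuple(tensor.shape)}")
```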