While Post-Training Quantization (PTQ) offers a compelling path to smaller and faster models without retraining, it's important to understand its inherent limitations, especially when using simpler or "basic" PTQ algorithms like straightforward MinMax quantization. These limitations often manifest as a drop in model accuracy, which can sometimes be significant enough to render the quantized model unsuitable for its intended task.
The core trade-off in quantization is between computational efficiency (lower precision) and representational fidelity (accuracy). Mapping continuous floating-point values to a discrete set of integers inevitably introduces error, known as quantization error. For a value $x$, its quantized representation $x_q$ involves a mapping function $Q(\cdot)$ and a de-quantization function $D(\cdot)$:

$$x_q = D(Q(x))$$

The quantization error for this value is $e = x - x_q$. Basic PTQ methods aim to minimize this error statistically across the model's weights and activations, but they don't always succeed, particularly under aggressive quantization (e.g., using fewer bits, such as INT4).
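To make the round trip concrete, here is a minimal sketch of asymmetric MinMax quantization in NumPy, measuring the per-value error on a synthetic weight tensor. The function names and data are illustrative, not taken from any particular framework.

```python
import numpy as np

def minmax_quantize(x, num_bits=8):
    """Asymmetric MinMax quantization: map [min(x), max(x)] onto the integer grid."""
    qmin, qmax = 0, 2**num_bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = np.round(qmin - x.min() / scale)
    x_q = np.clip(np.round(x / scale + zero_point), qmin, qmax)
    return x_q, scale, zero_point

def dequantize(x_q, scale, zero_point):
    """Map the integers back to the floating-point domain."""
    return scale * (x_q - zero_point)

# Synthetic stand-in for a weight tensor.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 0.05, size=10_000)

x_q, scale, zp = minmax_quantize(x, num_bits=8)
x_hat = dequantize(x_q, scale, zp)
error = x - x_hat
print(f"mean |e|: {np.abs(error).mean():.6f}, max |e|: {np.abs(error).max():.6f}")
```

For values that fall inside the calibrated range, the per-element error is bounded by half the step size ($\text{scale}/2$), which is why the range-setting choices discussed below matter so much.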
Several factors contribute to accuracy degradation in basic PTQ:
Sensitivity to Outliers: Weights and activations in LLMs often exhibit non-uniform distributions with long tails or distinct outliers. Basic range-setting algorithms, like MinMax, determine the quantization scale based on the absolute minimum and maximum values observed during calibration. If significant outliers exist, they stretch this range considerably. This forces the majority of the values, which lie within a much narrower band, to be quantized into only a few available integer levels, leading to substantial precision loss for the bulk of the data. While techniques exist to handle outliers (as discussed previously), basic PTQ implementations might not incorporate sophisticated clipping or range-tuning methods, making them vulnerable.
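The effect is easy to reproduce with the same MinMax scheme: adding a single large outlier to an otherwise narrow tensor inflates the scale and degrades precision for every other value. The numbers below are purely illustrative.

```python
import numpy as np

def minmax_roundtrip_error(x, num_bits=8):
    """Mean absolute error after a MinMax quantize/de-quantize round trip."""
    qmin, qmax = 0, 2**num_bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)
    zp = np.round(qmin - x.min() / scale)
    x_hat = scale * (np.clip(np.round(x / scale + zp), qmin, qmax) - zp)
    return np.abs(x - x_hat).mean()

rng = np.random.default_rng(0)
bulk = rng.normal(0.0, 0.02, size=10_000)   # typical narrow weight distribution
with_outlier = np.append(bulk, 5.0)         # one extreme value stretches the range

print(f"without outlier: {minmax_roundtrip_error(bulk):.6f}")
print(f"with outlier:    {minmax_roundtrip_error(with_outlier):.6f}")
```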
Uniform Quantization vs. Non-Uniform Data: Most basic PTQ schemes employ uniform quantization. This means the gap (step size) between consecutive integer levels represents the same change in the original floating-point scale. However, model parameters and activations are rarely uniformly distributed; they often follow bell-shaped or Laplacian distributions. Uniform quantization is suboptimal for representing these distributions, as it assigns the same precision to densely populated areas near the mean as it does to sparsely populated tail regions.
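One way to see the mismatch is to count how many of the available integer levels a bell-shaped tensor actually occupies, and how concentrated the values are in the central levels. The sketch below does this for a synthetic Gaussian tensor at INT4; the data and bit-width are chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=100_000)   # bell-shaped, like many weight tensors

num_bits = 4
qmin, qmax = 0, 2**num_bits - 1
scale = (x.max() - x.min()) / (qmax - qmin)
zp = np.round(qmin - x.min() / scale)
x_q = np.clip(np.round(x / scale + zp), qmin, qmax).astype(int)

levels, counts = np.unique(x_q, return_counts=True)
central_share = np.sort(counts)[-4:].sum() / counts.sum()
print(f"integer levels in use: {len(levels)} of {qmax - qmin + 1}")
print(f"share of values in the 4 busiest levels: {central_share:.1%}")
```

With uniform steps, the outer levels hold almost no values while a handful of central levels hold the majority, so much of the representational budget is spent on sparsely populated tail regions.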
Layer Sensitivity Variation: Different layers within an LLM can have vastly different sensitivities to quantization noise. For instance, attention mechanisms or later layers responsible for fine-grained predictions might be more susceptible to precision loss than earlier layers. Basic PTQ often applies a uniform quantization strategy (e.g., quantizing all linear layers to INT8) without accounting for this varying sensitivity. This can disproportionately affect critical parts of the model, leading to a noticeable drop in overall performance.
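A simple way to expose this variation is a leave-one-layer sweep: simulate quantization noise on one layer at a time and measure how far the model's output drifts from the full-precision reference. The toy three-layer network below stands in for a real LLM; all shapes and values are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def fake_quant(w, num_bits=4):
    """Round-trip weights through MinMax quantization to simulate quantization noise."""
    qmin, qmax = 0, 2**num_bits - 1
    scale = (w.max() - w.min()) / (qmax - qmin)
    zp = np.round(qmin - w.min() / scale)
    return scale * (np.clip(np.round(w / scale + zp), qmin, qmax) - zp)

def forward(x, weights):
    h = x
    for w in weights:
        h = np.maximum(h @ w, 0.0)   # linear layer + ReLU
    return h

weights = [rng.normal(0.0, 0.1, size=(64, 64)) for _ in range(3)]
calib_batch = rng.normal(0.0, 1.0, size=(32, 64))
reference = forward(calib_batch, weights)

for i in range(len(weights)):
    perturbed = [fake_quant(w) if j == i else w for j, w in enumerate(weights)]
    mse = np.mean((forward(calib_batch, perturbed) - reference) ** 2)
    print(f"layer {i} at INT4 -> output MSE vs. float: {mse:.6f}")
```

Layers whose simulated quantization produces a disproportionately large output error are natural candidates for keeping at higher precision in a mixed-precision scheme.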
Aggressive Low-Precision Quantization: While quantizing to INT8 often yields good results with basic PTQ, pushing to lower bit-widths like INT4 dramatically increases the quantization error. Each integer level must represent a much larger range of floating-point values. Basic PTQ methods often struggle to maintain acceptable accuracy at these lower bit levels without more advanced techniques that specifically address the increased error. The plot below illustrates a typical (hypothetical) trend where accuracy degradation accelerates as bit precision decreases.
Accuracy often drops non-linearly as bit precision decreases, with significant degradation common below INT8 using basic PTQ methods.
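The underlying tensor-level error follows the same trend. The sketch below sweeps the bit-width for a synthetic weight tensor; roughly, each bit removed doubles the step size and therefore the rounding error, which then compounds through the network into accuracy losses like those in the hypothetical curve above.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.05, size=100_000)   # synthetic weight tensor

def minmax_roundtrip_error(x, num_bits):
    qmin, qmax = 0, 2**num_bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)
    zp = np.round(qmin - x.min() / scale)
    x_hat = scale * (np.clip(np.round(x / scale + zp), qmin, qmax) - zp)
    return np.abs(x - x_hat).mean()

for bits in (8, 6, 4, 3, 2):
    print(f"INT{bits}: mean |error| = {minmax_roundtrip_error(w, bits):.6f}")
```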
Calibration Data Dependence: The effectiveness of PTQ heavily relies on the calibration dataset used to determine quantization parameters (scale and zero-point). If the calibration data is small, unrepresentative of the actual inference data distribution, or lacks diversity, the resulting quantization parameters will be suboptimal. This leads to increased quantization error when the model processes real-world inputs. While this affects all PTQ, basic methods might be less robust to imperfections in calibration compared to techniques that adjust weights or use more sophisticated range analysis.
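The sketch below simulates a calibration mismatch: quantization parameters are derived from activations that are narrower than what the model sees at inference time, so many real values fall outside the calibrated range and get clipped. The distributions are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def calibrate(calib, num_bits=8):
    """Derive scale and zero-point from the range observed on the calibration set."""
    qmin, qmax = 0, 2**num_bits - 1
    scale = (calib.max() - calib.min()) / (qmax - qmin)
    zero_point = np.round(qmin - calib.min() / scale)
    return scale, zero_point, qmin, qmax

def roundtrip_error(x, scale, zp, qmin, qmax):
    x_hat = scale * (np.clip(np.round(x / scale + zp), qmin, qmax) - zp)
    return np.abs(x - x_hat).mean()

# Real inference activations are wider than the narrow calibration slice.
real_acts = rng.normal(0.0, 2.0, size=50_000)
narrow_calib = rng.normal(0.0, 0.5, size=2_000)    # unrepresentative
matched_calib = rng.normal(0.0, 2.0, size=2_000)   # representative

for name, calib in [("unrepresentative calibration", narrow_calib),
                    ("representative calibration", matched_calib)]:
    params = calibrate(calib)
    print(f"{name}: mean |error| on real data = {roundtrip_error(real_acts, *params):.6f}")
```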
These limitations highlight that while basic PTQ is a valuable tool for model optimization, it's not a silver bullet. When accuracy preservation is paramount, or when targeting very low bit-widths, the degradation caused by these factors may be unacceptable. This motivates the need for the more advanced PTQ techniques (like GPTQ, AWQ) discussed in the next chapter, or the use of Quantization-Aware Training (QAT), which allows the model to adapt to quantization noise during the training or fine-tuning process. Understanding these limitations helps in choosing the right quantization strategy for your specific model and application requirements.