Practical Quantization for Large Language Models
Chapter 1: Foundations of Model Quantization
Introduction to Model Compression
Why Quantize Large Language Models?
Representing Numbers: Floating-Point vs. Fixed-Point
Integer Data Types in Quantization
Quantization Schemes: Symmetric vs. Asymmetric
Quantization Granularity Options
Measuring Quantization Error
Overview of Quantization Techniques
Chapter 2: Post-Training Quantization (PTQ)
Principles of Post-Training Quantization
Calibration: Selecting Representative Data
Static vs. Dynamic Quantization
Applying PTQ to LLM Layers
Hands-on Practical: Applying Static PTQ
Chapter 3: Advanced PTQ Techniques
Understanding GPTQ Algorithm Mechanics
AWQ: Activation-aware Weight Quantization
SmoothQuant: Mitigating Activation Outliers
Comparing Advanced PTQ Methods
Implementation Considerations for Advanced PTQ
Hands-on Practical: Quantizing with GPTQ
Chapter 4: Quantization-Aware Training (QAT)
The Need for Quantization-Aware Training
Simulating Quantization Effects During Training
Straight-Through Estimator (STE)
Implementing QAT with Deep Learning Frameworks
Fine-tuning Models with Quantization Nodes
Benefits and Drawbacks of QAT vs. PTQ
Practical Considerations for QAT Execution
Hands-on Practical: Setting Up a Simple QAT Run
Chapter 5: Quantization Formats and Tooling
Overview of Common Quantized Model Formats
GGUF: Structure and Usage
GPTQ Format: Library Support and Implementation
Working with Hugging Face Transformers and Optimum
Using bitsandbytes for Quantization
Tools for Model Conversion and Loading
Hands-on Practical: Converting and Loading Quantized Formats
Chapter 6: Evaluating and Deploying Quantized LLMs
Metrics for Evaluating Quantized Models
Benchmarking Inference Speed and Memory Usage
Hardware Considerations for Quantized Inference
Deployment Strategies for Quantized LLMs
Troubleshooting Common Quantization Issues
Analyzing Accuracy vs. Performance Trade-offs
Hands-on Practical: Benchmarking a Quantized LLM