A significant motivation for quantizing large language models is the potential for substantial memory savings. Reduced memory consumption allows models to fit onto hardware with less RAM (like consumer GPUs or edge devices), enables larger batch sizes during inference, or facilitates the deployment of bigger, more capable models on existing infrastructure. Evaluating this memory reduction involves looking at two distinct aspects: the storage space required for the model files on disk and the dynamic memory consumed during runtime operation.
The most direct benefit of quantization is the reduction in the model's file size. When you convert model parameters (primarily weights) from higher-precision formats like 32-bit floating-point (FP32) or 16-bit floating-point (FP16) to lower-precision formats like 8-bit integer (INT8) or 4-bit integer (INT4), each parameter requires fewer bits for storage.
The theoretical reduction is straightforward to calculate. For instance, quantizing an FP16 model (16 bits per parameter) to INT4 (4 bits per parameter) should ideally result in a file size reduction of approximately 75%.
$$\text{Reduction Factor} = 1 - \frac{\text{Quantized Bits}}{\text{Original Bits}}$$

Example (FP16 to INT4): $\text{Reduction} = 1 - \frac{4}{16} = 0.75$, or 75%.

In practice, you measure this by comparing the file sizes of the original and quantized models. Standard operating system commands or simple scripts suffice.
# Check file size on Linux/macOS
ls -lh model_fp16.safetensors
ls -lh model_int4.safetensors
# Or compute the reduction programmatically in Python
import os

fp16_path = "path/to/model_fp16.safetensors"  # placeholder paths: point these
int4_path = "path/to/model_int4.safetensors"  # at your actual model files

# Compare on-disk sizes and compute the relative reduction
fp16_size_bytes = os.path.getsize(fp16_path)
int4_size_bytes = os.path.getsize(int4_path)
reduction_percentage = (1 - int4_size_bytes / fp16_size_bytes) * 100
print(f"FP16 Model Size: {fp16_size_bytes / (1024**3):.2f} GB")
print(f"INT4 Model Size: {int4_size_bytes / (1024**3):.2f} GB")
print(f"Disk Size Reduction: {reduction_percentage:.2f}%")
Keep in mind that the actual reduction might deviate slightly from the theoretical value. Reasons include quantization metadata stored alongside the weights (such as per-group scales and zero-points), tensors like embeddings or normalization layers that are often kept at higher precision, and the file format itself: formats like .gguf or .safetensors store quantized weights efficiently, but the exact size depends on the specific quantization method and parameters used.

Relative model size compared to a baseline FP16 model for common quantization bit widths.
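The theoretical ratios behind such a comparison are easy to compute. The sketch below prints the relative size and reduction for common bit widths against an FP16 baseline; it deliberately ignores metadata and any tensors kept at higher precision, so real files will typically be slightly larger.

```python
# Theoretical model size relative to an FP16 baseline.
# Ignores metadata and tensors kept at higher precision,
# so real quantized files are usually a bit larger.
FP16_BITS = 16

for bits in (8, 6, 4, 3, 2):
    relative_size = bits / FP16_BITS
    reduction = 1 - relative_size
    print(f"{bits}-bit: {relative_size:.0%} of FP16 size ({reduction:.0%} reduction)")
```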
While disk size reduction is important, the memory consumed during inference execution (runtime memory) is often the more critical factor, especially on memory-constrained hardware like GPUs. Measuring it is more complex because runtime memory encompasses several components: the model weights themselves, intermediate activations, the key-value (KV) cache that grows during generation, and framework or workspace overhead. Several tools help you observe this usage:
NVIDIA GPUs: nvidia-smi provides a snapshot of GPU memory usage and is useful for a quick check. The pynvml library (Python bindings for NVML) allows programmatic querying of memory usage within your script (a short sketch follows the PyTorch example below). Within PyTorch, torch.cuda.memory_allocated() reports the tensor memory currently allocated by PyTorch, while torch.cuda.max_memory_allocated() tracks the peak tensor allocation during execution. torch.cuda.memory_reserved() and torch.cuda.max_memory_reserved() report the total memory managed by PyTorch's caching allocator, which is often higher than just the allocated tensor memory due to fragmentation and caching.

import torch
# Assume model is loaded on GPU
# ... perform inference ...
peak_allocated_gb = torch.cuda.max_memory_allocated() / (1024**3)
peak_reserved_gb = torch.cuda.max_memory_reserved() / (1024**3)
print(f"Peak Tensor Memory Allocated: {peak_allocated_gb:.2f} GB")
print(f"Peak Memory Reserved by PyTorch: {peak_reserved_gb:.2f} GB")
CPU Memory: Standard OS tools like top or htop (Linux/macOS) or Task Manager (Windows) can monitor process memory. Python's psutil library provides programmatic access.
import psutil
import os
process = psutil.Process(os.getpid())
mem_info = process.memory_info()
print(f"RSS Memory: {mem_info.rss / (1024**2):.2f} MB") # Resident Set Size
Profiling Tools: Specialized tools and features within deployment frameworks (e.g., TensorRT profiler, vLLM monitoring endpoints) often provide more granular insights into memory usage patterns, including workspace size and activation memory.
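As one illustration, if your serving framework exposes a Prometheus-style metrics endpoint (vLLM's API server does, for example), you can poll it during a load test and filter for memory- and cache-related gauges. The URL below is an assumption for a local deployment; adjust it to match yours.

```python
import urllib.request

METRICS_URL = "http://localhost:8000/metrics"  # assumed local endpoint

with urllib.request.urlopen(METRICS_URL) as response:
    metrics_text = response.read().decode("utf-8")

# Print any memory- or cache-related metrics the server reports.
for line in metrics_text.splitlines():
    if not line.startswith("#") and ("memory" in line or "cache" in line):
        print(line)
```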
To get a reliable assessment, measure peak memory usage under realistic inference loads. A good practice is to load the model, run a warm-up pass, and then execute inference with representative batch sizes, prompt lengths, and generation lengths while recording the peak memory observed (a sketch follows the next paragraph).
It's essential to measure peak usage, as this determines the actual hardware requirement. Be aware that memory usage can fluctuate significantly during model loading, the first inference pass (due to kernel compilation or initialization), and during generation as the KV cache grows.
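Putting this together for a PyTorch model on a single GPU, a minimal measurement sketch might look like the following; run_workload is a placeholder for your own representative inference calls.

```python
import torch

def report_peak_gpu_memory(run_workload):
    """Execute a representative workload and print peak GPU memory in GB."""
    # Optional reset: excludes loading-time transients; omit it to keep them.
    torch.cuda.reset_peak_memory_stats()
    run_workload()              # warm-up plus realistic prompts and generation lengths
    torch.cuda.synchronize()    # ensure all kernels have finished before reading stats

    peak_allocated_gb = torch.cuda.max_memory_allocated() / (1024**3)
    peak_reserved_gb = torch.cuda.max_memory_reserved() / (1024**3)
    print(f"Peak Tensor Memory Allocated: {peak_allocated_gb:.2f} GB")
    print(f"Peak Memory Reserved by PyTorch: {peak_reserved_gb:.2f} GB")
```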
By carefully measuring both disk and runtime memory consumption, you gain a clear picture of the efficiency improvements offered by quantization. This data, combined with the latency, throughput, and accuracy evaluations discussed elsewhere in this chapter, provides the comprehensive understanding needed to decide whether a quantized model meets your deployment requirements. Remember that the most aggressive quantization, yielding the smallest memory footprint, is not always the best choice if it excessively compromises accuracy or sacrifices compatibility with optimized runtime kernels.