A significant motivation for quantizing large language models is the potential for substantial memory savings. Reduced memory consumption allows models to fit onto hardware with less RAM (like consumer GPUs or edge devices), enables larger batch sizes during inference, or facilitates the deployment of bigger, more capable models on existing infrastructure. Evaluating this memory reduction involves looking at two distinct aspects: the storage space required for the model files on disk and the dynamic memory consumed during runtime operation.
The most direct benefit of quantization is the reduction in the model's file size. When you convert model parameters (primarily weights) from higher-precision formats like 32-bit floating-point (FP32) or 16-bit floating-point (FP16) to lower-precision formats like 8-bit integer (INT8) or 4-bit integer (INT4), each parameter requires fewer bits for storage.
The theoretical reduction is straightforward to calculate. For instance, quantizing an FP16 model (16 bits per parameter) to INT4 (4 bits per parameter) should ideally result in a file size reduction of approximately 75%.
$$\text{Reduction Factor} = 1 - \frac{\text{Quantized Bits}}{\text{Original Bits}}$$

Example (FP16 to INT4): $\text{Reduction} = 1 - \frac{4}{16} = 0.75$, or 75%.

In practice, you measure this by comparing the file sizes of the original and quantized models. Standard operating system commands or simple scripts suffice.
# Check file size on Linux/macOS
ls -lh model_fp16.safetensors
ls -lh model_int4.safetensors
# Or compute the reduction programmatically in Python
import os

fp16_path = "path/to/model_fp16.safetensors"  # placeholder paths: point these
int4_path = "path/to/model_int4.safetensors"  # at your actual model files

# Compare on-disk sizes and compute the relative reduction
fp16_size_bytes = os.path.getsize(fp16_path)
int4_size_bytes = os.path.getsize(int4_path)
reduction_percentage = (1 - int4_size_bytes / fp16_size_bytes) * 100
print(f"FP16 Model Size: {fp16_size_bytes / (1024**3):.2f} GB")
print(f"INT4 Model Size: {int4_size_bytes / (1024**3):.2f} GB")
print(f"Disk Size Reduction: {reduction_percentage:.2f}%")
Keep in mind that the actual reduction might deviate slightly from the theoretical value. Reasons include quantization metadata stored alongside the weights (such as per-group scales and zero-points), tensors like embeddings or normalization layers that are often kept at higher precision, and the file format itself: formats like .gguf or .safetensors store quantized weights efficiently, but the exact size depends on the specific quantization method and parameters used.

Relative model size compared to a baseline FP16 model for common quantization bit widths.
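The theoretical ratios behind such a comparison are easy to compute. The sketch below prints the relative size and reduction for common bit widths against an FP16 baseline; it deliberately ignores metadata and any tensors kept at higher precision, so real files will typically be slightly larger.

```python
# Theoretical model size relative to an FP16 baseline.
# Ignores metadata and tensors kept at higher precision,
# so real quantized files are usually a bit larger.
FP16_BITS = 16

for bits in (8, 6, 4, 3, 2):
    relative_size = bits / FP16_BITS
    reduction = 1 - relative_size
    print(f"{bits}-bit: {relative_size:.0%} of FP16 size ({reduction:.0%} reduction)")
```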
While disk size reduction is important, the memory consumed during inference execution (runtime memory) is often the more critical factor, especially on memory-constrained hardware like GPUs. Measuring it is more complex because runtime memory encompasses several components: the model weights themselves, intermediate activations, the key-value (KV) cache that grows during generation, and framework or workspace overhead. Several tools help you observe this usage:
NVIDIA GPUs: nvidia-smi provides a snapshot of GPU memory usage and is useful for a quick check. The pynvml library (Python bindings for NVML) allows programmatic querying of memory usage within your script (a short sketch follows the PyTorch example below). Within PyTorch, torch.cuda.memory_allocated() reports the tensor memory currently allocated by PyTorch, while torch.cuda.max_memory_allocated() tracks the peak tensor allocation during execution. torch.cuda.memory_reserved() and torch.cuda.max_memory_reserved() report the total memory managed by PyTorch's caching allocator, which is often higher than just the allocated tensor memory due to fragmentation and caching.

import torch
# Assume model is loaded on GPU
# ... perform inference ...
peak_allocated_gb = torch.cuda.max_memory_allocated() / (1024**3)
peak_reserved_gb = torch.cuda.max_memory_reserved() / (1024**3)
print(f"Peak Tensor Memory Allocated: {peak_allocated_gb:.2f} GB")
print(f"Peak Memory Reserved by PyTorch: {peak_reserved_gb:.2f} GB")
CPU Memory: Standard OS tools like top or htop (Linux/macOS) or Task Manager (Windows) can monitor process memory. Python's psutil library provides programmatic access.
import psutil
import os
process = psutil.Process(os.getpid())
mem_info = process.memory_info()
print(f"RSS Memory: {mem_info.rss / (1024**2):.2f} MB") # Resident Set Size
Profiling Tools: Specialized tools and features within deployment frameworks (e.g., TensorRT profiler, vLLM monitoring endpoints) often provide more granular insights into memory usage patterns, including workspace size and activation memory.
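As one illustration, if your serving framework exposes a Prometheus-style metrics endpoint (vLLM's API server does, for example), you can poll it during a load test and filter for memory- and cache-related gauges. The URL below is an assumption for a local deployment; adjust it to match yours.

```python
import urllib.request

METRICS_URL = "http://localhost:8000/metrics"  # assumed local endpoint

with urllib.request.urlopen(METRICS_URL) as response:
    metrics_text = response.read().decode("utf-8")

# Print any memory- or cache-related metrics the server reports.
for line in metrics_text.splitlines():
    if not line.startswith("#") and ("memory" in line or "cache" in line):
        print(line)
```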
To get a reliable assessment, measure peak memory usage under realistic inference loads. A good practice is to load the model, run a warm-up pass, and then execute inference with representative batch sizes, prompt lengths, and generation lengths while recording the peak memory observed (a sketch follows the next paragraph).
It's essential to measure peak usage, as this determines the actual hardware requirement. Be aware that memory usage can fluctuate significantly during model loading, the first inference pass (due to kernel compilation or initialization), and during generation as the KV cache grows.
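Putting this together for a PyTorch model on a single GPU, a minimal measurement sketch might look like the following; run_workload is a placeholder for your own representative inference calls.

```python
import torch

def report_peak_gpu_memory(run_workload):
    """Execute a representative workload and print peak GPU memory in GB."""
    # Optional reset: excludes loading-time transients; omit it to keep them.
    torch.cuda.reset_peak_memory_stats()
    run_workload()              # warm-up plus realistic prompts and generation lengths
    torch.cuda.synchronize()    # ensure all kernels have finished before reading stats

    peak_allocated_gb = torch.cuda.max_memory_allocated() / (1024**3)
    peak_reserved_gb = torch.cuda.max_memory_reserved() / (1024**3)
    print(f"Peak Tensor Memory Allocated: {peak_allocated_gb:.2f} GB")
    print(f"Peak Memory Reserved by PyTorch: {peak_reserved_gb:.2f} GB")
```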
By carefully measuring both disk and runtime memory consumption, you gain a clear picture of the efficiency improvements offered by quantization. This data, combined with the latency, throughput, and accuracy evaluations discussed elsewhere in this chapter, provides the comprehensive understanding needed to decide whether a quantized model meets your deployment requirements. Remember that the most aggressive quantization, yielding the smallest memory footprint, is not always the best choice if it excessively compromises accuracy or sacrifices compatibility with optimized runtime kernels.