Quantifying the benefits gained from quantization requires careful measurement of inference speed and memory consumption. While the previous section discussed evaluating model quality (like perplexity or task accuracy), here we focus on measuring the efficiency improvements, which are often the primary motivation for quantizing LLMs.
Inference speed tells us how quickly the model can process input and generate output. Two primary metrics are used: latency and throughput.
Latency

Latency is the time taken to process a single input request. It's typically measured in milliseconds (ms) per token or milliseconds per request. Lower latency is better, signifying a faster response time, which is important for interactive applications like chatbots.
To measure latency reliably, use time.perf_counter() in Python for high-resolution timing, run a few warm-up inferences first so that one-time costs (such as kernel compilation or cache population) do not skew the results, and average over many timed runs:

import time

# Conceptual example - replace with actual model inference call
def run_inference(model, inputs):
    # Simulate model processing time
    time.sleep(0.1)
    return "output"

# Placeholders - replace with your loaded model and a representative prompt
model = "model_object"
sample_input = "sample prompt text"

# --- Warm-up Phase ---
print("Warming up...")
for _ in range(5):
    _ = run_inference(model, sample_input)

# --- Measurement Phase ---
latencies = []
num_runs = 50
print(f"Measuring latency over {num_runs} runs...")
for i in range(num_runs):
    start_time = time.perf_counter()
    _ = run_inference(model, sample_input)
    end_time = time.perf_counter()
    latencies.append((end_time - start_time) * 1000)  # Convert to milliseconds

avg_latency = sum(latencies) / len(latencies)
print(f"Average Latency: {avg_latency:.2f} ms per request")
Throughput

Throughput measures the number of requests the model can handle in a given period, often expressed as requests per second or tokens per second. High throughput is desirable for applications serving many users concurrently or processing large datasets offline.
Throughput is heavily influenced by batching, where multiple input requests are processed simultaneously. Larger batch sizes generally increase throughput up to a point, limited by available hardware resources (like GPU memory).
To measure throughput:
import time

# Conceptual example - replace with actual batched inference
def run_batched_inference(model, batch_inputs):
    # Simulate processing a batch
    time.sleep(0.5)  # Time depends on batch size
    return ["output"] * len(batch_inputs)

# Placeholders - replace with your loaded model and representative prompts
model = "model_object"
sample_input = "sample prompt text"

batch_size = 8
num_batches = 20
total_requests = batch_size * num_batches

print(f"Measuring throughput with batch size {batch_size}...")

# Warm-up (optional but recommended for batched runs too)
_ = run_batched_inference(model, [sample_input] * batch_size)

start_total_time = time.perf_counter()
for i in range(num_batches):
    # Assume batch_input is ready
    batch_input = [sample_input] * batch_size
    _ = run_batched_inference(model, batch_input)
end_total_time = time.perf_counter()

total_time_taken = end_total_time - start_total_time
throughput = total_requests / total_time_taken

print(f"Processed {total_requests} requests in {total_time_taken:.2f} seconds.")
print(f"Throughput: {throughput:.2f} requests/second")
Quantization typically reduces latency and increases potential throughput because integer operations (like INT8 or INT4) are generally faster than floating-point operations (FP16, FP32) on compatible hardware, and smaller data types reduce memory bandwidth requirements.
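A rough calculation shows why the data-type width matters for bandwidth: during autoregressive decoding at batch size 1, essentially all model weights are read for each generated token, so the bytes moved scale with bits per weight. The figures below are illustrative only and assume a hypothetical 7-billion-parameter model:

# Illustrative arithmetic for a hypothetical 7B-parameter model.
params = 7e9

bytes_per_weight = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

for fmt, nbytes in bytes_per_weight.items():
    weight_gb = params * nbytes / 1e9
    print(f"{fmt}: ~{weight_gb:.1f} GB of weights read per generated token (batch size 1, approximate)")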
Chart: average latency per request for different quantization levels on representative hardware; lower values indicate faster single-request processing.
Quantization's primary benefit is often reduced memory footprint. This includes both model weight storage and activation memory during inference.
Model Size (Storage)

This is the easiest metric: simply check the size of the saved model file(s) on disk. Quantized models (e.g., GGUF or GPTQ files) are significantly smaller than their FP16 or FP32 counterparts, roughly in proportion to the reduction in bit width (e.g., an INT4 model should be about 1/4 the size of its FP16 version).
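A small sketch of this check is shown below. The directory paths are hypothetical placeholders; model checkpoints are often sharded across several files, so the helper sums everything under each directory:

import os

def dir_size_gb(path):
    """Total size of all files under `path`, in gigabytes."""
    total = 0
    for root, _, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total / 1e9

# Hypothetical paths - point these at your own checkpoint directories
fp16_size = dir_size_gb("models/llama-7b-fp16")
int4_size = dir_size_gb("models/llama-7b-int4-gptq")

print(f"FP16 checkpoint: {fp16_size:.2f} GB")
print(f"INT4 checkpoint: {int4_size:.2f} GB")
print(f"Compression ratio: {fp16_size / int4_size:.2f}x")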
Inference Memory (Peak Usage)

This measures the maximum RAM (CPU) or VRAM (GPU) consumed while the model is running inference. It is often more critical than disk size, since it determines the hardware required to run the model.
To measure peak memory usage:
GPU memory (VRAM): Use nvidia-smi (for NVIDIA GPUs) or library-specific functions (e.g., PyTorch's torch.cuda.max_memory_allocated()) to monitor VRAM usage during inference. Run inference with a typical input and record the peak value.
# Terminal command to watch GPU memory while your script runs
watch -n 0.5 nvidia-smi
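To capture the same numbers from a script rather than watching interactively, you can call nvidia-smi's query interface from Python. A minimal helper, assuming nvidia-smi is on your PATH (the function name is just illustrative):

import subprocess

def gpu_memory_used_mib(gpu_index=0):
    """Return the currently used VRAM (in MiB) for one GPU, as reported by nvidia-smi."""
    result = subprocess.run(
        [
            "nvidia-smi",
            f"--id={gpu_index}",
            "--query-gpu=memory.used",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())

print(f"GPU memory currently used: {gpu_memory_used_mib()} MiB")

Polling this value while inference runs (for example from a background thread, similar to the RAM sampler shown later in this section) lets you record peak VRAM even for runtimes whose allocations PyTorch's statistics cannot see.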
System memory (RAM): Use system monitoring tools (htop on Linux, Activity Monitor on macOS, Task Manager on Windows) or Python libraries like psutil to track the memory usage of the inference process.
import psutil
import os
import time
import torch  # Assuming a PyTorch context

# Conceptual example - replace with actual model loading and inference
def load_model():
    # Simulate loading a large model
    time.sleep(2)
    return "model_object"

def run_inference(model, inputs):
    # Simulate an inference memory spike
    _ = torch.randn(1024, 1024, device='cuda' if torch.cuda.is_available() else 'cpu')  # Example allocation
    time.sleep(0.1)
    return "output"

process = psutil.Process(os.getpid())

# Measure memory before loading the model
mem_before_load = process.memory_info().rss / (1024 * 1024)  # MB
print(f"Memory before load: {mem_before_load:.2f} MB")

if torch.cuda.is_available():
    torch.cuda.reset_peak_memory_stats()
    vram_before_load = torch.cuda.memory_allocated() / (1024 * 1024)  # MB
    print(f"VRAM before load: {vram_before_load:.2f} MB")

# Load the model and measure memory after loading
model = load_model()
mem_after_load = process.memory_info().rss / (1024 * 1024)  # MB
print(f"Memory after load: {mem_after_load:.2f} MB")

if torch.cuda.is_available():
    vram_after_load = torch.cuda.memory_allocated() / (1024 * 1024)  # MB
    print(f"VRAM after load: {vram_after_load:.2f} MB")

# Run inference and measure memory afterwards
_ = run_inference(model, "sample_input")
mem_peak_inference = process.memory_info().rss / (1024 * 1024)  # MB (snapshot after the run)
print(f"Memory after inference run: {mem_peak_inference:.2f} MB")

if torch.cuda.is_available():
    peak_vram_inference = torch.cuda.max_memory_allocated() / (1024 * 1024)  # MB
    print(f"Peak VRAM during inference: {peak_vram_inference:.2f} MB")

# Note: for a precise peak RAM figure, you may need to monitor from a separate
# thread or process during the run_inference call itself, since
# 'mem_peak_inference' above is only a snapshot taken after the run finished.
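That closing note matters in practice: CPU memory can rise and fall while tokens are being generated, so a snapshot taken after the call may miss the true peak. One way to capture it is to sample RSS from a background thread while inference runs. A minimal sketch using psutil and threading (the helper name and sampling interval are illustrative):

import os
import threading
import time

import psutil

def measure_peak_rss_mb(fn, *args, interval_s=0.01, **kwargs):
    """Run fn(*args, **kwargs) while sampling this process's RSS; return (result, peak in MB)."""
    process = psutil.Process(os.getpid())
    peak = process.memory_info().rss
    stop = threading.Event()

    def sampler():
        nonlocal peak
        while not stop.is_set():
            peak = max(peak, process.memory_info().rss)
            time.sleep(interval_s)

    thread = threading.Thread(target=sampler, daemon=True)
    thread.start()
    try:
        result = fn(*args, **kwargs)
    finally:
        stop.set()
        thread.join()
    return result, peak / (1024 * 1024)

# Example usage with the run_inference function and model defined above
_, peak_rss_mb = measure_peak_rss_mb(run_inference, model, "sample_input")
print(f"Peak RSS during inference: {peak_rss_mb:.2f} MB")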
Lower bit precision reduces the memory needed for both weights and activations, allowing larger models to fit on the same hardware or the same model to run on less powerful hardware.
Chart: peak GPU VRAM consumption during a sample inference task for different quantization levels; lower values enable running models on GPUs with less memory.
When comparing the performance of a quantized model against its baseline or other quantized versions, consistency is important: use the same hardware, the same software stack (framework and driver versions), the same batch sizes, and the same input and output lengths for every configuration, and repeat runs enough times to average out noise.
By carefully measuring speed and memory usage under controlled conditions, you can accurately quantify the efficiency gains from quantization. These measurements, combined with the accuracy evaluations discussed previously, provide the data needed to make informed decisions about the trade-offs involved and choose the best quantized model for your specific deployment scenario.