After quantizing a Large Language Model, the primary goal is usually to accelerate inference and reduce resource consumption. While the previous section discussed metrics in general, this section focuses specifically on measuring two fundamental performance indicators: latency and throughput. Understanding how to accurately benchmark these aspects is essential for evaluating the effectiveness of your quantization efforts and making informed decisions about deployment.
Latency refers to the time it takes to process a single inference request. For LLMs generating text, this can be nuanced:
Time to First Token (TTFT): The delay between submitting a prompt and receiving the first generated token. This largely determines perceived responsiveness in interactive applications.
Time Per Output Token (TPOT): The average time to generate each subsequent token, which governs how quickly the rest of the response streams in.
Total Generation Time: The end-to-end time to produce the full response, approximately TTFT plus TPOT multiplied by the number of generated tokens.
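The sketch below shows one way to capture these quantities around a streaming generator. Here generate_stream is a hypothetical callable that yields output tokens as they are produced, standing in for whatever streaming interface your framework provides.
import time

def measure_streaming_latency(generate_stream, prompt):
    # generate_stream is a hypothetical stand-in for a streaming generation API
    # that yields output tokens one at a time; assumes at least one token is produced.
    arrival_times = []
    start = time.perf_counter()
    for _token in generate_stream(prompt):
        arrival_times.append(time.perf_counter())

    ttft = arrival_times[0] - start    # time to first token
    total = arrival_times[-1] - start  # total generation time
    # Average time per output token after the first one
    tpot = (total - ttft) / max(len(arrival_times) - 1, 1)
    return {"ttft_s": ttft, "tpot_s": tpot, "total_s": total}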
Throughput measures the rate at which the system can process inference requests. It's typically expressed in:
Requests per second: The number of complete inference requests the system finishes per unit of time, which matters most when serving many concurrent users.
Tokens per second: The number of tokens processed (or generated) per unit of time, often the more informative figure for generative workloads where response lengths vary.
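Both figures follow from simple counts over a measured wall-clock window. A minimal sketch, with made-up example numbers:
def compute_throughput(num_requests, total_output_tokens, elapsed_seconds):
    # Requests completed per second over the measured window
    requests_per_second = num_requests / elapsed_seconds
    # Output tokens produced per second over the same window
    tokens_per_second = total_output_tokens / elapsed_seconds
    return requests_per_second, tokens_per_second

# Example: 64 requests producing 8192 output tokens in 4.0 s of wall-clock time
# gives 16.0 requests/s and 2048.0 tokens/s.
print(compute_throughput(64, 8192, 4.0))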
Optimizing for latency often focuses on minimizing the processing time for a single request, possibly at the expense of overall system utilization. Optimizing for throughput aims to maximize the number of requests or tokens processed concurrently, which might involve techniques like batching that can slightly increase the latency for individual requests. Quantization directly impacts the computational speed of operations within the model, influencing both latency and throughput.
Measuring latency accurately requires careful consideration of the measurement boundaries and potential sources of noise.
Isolate Inference Time: Ideally, you want to measure the time spent purely within the model's forward pass. This often excludes data preprocessing (tokenization) and post-processing (detokenization), although end-to-end latency measurements including these steps are also valuable for understanding the full user experience.
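As a minimal sketch of this separation, assuming a Hugging Face style tokenizer and model running on CPU (GPU timing needs the synchronization discussed below), each stage can be timed individually:
import time

def timed_stages(tokenizer, model, prompt, max_new_tokens=64):
    # Stage 1: preprocessing (tokenization)
    t0 = time.perf_counter()
    inputs = tokenizer(prompt, return_tensors="pt")
    t1 = time.perf_counter()

    # Stage 2: model inference (generation)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    t2 = time.perf_counter()

    # Stage 3: post-processing (detokenization)
    text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    t3 = time.perf_counter()

    return text, {
        "tokenize_s": t1 - t0,
        "generate_s": t2 - t1,
        "detokenize_s": t3 - t2,
        "end_to_end_s": t3 - t0,
    }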
Warm-up Runs: Execute several inference requests before starting measurements. The initial runs might incur overheads like model loading, kernel compilation (especially on GPUs), or cache population, which do not reflect the steady-state performance. Discard the timings from these warm-up iterations.
Multiple Measurements and Statistics: Single timings can be noisy. Run the inference for the same input multiple times (e.g., 100 or 1000 iterations) and calculate descriptive statistics: the mean and median latency, the standard deviation, and tail percentiles such as P95 and P99, which capture the worst-case behavior that averages hide.
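For instance, given a list of per-run latencies in milliseconds, NumPy provides these statistics directly:
import numpy as np

latencies_ms = [12.1, 11.8, 12.4, 35.0, 12.0]  # illustrative per-run timings

print(f"mean:   {np.mean(latencies_ms):.2f} ms")
print(f"median: {np.median(latencies_ms):.2f} ms")
print(f"stddev: {np.std(latencies_ms):.2f} ms")
print(f"P95:    {np.percentile(latencies_ms, 95):.2f} ms")
print(f"P99:    {np.percentile(latencies_ms, 99):.2f} ms")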
Hardware-Specific Timing: For CPU inference, use time.perf_counter() for high-resolution timing. For GPU inference, Python timers such as time.perf_counter() are insufficient on their own, because GPU operations execute asynchronously: the CPU might finish submitting work long before the GPU completes it. Use GPU-specific synchronization mechanisms. For instance, in PyTorch:
import torch
import time
# Assume model and input_data are already on the target GPU device
# model = model.to('cuda')
# input_data = input_data.to('cuda')
# Warm-up runs
for _ in range(10):
    _ = model(input_data)
torch.cuda.synchronize() # Ensure warm-up is complete
# Using CUDA events for accurate timing
start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)
num_runs = 100
latencies = []
for _ in range(num_runs):
    start_event.record()
    _ = model(input_data)  # The operation to time
    end_event.record()

    # Wait for GPU operations to complete before reading the timer
    torch.cuda.synchronize()

    # Calculate elapsed time in milliseconds
    latency_ms = start_event.elapsed_time(end_event)
    latencies.append(latency_ms)
avg_latency = sum(latencies) / num_runs
print(f"Average latency over {num_runs} runs: {avg_latency:.3f} ms")
# Calculate P95, P99 etc. if needed using numpy or similar
# import numpy as np
# p95_latency = np.percentile(latencies, 95)
# print(f"P95 latency: {p95_latency:.3f} ms")
This torch.cuda.Event approach accurately measures the time elapsed on the GPU between the two recorded points. Always remember to call torch.cuda.synchronize() so that the CPU waits for the GPU to finish before reading the elapsed time or calculating statistics.
Throughput measurement typically involves simulating concurrent requests or processing batches of inputs.
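As an illustration, the sketch below times repeated batched forward passes on the GPU to estimate throughput for a single batch size. It assumes model and a batch of tokenized input_ids are already on the device, and it counts the input tokens pushed through each forward pass; for generation workloads you would count generated tokens instead. A real benchmark would repeat this across batch sizes and concurrency levels.
import time
import torch

def measure_throughput(model, input_ids, num_runs=50):
    # input_ids: tensor of shape (batch_size, seq_len), already on the GPU
    batch_size, seq_len = input_ids.shape

    # Warm-up, then make sure all queued GPU work has finished
    for _ in range(5):
        _ = model(input_ids)
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(num_runs):
        _ = model(input_ids)
    torch.cuda.synchronize()  # wait for the GPU before stopping the clock
    elapsed = time.perf_counter() - start

    requests_per_second = (num_runs * batch_size) / elapsed
    tokens_per_second = (num_runs * batch_size * seq_len) / elapsed
    return requests_per_second, tokens_per_second

# Example sweep (make_batch is a hypothetical helper that builds a padded batch):
# for bs in (1, 4, 16, 64):
#     print(bs, measure_throughput(model, make_batch(bs)))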
This chart demonstrates how throughput typically increases with batch size for both baseline (FP16) and quantized (INT8) models, with the quantized model achieving higher throughput, especially at larger batch sizes. Saturation may occur at even larger batch sizes not shown here.
Be aware that latency and throughput are not fixed numbers; they depend heavily on:
Hardware: The specific GPU or CPU, its memory capacity and bandwidth, and any accelerator features it exposes.
Software Stack: Quantization library (e.g., bitsandbytes, AutoGPTQ), inference server (e.g., vLLM, TensorRT-LLM, TGI), CUDA version, and driver versions. Optimized kernels in frameworks like TensorRT-LLM can drastically improve performance for specific quantization formats (like INT4 or INT8) on supported hardware.
Workload: Input and output sequence lengths, batch size, and concurrency level.

When reporting results, always specify the complete hardware and software environment, the model used, the quantization method, and the exact workload parameters (input/output length, batch size, concurrency level) to ensure reproducibility and fair comparison. Consistent and rigorous measurement is fundamental to understanding the real-world benefits and trade-offs of deploying quantized LLMs.
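To make that reporting habit concrete, one lightweight option is to store the benchmark context alongside the measured numbers, for example as a small dictionary saved with the results. The field names and values below are illustrative, not a standard schema.
benchmark_report = {
    "model": "example-7b-instruct",          # illustrative model name
    "quantization": "INT8 weight-only",      # method and format
    "hardware": "1x NVIDIA A100 80GB",       # GPU/CPU and memory
    "software": {
        "inference_server": "vLLM",          # record the exact version used
        "cuda": "<CUDA version>",
        "driver": "<driver version>",
    },
    "workload": {
        "input_tokens": 512,
        "output_tokens": 128,
        "batch_size": 16,
        "concurrency": 8,
    },
    "results": {
        "p50_latency_ms": None,              # fill in measured values
        "p95_latency_ms": None,
        "tokens_per_second": None,
    },
}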