After quantizing a Large Language Model, the primary goal is usually to accelerate inference and reduce resource consumption. While the previous section discussed metrics in general, this section focuses specifically on measuring two fundamental performance indicators: latency and throughput. Understanding how to accurately benchmark these aspects is essential for evaluating the effectiveness of your quantization efforts and making informed decisions about deployment.
Latency refers to the time it takes to process a single inference request. For LLMs generating text, this can be nuanced:
Time to First Token (TTFT): The delay between submitting a prompt and receiving the first generated token. This largely determines perceived responsiveness in interactive applications.
Time Per Output Token (TPOT): The average time to generate each subsequent token, which governs how quickly the rest of the response streams in.
Total Generation Time: The end-to-end time to produce the full response, approximately TTFT plus TPOT multiplied by the number of generated tokens.
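The sketch below shows one way to capture these quantities around a streaming generator. Here generate_stream is a hypothetical callable that yields output tokens as they are produced, standing in for whatever streaming interface your framework provides.
import time

def measure_streaming_latency(generate_stream, prompt):
    # generate_stream is a hypothetical stand-in for a streaming generation API
    # that yields output tokens one at a time; assumes at least one token is produced.
    arrival_times = []
    start = time.perf_counter()
    for _token in generate_stream(prompt):
        arrival_times.append(time.perf_counter())

    ttft = arrival_times[0] - start    # time to first token
    total = arrival_times[-1] - start  # total generation time
    # Average time per output token after the first one
    tpot = (total - ttft) / max(len(arrival_times) - 1, 1)
    return {"ttft_s": ttft, "tpot_s": tpot, "total_s": total}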
Throughput measures the rate at which the system can process inference requests. It's typically expressed in:
Requests per second: The number of complete inference requests the system finishes per unit of time, which matters most when serving many concurrent users.
Tokens per second: The number of tokens processed (or generated) per unit of time, often the more informative figure for generative workloads where response lengths vary.
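Both figures follow from simple counts over a measured wall-clock window. A minimal sketch, with made-up example numbers:
def compute_throughput(num_requests, total_output_tokens, elapsed_seconds):
    # Requests completed per second over the measured window
    requests_per_second = num_requests / elapsed_seconds
    # Output tokens produced per second over the same window
    tokens_per_second = total_output_tokens / elapsed_seconds
    return requests_per_second, tokens_per_second

# Example: 64 requests producing 8192 output tokens in 4.0 s of wall-clock time
# gives 16.0 requests/s and 2048.0 tokens/s.
print(compute_throughput(64, 8192, 4.0))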
Optimizing for latency often focuses on minimizing the processing time for a single request, possibly at the expense of overall system utilization. Optimizing for throughput aims to maximize the number of requests or tokens processed concurrently, which might involve techniques like batching that can slightly increase the latency for individual requests. Quantization directly impacts the computational speed of operations within the model, influencing both latency and throughput.
Measuring latency accurately requires careful consideration of the measurement boundaries and potential sources of noise.
Isolate Inference Time: Ideally, you want to measure the time spent purely within the model's forward pass. This often excludes data preprocessing (tokenization) and post-processing (detokenization), although end-to-end latency measurements including these steps are also valuable for understanding the full user experience.
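As a minimal sketch of this separation, assuming a Hugging Face style tokenizer and model running on CPU (GPU timing needs the synchronization discussed below), each stage can be timed individually:
import time

def timed_stages(tokenizer, model, prompt, max_new_tokens=64):
    # Stage 1: preprocessing (tokenization)
    t0 = time.perf_counter()
    inputs = tokenizer(prompt, return_tensors="pt")
    t1 = time.perf_counter()

    # Stage 2: model inference (generation)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    t2 = time.perf_counter()

    # Stage 3: post-processing (detokenization)
    text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    t3 = time.perf_counter()

    return text, {
        "tokenize_s": t1 - t0,
        "generate_s": t2 - t1,
        "detokenize_s": t3 - t2,
        "end_to_end_s": t3 - t0,
    }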
Warm-up Runs: Execute several inference requests before starting measurements. The initial runs might incur overheads like model loading, kernel compilation (especially on GPUs), or cache population, which do not reflect the steady-state performance. Discard the timings from these warm-up iterations.
Multiple Measurements and Statistics: Single timings can be noisy. Run the inference for the same input multiple times (e.g., 100 or 1000 iterations) and calculate descriptive statistics: the mean and median latency, the standard deviation, and tail percentiles such as P95 and P99, which capture the worst-case behavior that averages hide.
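For instance, given a list of per-run latencies in milliseconds, NumPy provides these statistics directly:
import numpy as np

latencies_ms = [12.1, 11.8, 12.4, 35.0, 12.0]  # illustrative per-run timings

print(f"mean:   {np.mean(latencies_ms):.2f} ms")
print(f"median: {np.median(latencies_ms):.2f} ms")
print(f"stddev: {np.std(latencies_ms):.2f} ms")
print(f"P95:    {np.percentile(latencies_ms, 95):.2f} ms")
print(f"P99:    {np.percentile(latencies_ms, 99):.2f} ms")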
Hardware-Specific Timing: For CPU inference, use time.perf_counter() for high-resolution timing. For GPU inference, Python timers such as time.perf_counter() are insufficient on their own, because GPU operations execute asynchronously: the CPU might finish submitting work long before the GPU completes it. Use GPU-specific synchronization mechanisms. For instance, in PyTorch:
import torch
import time
# Assume model and input_data are already on the target GPU device
# model = model.to('cuda')
# input_data = input_data.to('cuda')
# Warm-up runs
for _ in range(10):
    _ = model(input_data)
torch.cuda.synchronize() # Ensure warm-up is complete
# Using CUDA events for accurate timing
start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)
num_runs = 100
latencies = []
for _ in range(num_runs):
    start_event.record()
    _ = model(input_data)  # The operation to time
    end_event.record()

    # Wait for GPU operations to complete before reading the timer
    torch.cuda.synchronize()

    # Calculate elapsed time in milliseconds
    latency_ms = start_event.elapsed_time(end_event)
    latencies.append(latency_ms)
avg_latency = sum(latencies) / num_runs
print(f"Average latency over {num_runs} runs: {avg_latency:.3f} ms")
# Calculate P95, P99 etc. if needed using numpy or similar
# import numpy as np
# p95_latency = np.percentile(latencies, 95)
# print(f"P95 latency: {p95_latency:.3f} ms")
This torch.cuda.Event approach accurately measures the time elapsed on the GPU between the two recorded points. Always remember to call torch.cuda.synchronize() so that the CPU waits for the GPU to finish before reading the elapsed time or calculating statistics.
Throughput measurement typically involves simulating concurrent requests or processing batches of inputs.
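As an illustration, the sketch below times repeated batched forward passes on the GPU to estimate throughput for a single batch size. It assumes model and a batch of tokenized input_ids are already on the device, and it counts the input tokens pushed through each forward pass; for generation workloads you would count generated tokens instead. A real benchmark would repeat this across batch sizes and concurrency levels.
import time
import torch

def measure_throughput(model, input_ids, num_runs=50):
    # input_ids: tensor of shape (batch_size, seq_len), already on the GPU
    batch_size, seq_len = input_ids.shape

    # Warm-up, then make sure all queued GPU work has finished
    for _ in range(5):
        _ = model(input_ids)
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(num_runs):
        _ = model(input_ids)
    torch.cuda.synchronize()  # wait for the GPU before stopping the clock
    elapsed = time.perf_counter() - start

    requests_per_second = (num_runs * batch_size) / elapsed
    tokens_per_second = (num_runs * batch_size * seq_len) / elapsed
    return requests_per_second, tokens_per_second

# Example sweep (make_batch is a hypothetical helper that builds a padded batch):
# for bs in (1, 4, 16, 64):
#     print(bs, measure_throughput(model, make_batch(bs)))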
This chart demonstrates how throughput typically increases with batch size for both baseline (FP16) and quantized (INT8) models, with the quantized model achieving higher throughput, especially at larger batch sizes. Saturation may occur at even larger batch sizes not shown here.
Be aware that latency and throughput are not fixed numbers; they depend heavily on:
Hardware: The specific GPU or CPU, its memory capacity and bandwidth, and any accelerator features it exposes.
Software Stack: Quantization library (e.g., bitsandbytes, AutoGPTQ), inference server (e.g., vLLM, TensorRT-LLM, TGI), CUDA version, and driver versions. Optimized kernels in frameworks like TensorRT-LLM can drastically improve performance for specific quantization formats (like INT4 or INT8) on supported hardware.
Workload: Input and output sequence lengths, batch size, and concurrency level.

When reporting results, always specify the complete hardware and software environment, the model used, the quantization method, and the exact workload parameters (input/output length, batch size, concurrency level) to ensure reproducibility and fair comparison. Consistent and rigorous measurement is fundamental to understanding the real-world benefits and trade-offs of deploying quantized LLMs.
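To make that reporting habit concrete, one lightweight option is to store the benchmark context alongside the measured numbers, for example as a small dictionary saved with the results. The field names and values below are illustrative, not a standard schema.
benchmark_report = {
    "model": "example-7b-instruct",          # illustrative model name
    "quantization": "INT8 weight-only",      # method and format
    "hardware": "1x NVIDIA A100 80GB",       # GPU/CPU and memory
    "software": {
        "inference_server": "vLLM",          # record the exact version used
        "cuda": "<CUDA version>",
        "driver": "<driver version>",
    },
    "workload": {
        "input_tokens": 512,
        "output_tokens": 128,
        "batch_size": 16,
        "concurrency": 8,
    },
    "results": {
        "p50_latency_ms": None,              # fill in measured values
        "p95_latency_ms": None,
        "tokens_per_second": None,
    },
}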