Now that we've discussed the theoretical underpinnings of hardware mapping, memory management, optimized kernels, and compiler techniques, let's put these concepts into practice. This section provides hands-on experience using specialized inference runtimes to accelerate LLM performance on target hardware. Theoretical model optimization (like quantization or pruning) achieves its full potential only when paired with a runtime capable of efficiently executing the optimized model graph on the hardware. We will explore how runtimes like ONNX Runtime, NVIDIA TensorRT, and vLLM translate optimized models into faster execution.
Our goal is to take a pre-trained transformer model and measure its inference performance using different runtime environments, comparing them to a baseline implementation in a standard deep learning framework like PyTorch. We will focus on GPU acceleration, a common scenario for LLM deployment.
Let's assume we have a moderately sized transformer model (e.g., a distilled version of GPT-2 or a similar architecture, potentially already quantized to INT8 using methods from Chapter 2) available in PyTorch format. Our first step is to establish a baseline performance metric. This typically involves loading the model in PyTorch and running inference on a representative batch of input data on our target GPU, carefully measuring latency or throughput.
# Placeholder: Baseline PyTorch Inference Measurement
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and tokenizer and move the model to the GPU.
# 'distilgpt2' is just an example; substitute your own (possibly quantized) checkpoint.
model_name = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.to('cuda')
model.eval()

# Example input
inputs = tokenizer("Example input text", return_tensors="pt").to('cuda')
input_ids = inputs['input_ids']

# --- Baseline Measurement ---
num_runs = 50
warmup_runs = 5
latencies = []

with torch.no_grad():
    # Warmup
    for _ in range(warmup_runs):
        _ = model.generate(input_ids, max_new_tokens=50)  # Or model(**inputs) for classification
    # Timed runs
    for _ in range(num_runs):
        torch.cuda.synchronize()  # Ensure accurate timing on GPU
        start_time = time.perf_counter()
        _ = model.generate(input_ids, max_new_tokens=50)  # Or model(**inputs)
        torch.cuda.synchronize()
        end_time = time.perf_counter()
        latencies.append((end_time - start_time) * 1000)  # milliseconds

baseline_avg_latency = sum(latencies) / num_runs
print(f"Baseline PyTorch Avg. Latency: {baseline_avg_latency:.2f} ms")
This baseline provides a reference point against which we can evaluate the improvements offered by specialized runtimes.
ONNX Runtime (ORT) is a cross-platform inference and training accelerator that supports models from various frameworks (PyTorch, TensorFlow, scikit-learn, etc.) through the Open Neural Network Exchange (ONNX) format. ORT applies graph optimizations and leverages hardware-specific acceleration libraries called Execution Providers (EPs).
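Before choosing an Execution Provider, it is worth checking which ones your ONNX Runtime installation actually supports. A quick sanity check (the GPU package is needed for the CUDA EP to appear):

# List the Execution Providers available in this ONNX Runtime installation
import onnxruntime as ort

print(ort.get_available_providers())
# e.g. ['CUDAExecutionProvider', 'CPUExecutionProvider'] with the GPU package installed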
Model Conversion: The first step is to export our PyTorch model to the ONNX format. The torch.onnx.export function facilitates this. We need to provide an example input to trace the model's execution graph.
# Placeholder: Export PyTorch model to ONNX
# An example input is needed to trace the model's execution graph
dummy_input = tokenizer("Example input text", return_tensors="pt")['input_ids'].to('cuda')

torch.onnx.export(model,                     # model being run
                  dummy_input,               # model input (or tuple for multiple inputs)
                  "model.onnx",              # where to save the model
                  export_params=True,        # store the trained parameter weights inside the model file
                  opset_version=14,          # the ONNX version to export the model to
                  do_constant_folding=True,  # execute constant folding for optimization
                  input_names=['input_ids'], # specify input names
                  output_names=['output'])   # specify output names

print("Model exported to model.onnx")
Note: Handling dynamic axes is important for variable input sizes (batch, sequence length) but adds complexity. For simplicity, we might start with fixed shapes.
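If you do need variable shapes, the export call accepts a dynamic_axes mapping. A minimal sketch of the same export with dynamic batch and sequence dimensions (the file name model_dynamic.onnx is just illustrative):

# Same export as above, but marking the batch and sequence dimensions as dynamic
torch.onnx.export(model, dummy_input, "model_dynamic.onnx",
                  export_params=True, opset_version=14, do_constant_folding=True,
                  input_names=['input_ids'], output_names=['output'],
                  dynamic_axes={'input_ids': {0: 'batch_size', 1: 'sequence'},
                                'output': {0: 'batch_size'}})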
Inference with ONNX Runtime: We can now load the ONNX model and run inference using ORT, specifying the desired Execution Provider. For GPU acceleration, we use the CUDAExecutionProvider.
# Placeholder: ONNX Runtime Inference
import onnxruntime as ort
import numpy as np

# Load the ONNX model
sess_options = ort.SessionOptions()
# Optional: Enable graph optimizations
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
ort_session = ort.InferenceSession("model.onnx", sess_options,
                                   providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])

# Prepare input (convert PyTorch tensor to NumPy)
numpy_input = inputs['input_ids'].cpu().numpy()
ort_inputs = {ort_session.get_inputs()[0].name: numpy_input}

# --- ONNX Runtime Measurement ---
ort_latencies = []
# Warmup
for _ in range(warmup_runs):
    _ = ort_session.run(None, ort_inputs)
# Timed runs
for _ in range(num_runs):
    start_time = time.perf_counter()
    _ = ort_session.run(None, ort_inputs)  # run() blocks until outputs are ready, so no explicit GPU sync is needed
    end_time = time.perf_counter()
    ort_latencies.append((end_time - start_time) * 1000)

ort_avg_latency = sum(ort_latencies) / num_runs
print(f"ONNX Runtime (CUDA EP) Avg. Latency: {ort_avg_latency:.2f} ms")
ORT automatically applies various graph optimizations (operator fusion, constant folding) and uses optimized CUDA kernels provided by the CUDA EP, often resulting in significant speedups over the baseline PyTorch implementation. Keep in mind that the exported graph performs a single forward pass, while the PyTorch baseline above timed a full 50-token generation; for a like-for-like comparison, either time a single forward pass in both cases or wrap the ONNX session in a decoding loop. If the exported model was already quantized, ORT can leverage specialized INT8 kernels for further gains.
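If the model was not quantized before export, ONNX Runtime also provides post-export quantization utilities. A minimal sketch using dynamic weight quantization (INT8 kernel coverage varies by Execution Provider, and dynamic quantization tends to benefit CPU inference most):

# Quantize the exported FP32 graph's weights to INT8 after export
from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic("model.onnx",       # input: FP32 model from torch.onnx.export
                 "model.int8.onnx",  # output: weight-quantized model
                 weight_type=QuantType.QInt8)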
NVIDIA TensorRT is a high-performance deep learning inference optimizer and runtime specifically for NVIDIA GPUs. It performs aggressive graph optimizations, layer fusion, kernel auto-tuning, and precision calibration (FP32, FP16, INT8) to maximize utilization of the underlying hardware, including Tensor Cores.
Building the TensorRT Engine: TensorRT typically ingests an ONNX model (or integrates with frameworks like TensorFlow/PyTorch). The core step is building a TensorRT "engine," which is a highly optimized version of the model specific to the target GPU and TensorRT configuration (e.g., precision mode). This build process can take time as TensorRT explores optimal kernels and configurations.
# Placeholder: Building TensorRT engine (often done via 'trtexec' command-line tool or Python API)
# Using trtexec (command-line example):
# trtexec --onnx=model.onnx --saveEngine=model.engine --fp16 --workspace=4096

# Using Python API (simplified conceptual flow):
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
config = builder.create_builder_config()
parser = trt.OnnxParser(network, TRT_LOGGER)

# Parse the ONNX model
with open("model.onnx", "rb") as model_file:
    if not parser.parse(model_file.read()):
        print("ERROR: Failed to parse the ONNX file.")
        for error in range(parser.num_errors):
            print(parser.get_error(error))
        raise RuntimeError("ONNX parsing failed")  # Handle the error appropriately
print("Completed parsing ONNX file")

# Configure the builder (e.g., enable FP16, INT8, set workspace)
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 4 * (1024**3))  # 4GB workspace
if builder.platform_has_fast_fp16:
    config.set_flag(trt.BuilderFlag.FP16)
    print("FP16 enabled")

# Build and save the serialized engine
plan = builder.build_serialized_network(network, config)
if plan is None:
    raise RuntimeError("Failed to build engine")

with open("model.engine", "wb") as f:
    f.write(plan)
print("TensorRT engine saved to model.engine")
Note: TensorRT engine building requires careful handling of dynamic shapes, calibration for INT8 precision, and sufficient workspace memory.
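For example, handling dynamic shapes with the Python API means registering an optimization profile before building the engine. A minimal sketch, assuming the ONNX model was exported with a dynamic input named input_ids (the min/opt/max shapes below are placeholders):

# Tell TensorRT the range of input shapes the engine must support
profile = builder.create_optimization_profile()
profile.set_shape("input_ids",
                  (1, 1),    # min: smallest (batch, sequence) the engine must handle
                  (1, 128),  # opt: shape TensorRT tunes kernels for
                  (8, 512))  # max: largest (batch, sequence) the engine must handle
config.add_optimization_profile(profile)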
TensorRT Inference: Once the engine is built, inference involves loading it and executing it with input data. This can be done using the TensorRT Python/C++ API or often more conveniently via ONNX Runtime's TensorRT Execution Provider, which handles engine management behind the scenes.
# Placeholder: Inference using ONNX Runtime with TensorRT EP
import onnxruntime as ort
import numpy as np

# Ensure TensorRT libraries are accessible in the environment
ort_session_trt = ort.InferenceSession(
    "model.onnx",  # Load from ONNX; ORT builds and caches the TRT engine
    providers=['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider'],
    provider_options=[
        {'device_id': 0,
         'trt_fp16_enable': True,           # Enable FP16 precision
         'trt_engine_cache_enable': True,
         'trt_engine_cache_path': './trt_cache'},  # Cache directory
        {}, {}  # Options for CUDA EP and CPU EP
    ])

# Prepare input (as before)
ort_inputs = {ort_session_trt.get_inputs()[0].name: numpy_input}

# --- TensorRT Measurement (via ORT) ---
trt_latencies = []
# Warmup
for _ in range(warmup_runs):
    _ = ort_session_trt.run(None, ort_inputs)
# Timed runs
for _ in range(num_runs):
    start_time = time.perf_counter()
    _ = ort_session_trt.run(None, ort_inputs)
    end_time = time.perf_counter()
    trt_latencies.append((end_time - start_time) * 1000)

trt_avg_latency = sum(trt_latencies) / num_runs
print(f"ONNX Runtime (TensorRT EP) Avg. Latency: {trt_avg_latency:.2f} ms")
TensorRT often yields the highest performance on NVIDIA GPUs due to its hardware-specific optimizations, especially when using reduced precision like FP16 or INT8.
For text generation tasks, standard batching methods can be inefficient because sequences in a batch finish at different times, leading to wasted computation on padded tokens. vLLM is a library specifically designed to address this for LLM inference. Its core innovation is PagedAttention, which manages the memory for attention keys and values much more efficiently, similar to how virtual memory and paging work in operating systems. This allows for near-optimal memory usage, reduces fragmentation, and enables continuous batching, where new sequences are added to the batch as soon as others finish.
Setup and Usage: vLLM provides a high-level Python API that often allows loading models directly from sources like the Hugging Face Hub.
# Placeholder: vLLM Inference for Text Generation
# Ensure vllm is installed: pip install vllm
from vllm import LLM, SamplingParams

# Load the model (vLLM handles optimized loading)
# Use the same model identifier as the baseline
llm = LLM(model="distilgpt2",      # Replace with your model
          tensor_parallel_size=1)  # Adjust based on available GPUs

# Define sampling parameters
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=50)

# Prepare prompts
prompts = ["Example input text", "Another prompt for batching"]

# --- vLLM Measurement ---
# Warmup might be handled internally or require a few initial runs
start_time = time.perf_counter()
outputs = llm.generate(prompts, sampling_params)
end_time = time.perf_counter()

total_time_ms = (end_time - start_time) * 1000
avg_latency_per_seq = total_time_ms / len(prompts)
print(f"vLLM Avg. Latency per sequence: {avg_latency_per_seq:.2f} ms")

# For a fair comparison, measure throughput (tokens/sec)
total_output_tokens = sum(len(output.outputs[0].token_ids) for output in outputs)
throughput_tok_sec = total_output_tokens / (total_time_ms / 1000)
print(f"vLLM Throughput: {throughput_tok_sec:.2f} tokens/sec")

# Print generated text (optional)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
vLLM excels in throughput for generation tasks, especially with variable output lengths, due to PagedAttention and continuous batching minimizing idle GPU time.
Accurate benchmarking is fundamental. Always perform warm-up runs before measuring, run multiple iterations to average results, and measure relevant metrics (e.g., latency at different percentiles like p50, p90, p99, and throughput in tokens/second or requests/second). Ensure you are measuring under realistic load conditions if building a serving system.
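As a small illustration, the per-run latency lists collected above can be summarized into those percentile metrics with a helper like the following (summarize_latencies is just an illustrative name):

# Summarize a list of per-run latencies (milliseconds) into percentile metrics
import numpy as np

def summarize_latencies(latencies_ms):
    arr = np.asarray(latencies_ms)
    return {
        "p50_ms": float(np.percentile(arr, 50)),
        "p90_ms": float(np.percentile(arr, 90)),
        "p99_ms": float(np.percentile(arr, 99)),
        "mean_ms": float(arr.mean()),
    }

print(summarize_latencies(latencies))      # baseline PyTorch
print(summarize_latencies(ort_latencies))  # ONNX Runtime (CUDA EP)
print(summarize_latencies(trt_latencies))  # ONNX Runtime (TensorRT EP)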
Let's visualize hypothetical results comparing these runtimes:
(Chart: Hypothetical comparison of inference performance for a text generation task across different runtimes. Lower latency and higher throughput indicate better performance. vLLM often shows substantial throughput gains for generation due to optimized memory management and batching.)
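You can produce a similar chart from your own measurements; a minimal sketch plotting the average latencies computed above (assumes matplotlib is installed):

# Bar chart of the average latencies measured in this section
import matplotlib.pyplot as plt

runtimes = ["PyTorch", "ORT (CUDA EP)", "ORT (TensorRT EP)"]
avg_latencies_ms = [baseline_avg_latency, ort_avg_latency, trt_avg_latency]

plt.bar(runtimes, avg_latencies_ms)
plt.ylabel("Avg. latency (ms)")
plt.title("Inference latency by runtime")
plt.tight_layout()
plt.show()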
Analysis:
Moving beyond theoretical model compression, leveraging specialized inference runtimes is essential for deploying performant LLMs. Tools like ONNX Runtime, TensorRT, and vLLM implement many of the hardware-aware optimizations discussed in this chapter, such as kernel fusion, optimized memory management (like PagedAttention), efficient hardware mapping, and support for reduced precision. By converting models and executing them within these optimized environments, significant reductions in latency and increases in throughput can be achieved. This practical step bridges the gap between a compressed model file and an efficient, deployable AI application. Experimenting with these runtimes and their configuration options is a standard part of the LLM deployment workflow.