While a simple web server using frameworks like Flask or FastAPI might suffice for deploying smaller machine learning models, serving large language models (LLMs) effectively presents a different set of engineering problems. The sheer size of LLM parameters strains memory resources, autoregressive generation is latency-sensitive, and handling concurrent user requests efficiently requires sophisticated batching and resource management. Standard web servers are not optimized for these GPU-intensive, stateful inference workloads. This is where specialized model serving frameworks become indispensable.
These frameworks are designed explicitly for deploying machine learning models at scale, providing optimized performance, better hardware utilization, and operational robustness needed for LLMs. They abstract away much of the underlying complexity of managing inference requests, model loading, hardware acceleration, and concurrency. Let's examine two prominent examples: NVIDIA Triton Inference Server and PyTorch TorchServe.
NVIDIA Triton is a high-performance inference serving platform designed to deploy models from various frameworks (including PyTorch, TensorFlow, ONNX Runtime, TensorRT, and custom backends) on both GPUs and CPUs. Its architecture is geared towards maximizing throughput and hardware utilization, making it a strong candidate for demanding LLM workloads.
Important features relevant for LLM serving include:
Native PyTorch support. TorchScript models can be served directly through the PyTorch backend (libtorch). For maximum performance, especially on NVIDIA GPUs, models can often be converted to TensorRT and served via the TensorRT backend, potentially offering lower latency and higher throughput.
Dynamic batching. Incoming requests can be grouped into larger batches on the server side before they reach the model, improving GPU utilization and throughput under concurrent traffic.
Concurrent model execution. Multiple instances of a model can run on one or more GPUs to serve parallel request streams.
Triton uses a declarative configuration approach. You typically define a model repository structure where each model has a config.pbtxt file specifying its platform, backend, input/output tensors, version policy, and instance group settings (controlling how many instances run on which devices). Dynamic batching is also configured here.
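Before writing the configuration itself, it helps to picture the model repository layout Triton expects. A minimal sketch for the model configured below (directory and file names are illustrative) places a config.pbtxt next to one numbered subdirectory per model version; the server is then pointed at the top-level directory via its --model-repository flag:
model_repository/
    my_llm_model/
        config.pbtxt
        1/
            model.pt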
Here's a simplified example of a config.pbtxt for a hypothetical PyTorch LLM using dynamic batching:
# config.pbtxt for a hypothetical PyTorch LLM model in Triton
name: "my_llm_model"
platform: "pytorch_libtorch"  # Specify the backend
max_batch_size: 64  # Maximum batch size Triton can form
input [
  {
    name: "input_ids"
    data_type: TYPE_INT64
    dims: [ -1 ]  # Variable sequence length dimension
  },
  {
    name: "attention_mask"
    data_type: TYPE_INT64
    dims: [ -1 ]  # Variable sequence length dimension
  }
]
output [
  {
    name: "logits"
    data_type: TYPE_FP32
    dims: [ -1, 50257 ]  # Example: sequence length, vocabulary size
  }
]
dynamic_batching {
  preferred_batch_size: [ 4, 8, 16, 32 ]  # Batches Triton prefers to form
  max_queue_delay_microseconds: 10000  # Max time (10ms) a request waits to be batched
}
instance_group [
  {
    count: 1  # Number of instances of this model
    kind: KIND_GPU
    gpus: [ 0 ]  # Assign to GPU 0
  }
]
# Optional: Specify the default model filename (if not model.pt)
# default_model_filename: "my_llm_scripted.pt"
Basic interaction flow within the NVIDIA Triton Inference Server.
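The interaction flow above can be exercised from Python with Triton's HTTP client. The snippet below is a minimal sketch, assuming the server listens on localhost:8000, the tritonclient package is installed, and the tensor names match the configuration above; a real client would build input_ids and attention_mask with the model's tokenizer rather than random IDs.
# Hypothetical Triton client sketch for the my_llm_model configuration above
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Placeholder token IDs standing in for real tokenizer output (batch of 1, length 8)
input_ids = np.random.randint(0, 50257, size=(1, 8), dtype=np.int64)
attention_mask = np.ones((1, 8), dtype=np.int64)

inputs = [
    httpclient.InferInput("input_ids", list(input_ids.shape), "INT64"),
    httpclient.InferInput("attention_mask", list(attention_mask.shape), "INT64"),
]
inputs[0].set_data_from_numpy(input_ids)
inputs[1].set_data_from_numpy(attention_mask)

outputs = [httpclient.InferRequestedOutput("logits")]

result = client.infer(model_name="my_llm_model", inputs=inputs, outputs=outputs)
logits = result.as_numpy("logits")
print(logits.shape)  # Expected shape: (batch, sequence_length, 50257)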
Triton's strength lies in its broad compatibility, performance optimization features like dynamic batching and TensorRT integration, and robust handling of concurrent requests, making it well-suited for large-scale, multi-model deployments.
TorchServe is an open-source model serving framework developed specifically for PyTorch models. It aims to provide an easy and performant way to deploy PyTorch models into production environments. Being PyTorch-native, it often offers a more straightforward path for teams already heavily invested in the PyTorch ecosystem.
Important features relevant for LLM serving include:
Model packaging with the torch-model-archiver tool. The torch-model-archiver utility packages your model code (e.g., model.py), serialized weights (a .pt or .pth file), and a custom handler file (handler.py) into a single .mar (Model Archive) file, which is the unit of deployment for TorchServe.
Custom handlers. By subclassing the default handler (BaseHandler) in Python, you specify the exact pre-processing (e.g., tokenization), inference call, and post-processing (e.g., detokenization, generating text from logits) logic. This gives you fine-grained control over the request lifecycle directly in Python.
Deploying an LLM with TorchServe typically involves creating a custom handler to manage the tokenization and generation process.
Here's a snippet of what a custom handler (handler.py) for a generative LLM might look like:
# handler.py (example for TorchServe)
import json
import logging

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

from ts.torch_handler.base_handler import BaseHandler

logger = logging.getLogger(__name__)


class LLMHandler(BaseHandler):
    def __init__(self):
        super().__init__()
        self.initialized = False
        self.tokenizer = None
        self.model = None
        self.device = None

    def initialize(self, context):
        """
        Load model and tokenizer. Called once when the model is loaded.
        """
        properties = context.system_properties
        model_dir = properties.get("model_dir")

        # Determine device based on CUDA availability
        use_cuda = (
            torch.cuda.is_available()
            and properties.get("gpu_id") is not None
        )
        if use_cuda:
            self.device = torch.device("cuda:" + str(properties.get("gpu_id")))
        else:
            self.device = torch.device("cpu")
        logger.info(f"Loading model onto device: {self.device}")

        # Load tokenizer and model from the model directory
        self.tokenizer = AutoTokenizer.from_pretrained(model_dir)
        self.model = AutoModelForCausalLM.from_pretrained(model_dir)
        self.model.to(self.device)
        self.model.eval()  # Set model to evaluation mode

        # Add padding token if missing (common for some models)
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token
            self.model.config.pad_token_id = self.model.config.eos_token_id

        logger.info("Transformer model and tokenizer loaded successfully.")
        self.initialized = True

    def preprocess(self, requests):
        """
        Tokenize input prompts. requests is a list of requests.
        """
        input_texts = [req.get("data") or req.get("body") for req in requests]

        # Each item may arrive as raw bytes (a JSON document or plain text)
        # or as an already-parsed dictionary with a 'prompt' field
        prompts = []
        for item in input_texts:
            if isinstance(item, (bytes, bytearray)):
                item = item.decode("utf-8")
                try:
                    item = json.loads(item)
                except json.JSONDecodeError:
                    pass  # Treat the payload as a plain-text prompt
            if isinstance(item, dict):
                prompt = item.get("prompt")
            else:
                prompt = item
            prompts.append(prompt)
        logger.info(f"Received prompts: {prompts}")

        # Tokenize the batch of prompts
        inputs = self.tokenizer(
            prompts, return_tensors="pt", padding=True
        ).to(self.device)
        # inputs is a dictionary containing 'input_ids' and 'attention_mask'
        return inputs

    def inference(self, inputs):
        """
        Perform model inference (generation).
        """
        # Fixed generation parameters for this example; a fuller handler
        # could read these from the request payload instead
        max_new_tokens = 50
        do_sample = True
        temperature = 0.7
        top_p = 0.9

        with torch.no_grad():
            # Model generation call
            outputs = self.model.generate(
                **inputs,  # Pass input_ids and attention_mask
                max_new_tokens=max_new_tokens,
                do_sample=do_sample,
                temperature=temperature,
                top_p=top_p,
                pad_token_id=self.tokenizer.pad_token_id,
                eos_token_id=self.tokenizer.eos_token_id,
            )
        # Tensor containing generated token IDs
        return outputs

    def postprocess(self, outputs):
        """
        Detokenize the generated sequences.
        """
        # Decode generated token IDs back to text, skipping special tokens.
        # Note: the decoded strings contain the prompt followed by the newly
        # generated continuation.
        generated_texts = self.tokenizer.batch_decode(
            outputs, skip_special_tokens=True
        )
        logger.info(f"Generated texts: {generated_texts}")
        # Return list of generated strings (one per request in the batch)
        return generated_texts
# To package this (assuming model weights are saved in ./my_llm_model/):
# $ torch-model-archiver --model-name my_llm \
#       --version 1.0 \
#       --serialized-file ./my_llm_model/pytorch_model.bin \
#       --model-file ./my_llm_model/modeling_utils.py \
#       --handler handler.py \
#       --extra-files "./my_llm_model/config.json,./my_llm_model/tokenizer.json,./my_llm_model/tokenizer_config.json,./my_llm_model/special_tokens_map.json" \
#       --export-path ./model_store
#
# $ torchserve --start --ncs --model-store ./model_store --models my_llm=my_llm.mar
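Once TorchServe is running, clients call the standard inference endpoint for the registered model. Below is a minimal sketch using the requests library, assuming TorchServe's default inference port (8080) and the model name my_llm registered above; the exact response format depends on what the handler's postprocess step returns.
# Hypothetical client sketch: send a prompt to the TorchServe model above
import requests

response = requests.post(
    "http://localhost:8080/predictions/my_llm",
    json={"prompt": "The key challenge in serving LLMs is"},
    timeout=120,
)
response.raise_for_status()
print(response.text)  # Text generated by the handler's postprocess step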
TorchServe provides a streamlined path for deploying PyTorch models, offering flexibility through Python-based custom handlers and good integration with the PyTorch ecosystem's tools and practices.
The choice between Triton and TorchServe often depends on specific project needs and existing infrastructure. Triton offers broad multi-framework support, performance features such as dynamic batching and TensorRT integration, and a declarative configuration model (config.pbtxt). TorchServe is PyTorch-native and keeps the entire request lifecycle in Python through custom handlers, which is often the more straightforward path for teams already invested in the PyTorch ecosystem.
Both frameworks are capable of serving large models efficiently. They provide the necessary abstractions and optimizations (such as batching and hardware acceleration integration) that are difficult and time-consuming to build from scratch, enabling engineering teams to focus on model development and application logic rather than low-level serving infrastructure. Deploying LLMs reliably and at scale requires moving beyond basic web servers to these specialized inference serving solutions.