Making a model available for inference is the primary objective after saving merged weights in the Safetensors format. While loading the weights in a standard PyTorch and Transformers script works for local testing, it is highly inefficient for handling many simultaneous requests. Standard inference scripts suffer from low throughput and high latency because they do not manage GPU memory efficiently during continuous text generation. To serve the model effectively in production, you need a specialized inference engine.
vLLM is an open-source library designed for high-throughput, memory-efficient language model serving. It improves inference speed significantly compared to standard Hugging Face pipelines. The primary mechanism behind vLLM is PagedAttention, an algorithm that addresses the memory bottleneck typically associated with text generation.
During text generation, autoregressive models generate tokens one by one. To avoid recomputing the attention scores for past tokens, the model stores their key and value vectors in GPU memory. This stored data is known as the KV cache.
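The size of the KV cache is easy to estimate: each token stores one key and one value vector per layer. The sketch below does this arithmetic for a hypothetical 7B-class model; the layer count, head count, and head dimension are illustrative assumptions, not the specs of any particular checkpoint.

```python
# Back-of-envelope KV cache size for a hypothetical 7B-class model.
# These figures are illustrative assumptions, not a real checkpoint's specs.
num_layers = 32
num_kv_heads = 32
head_dim = 128
bytes_per_value = 2  # fp16

# Each token stores one key and one value vector per layer.
bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

seq_len = 2048
cache_bytes = bytes_per_token * seq_len
print(f"KV cache per token: {bytes_per_token / 1024:.0f} KiB")           # 512 KiB
print(f"KV cache for {seq_len} tokens: {cache_bytes / 1024**3:.2f} GiB")  # 1.00 GiB
```

At roughly half a megabyte per token, a single 2048-token sequence already consumes a gigabyte of VRAM, which is why how this cache is allocated matters so much.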
In standard implementations, the KV cache is allocated contiguously in memory. Because the exact length of the generated sequence is unknown beforehand, systems usually over-allocate memory just in case the sequence grows to the maximum limit. This leads to heavy memory fragmentation and wasted VRAM.
PagedAttention solves this by managing the KV cache like an operating system manages virtual memory. It divides the sequences into fixed size blocks. These blocks do not need to be contiguous in physical memory. This allows the system to allocate memory dynamically as the sequence grows, nearly eliminating waste and allowing the server to batch many more requests together.
Figure: Memory allocation using PagedAttention, where input prompts are dynamically assigned to non-contiguous KV cache blocks.
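The block-table idea can be sketched in a few lines of Python. This toy allocator is illustrative only; the block size, pool size, and class names are assumptions for the sketch, not vLLM's internals, which manage physical GPU blocks.

```python
# Toy sketch of block-based KV cache allocation in the spirit of PagedAttention.
# We track block indices only, to show that a sequence's blocks need not be
# contiguous and that freed blocks are immediately reusable by other sequences.
BLOCK_SIZE = 16  # tokens per block (illustrative)

class PagedCache:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # sequence id -> list of physical block ids

    def append_token(self, seq_id, position):
        # Allocate a new block only when the sequence crosses a block boundary.
        if position % BLOCK_SIZE == 0:
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop(0))

    def free(self, seq_id):
        self.free_blocks.extend(self.block_tables.pop(seq_id))

cache = PagedCache(num_blocks=8)
for pos in range(40):   # sequence A: 40 tokens -> 3 blocks
    cache.append_token("A", pos)
for pos in range(20):   # sequence B: 20 tokens -> 2 blocks
    cache.append_token("B", pos)
cache.free("A")         # A finishes; its blocks return to the pool
for pos in range(17):   # sequence C reuses freed blocks
    cache.append_token("C", pos)
print(cache.block_tables)  # {'B': [3, 4], 'C': [5, 6]}
```

Because allocation happens one block at a time, no sequence reserves memory it has not yet used, which is what lets the server pack many more concurrent requests onto the same GPU.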
Installing vLLM is straightforward using the standard Python package manager.
pip install vllm
Once installed, vLLM can be run as a standalone server or imported as a Python module. For a deployed application, running it as an OpenAI-compatible server is the most practical approach. This compatibility means that any client or library designed to work with the OpenAI API will automatically work with your locally hosted small language model.
You can start the vLLM server directly from the command line. Point it to the directory containing your merged Safetensors model.
python -m vllm.entrypoints.openai.api_server --model ./my-merged-slm --host 0.0.0.0 --port 8000
There are several important parameters you can configure to optimize the server for your hardware constraints:

--gpu-memory-utilization: the fraction of GPU VRAM vLLM may reserve for weights and KV cache (default 0.9). Lower it if other processes share the GPU.
--max-model-len: the maximum context length. Reducing it shrinks the memory reserved per sequence.
--dtype: the data type for weights and activations, such as float16 or bfloat16.
--tensor-parallel-size: the number of GPUs to shard the model across.
With the server running, it listens for incoming HTTP requests on port 8000. You can interact with it using a standard curl command. Because it mimics the OpenAI API structure, the request format relies on standard JSON fields.
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "./my-merged-slm",
    "prompt": "Explain the difference between a list and a tuple in Python.",
    "max_tokens": 150,
    "temperature": 0.7
  }'
The server processes the request, applies the PagedAttention optimizations under the hood, and returns a JSON response containing the generated text.
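The same request can be issued from Python. This sketch uses only the standard library; the endpoint path and JSON fields follow the OpenAI completions format that vLLM mirrors, and the model path is the hypothetical merged checkpoint used throughout this section. The actual network call is commented out since it requires the server to be running.

```python
import json
import urllib.request

# Build the same completion request as the curl example above.
payload = {
    "model": "./my-merged-slm",
    "prompt": "Explain the difference between a list and a tuple in Python.",
    "max_tokens": 150,
    "temperature": 0.7,
}
body = json.dumps(payload).encode("utf-8")
request = urllib.request.Request(
    "http://localhost:8000/v1/completions",
    data=body,
    headers={"Content-Type": "application/json"},
)
# With the vLLM server running, this prints the generated text:
# with urllib.request.urlopen(request) as resp:
#     print(json.load(resp)["choices"][0]["text"])
```

Because the response follows the OpenAI schema, the generated text lives under `choices[0].text`, so any OpenAI-compatible client library can be swapped in without changing your application code.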
Sometimes you do not need a persistent server but want to process a large batch of data locally with maximum throughput. vLLM provides an LLM class for offline inference that handles this natively.
from vllm import LLM, SamplingParams
# Initialize the model
llm = LLM(model="./my-merged-slm", gpu_memory_utilization=0.8)
# Define generation parameters
sampling_params = SamplingParams(temperature=0.7, max_tokens=150)
# List of prompts to process
prompts = [
    "Write a short function to calculate the factorial of a number.",
    "What are the benefits of using a dictionary in Python?",
]
# Generate outputs
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt}\nOutput: {generated_text}\n")
This offline approach bypasses the network overhead entirely. The generate method automatically batches the prompts, maximizing GPU utilization and reducing the total time required to process large datasets. You now have a fast and scalable mechanism to serve your customized language model.