Deploying a fine-tuned Small Language Model on a local server converts static Safetensors files (containing merged LoRA adapters and base model weights) into a live service. Using vLLM, the model is hosted locally to expose a RESTful API. A Python client then submits inference requests to the service.
vLLM provides an OpenAI-compatible API server out of the box. This compatibility means you can use existing client libraries designed for large-scale commercial models to interact directly with your local deployment. To start the server, open your terminal and execute the following command, replacing the file path with the location of your merged Safetensors model.
python -m vllm.entrypoints.openai.api_server --model /path/to/your/merged_model --port 8000
When you run this command, vLLM loads the model weights into GPU memory. The engine initializes the KV cache and prepares the REST API endpoints to accept incoming connections. You will see terminal output indicating that the server is listening for HTTP traffic on port 8000.
Figure: Data flow from the client application to the local inference engine and back.
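If the default launch exhausts GPU memory, or you want a shorter context window, vLLM accepts tuning flags at startup. The flags shown below exist in recent vLLM releases, but verify them against `python -m vllm.entrypoints.openai.api_server --help` for your installed version. Once the server reports it is listening, querying the models endpoint is a quick sanity check:

```shell
# Optional: cap GPU memory use and context length at launch
# (sketch; confirm flag names with --help for your vLLM version)
python -m vllm.entrypoints.openai.api_server \
    --model /path/to/your/merged_model \
    --port 8000 \
    --gpu-memory-utilization 0.90 \
    --max-model-len 4096

# Verify the server is up: returns a JSON list whose entry id
# matches the model path you passed at launch
curl http://localhost:8000/v1/models
```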
With the server running, you can establish communication with your model. Because vLLM mimics the OpenAI API structure, you can use the official openai Python package to send requests. If you do not have it installed in your environment, you can add it via pip.
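If the package is missing from your environment, install it first:

```shell
pip install openai
```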
from openai import OpenAI
# Initialize the client pointing to the local vLLM server
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="local-deployment"  # API key is required by the client but ignored by local vLLM
)
system_prompt = "You are a helpful assistant specialized in answering technical questions."
user_prompt = "Explain the difference between early stopping and weight decay in machine learning."
response = client.chat.completions.create(
    model="/path/to/your/merged_model",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt}
    ],
    max_tokens=200,
    temperature=0.7
)
print(response.choices[0].message.content)
The base_url argument directs the client to your local machine address and the specific port assigned during the server startup. The model parameter must match the exact directory path or name you provided to the vLLM server command.
The API call accepts parameters that control text generation behavior. The temperature parameter scales the output logits before the softmax function is applied. A value of 0.7 introduces moderate randomness, preventing the model from always selecting the most probable next token. The max_tokens argument caps the total output length, preventing runaway or excessively long responses.
Given logits $z_i$ and temperature $T$, the probability of the next token is computed as:

$$P(\text{token}_i) = \frac{\exp(z_i / T)}{\sum_{j} \exp(z_j / T)}$$

When $T = 1$, the standard softmax function applies normally. When $T < 1$, the probability distribution becomes sharper, making the model output more deterministic.
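To make the effect of temperature concrete, here is a small self-contained sketch (pure Python, no server needed) that applies temperature scaling to a fixed set of logits and shows the distribution sharpening as the temperature decreases:

```python
import math

def temperature_softmax(logits, temperature):
    """Convert logits to probabilities after dividing each logit by the temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max before exponentiating, for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
for t in (1.0, 0.7, 0.2):
    probs = temperature_softmax(logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")
```

Lower temperatures concentrate probability mass on the highest logit: at T=0.2 the top token dominates almost entirely, while at T=1.0 the alternatives retain meaningful probability.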
For longer generations, waiting for the entire response to complete before displaying it can cause noticeable application latency. To improve the responsiveness of your application, you can stream tokens as they are generated. You achieve this behavior by setting stream=True in the API call.
response_stream = client.chat.completions.create(
    model="/path/to/your/merged_model",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt}
    ],
    max_tokens=200,
    temperature=0.7,
    stream=True
)
for chunk in response_stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)
In this streaming mode, the server uses Server-Sent Events to push token updates to the client continuously. The client script iterates over the incoming network chunks and prints them immediately to the console.
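The accumulation pattern is worth noting: each chunk carries only a delta, so reconstructing the full reply means concatenating the non-empty deltas. The sketch below simulates the chunk objects locally, with no server involved; SimpleNamespace stands in for the client library's response types so the logic can be shown in isolation:

```python
from types import SimpleNamespace

def make_chunk(delta_text):
    # Mimic the shape of a streaming chunk: chunk.choices[0].delta.content
    delta = SimpleNamespace(content=delta_text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])

def collect_stream(chunks):
    """Concatenate the text deltas from a stream of chunk objects."""
    parts = []
    for chunk in chunks:
        content = chunk.choices[0].delta.content
        if content is not None:  # the final chunk's delta is often empty
            parts.append(content)
    return "".join(parts)

simulated = [make_chunk("Early "), make_chunk("stopping "),
             make_chunk("halts training."), make_chunk(None)]
print(collect_stream(simulated))  # → Early stopping halts training.
```

The same collect-and-join logic applies unchanged to a real response_stream from the client above.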
To verify that your deployment can handle multiple concurrent requests, you can write an asynchronous testing script. This confirms that the vLLM continuous batching mechanism functions correctly under load.
import asyncio
from openai import AsyncOpenAI
async_client = AsyncOpenAI(
    base_url="http://localhost:8000/v1",
    api_key="local-deployment"
)

async def fetch_response(prompt):
    response = await async_client.chat.completions.create(
        model="/path/to/your/merged_model",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=100
    )
    return response.choices[0].message.content

async def main():
    prompts = [
        "What is a gradient?",
        "Define cross-entropy loss.",
        "Explain the ReLU activation function."
    ]
    tasks = [fetch_response(p) for p in prompts]
    results = await asyncio.gather(*tasks)
    for i, result in enumerate(results):
        print(f"Response {i+1}: {result}\n")

asyncio.run(main())
Executing this asynchronous script sends all three prompts to the server concurrently. The vLLM engine batches the incoming requests dynamically to keep the GPU saturated and returns the generated texts. You now have a fully operational local inference server hosting your fine-tuned language model.