Once an LLM is trained and optimized, making it accessible to applications requires a well-defined Application Programming Interface (API). This API acts as the contract between the serving infrastructure and the client applications, specifying how requests should be formatted and what responses to expect. Designing this interface thoughtfully is important for usability, performance, and maintainability. Poor API design can lead to inefficient resource usage, difficult client integration, and poor user experiences.
Most LLM interactions happen over a network, making standard web protocols the typical choice. For general-purpose LLM serving, a well-designed REST/HTTP API provides a good balance of performance and accessibility.
A client needs to send the model sufficient information to generate text. Beyond the prompt itself, a typical request payload might include:
max_new_tokens: The maximum number of tokens to generate. Prevents runaway generation and controls response length.

temperature: Controls the randomness of the output. Lower values (e.g., 0.2) make the output more deterministic and focused, while higher values (e.g., 0.8) increase diversity and creativity. A value of 0 effectively reduces sampling to greedy decoding.

top_p (nucleus sampling): Specifies a probability threshold p. The model considers only the smallest set of tokens whose cumulative probability exceeds p. This often yields better results than temperature alone. Values typically range from 0.8 to 1.0.

top_k: Limits sampling to the k most likely next tokens. Can be used with or instead of top_p.

repetition_penalty: Penalizes tokens that have already appeared in the prompt or the generated sequence, discouraging repetitive output. A value greater than 1.0 increases the penalty.

stop_sequences: A list of strings that, if generated, signal the end of the response.

Here's a possible JSON request body for a text completion API:
{
"prompt": "Explain the concept of Key-Value caching in LLM inference:",
"model": "llm-engine-v1.2",
"max_new_tokens": 250,
"temperature": 0.7,
"top_p": 0.9,
"stop_sequences": ["\n\n", "---"],
"stream": false
}
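To make these sampling parameters concrete, here is a minimal, framework-agnostic sketch of how a server might apply temperature, top_k, and top_p filtering to a logits vector before picking the next token. The function name, defaults, and structure are illustrative, not part of the API contract.

import numpy as np

def sample_next_token(logits, temperature=0.7, top_k=0, top_p=0.9):
    # Illustrative only: apply temperature, top_k, and top_p (nucleus)
    # filtering to raw logits, then sample one token id.
    logits = np.asarray(logits, dtype=np.float64)

    # temperature == 0 is treated as greedy decoding
    if temperature == 0:
        return int(np.argmax(logits))

    # Softmax with temperature (numerically stabilized)
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    # top_k: zero out everything but the k most likely tokens
    if top_k > 0:
        kth_largest = np.sort(probs)[-top_k]
        probs = np.where(probs >= kth_largest, probs, 0.0)
        probs /= probs.sum()

    # top_p: keep the smallest set of tokens whose cumulative
    # probability exceeds p, starting from the most likely token
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    mask = np.zeros_like(probs)
    mask[order[:cutoff]] = 1.0
    probs *= mask
    probs /= probs.sum()

    return int(np.random.choice(len(probs), p=probs))

# Example with a toy vocabulary of 4 tokens
print(sample_next_token([2.0, 1.0, 0.5, -1.0], temperature=0.7, top_p=0.9))

A temperature of 0 short-circuits to greedy decoding, matching the behavior described in the parameter list above.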
The response should provide the generated text and relevant metadata. When stream is false, the response is typically a single JSON object containing:

generated_text: The complete output string from the model.

finish_reason: Indicates why generation stopped (e.g., length if max_new_tokens was reached, stop_sequence if a stop sequence was generated, eos_token if the model finished naturally).

usage: Token counts (e.g., prompt_tokens, completion_tokens, total_tokens). Useful for tracking costs or resource consumption.

Example non-streaming response:
{
"id": "cmpl-xyz123",
"object": "text_completion",
"created": 1677652288,
"model": "llm-engine-v1.2",
"choices": [
{
"text": " Key-Value (KV) caching is a critical optimization technique used during the autoregressive decoding process in Transformer-based Large Language Models (LLMs). During generation, each new token depends on the attention calculations involving all previous tokens. The keys (K) and values (V) computed for previous tokens in the self-attention layers remain constant as new tokens are generated. Instead of recomputing these K and V tensors for the entire sequence at every step, KV caching stores them in memory (typically GPU HBM). When generating the next token, the model only needs to compute the K and V for the newest token and reuse the cached values for all preceding tokens. This significantly reduces the computational cost per generated token, making inference much faster, especially for long sequences.",
"index": 0,
"logprobs": null,
"finish_reason": "stop_sequence"
}
],
"usage": {
"prompt_tokens": 13,
"completion_tokens": 151,
"total_tokens": 164
}
}
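The usage block is what makes client-side cost and capacity tracking possible. As a small worked example, the snippet below estimates the cost of the response above; the per-token rates are made-up placeholders, since real pricing depends entirely on your deployment.

# Hypothetical pricing sketch: the rates below are placeholders, not real prices.
PROMPT_RATE = 0.50 / 1_000_000       # dollars per prompt token (assumed)
COMPLETION_RATE = 1.50 / 1_000_000   # dollars per completion token (assumed)

usage = {"prompt_tokens": 13, "completion_tokens": 151, "total_tokens": 164}
cost = (usage["prompt_tokens"] * PROMPT_RATE
        + usage["completion_tokens"] * COMPLETION_RATE)
print(f"Estimated cost: ${cost:.6f}")  # -> Estimated cost: $0.000233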
When stream is true, the server sends back a sequence of events, often using Server-Sent Events (SSE). Each event typically contains a chunk of the generated text. The final event includes the finish_reason and usage statistics. This allows the client to display the response progressively, improving perceived performance for users.

Example sequence of SSE events (simplified):
event: token
data: {"text": " "}
event: token
data: {"text": "-Value"}
event: token
data: {"text": " (KV"}
event: token
data: {"text": ") caching"}
... many more token events ...
event: token
data: {"text": "."}
event: done
data: {"finish_reason": "stop_sequence", "usage": {"prompt_tokens": 13, "completion_tokens": 151, "total_tokens": 164}}
Flow of a streaming API request and response. The client initiates a request, and the service sends back text chunks as they are generated.
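For reference, here is a rough server-side sketch of how this event format could be produced. It assumes FastAPI and uses a placeholder fake_generate function in place of the real decoding loop; both are illustrative choices, not requirements of the protocol.

# Minimal SSE emitter sketch. FastAPI and fake_generate are assumptions made
# for illustration; any web framework that supports streaming responses works.
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()

class GenerateRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 256
    stream: bool = False

def fake_generate(prompt, max_new_tokens):
    # Stand-in for the real decoding loop that yields tokens one at a time.
    for token in ["Key", "-Value", " (KV)", " caching", " stores",
                  " attention", " state", "."]:
        yield token

@app.post("/generate")
def generate(req: GenerateRequest):
    def event_stream():
        completion_tokens = 0
        for token in fake_generate(req.prompt, req.max_new_tokens):
            completion_tokens += 1
            token_json = json.dumps({"text": token})
            yield f"event: token\ndata: {token_json}\n\n"
        done_json = json.dumps({
            "finish_reason": "eos_token",
            "usage": {"completion_tokens": completion_tokens},
        })
        yield f"event: done\ndata: {done_json}\n\n"
    return StreamingResponse(event_stream(), media_type="text/event-stream")

Each yielded string follows the SSE wire format: an event: line, a data: line carrying a JSON payload, and a blank line terminating the event.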
Generating text with LLMs can take significant time, from seconds to minutes, especially for long outputs, so clients should set generous timeouts and prefer streaming for interactive use cases.
Here's how a client might interact with these API endpoints using Python's requests library:
Standard (Non-Streaming) Request:
import requests
import json

API_URL = "http://your-llm-service.com/generate"
API_KEY = "your_api_key_here"  # Should be handled securely

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json"
}

payload = {
    "prompt": "What is the capital of Malaysia?",
    "max_new_tokens": 50,
    "temperature": 0.5,
    "stream": False
}

try:
    response = requests.post(API_URL,
                             headers=headers,
                             json=payload,
                             timeout=30)  # Set a timeout
    response.raise_for_status()  # Raise an exception for bad status
                                 # codes (4xx or 5xx)
    result = response.json()
    generated_text = result['choices'][0]['text']
    print(f"Generated Text: {generated_text}")
    print(f"Usage: {result['usage']}")
except requests.exceptions.RequestException as e:
    print(f"API Request failed: {e}")
except KeyError as e:
    print(f"Failed to parse response: Missing {e}")
Streaming Request:
import requests
import json

API_URL = "http://your-llm-service.com/generate"
API_KEY = "your_api_key_here"  # Should be handled securely

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json",
    "Accept": "text/event-stream"  # Indicate preference for SSE
}

payload = {
    "prompt": "Write a short story about a robot learning to paint:",
    "max_new_tokens": 300,
    "temperature": 0.8,
    "stream": True
}

try:
    # Use stream=True in requests to handle the streaming connection
    with requests.post(
        API_URL,
        headers=headers,
        json=payload,
        stream=True,
        timeout=180
    ) as response:
        response.raise_for_status()

        print("Streaming response:")
        full_response = ""
        final_data = None
        current_event = None

        # Iterate over the stream line by line, tracking the current SSE
        # event type so each 'data:' line is interpreted correctly.
        for line in response.iter_lines():
            if not line:
                continue  # Skip blank lines separating SSE events
            decoded_line = line.decode('utf-8')

            if decoded_line.startswith('event:'):
                current_event = decoded_line[len('event:'):].strip()
            elif decoded_line.startswith('data:'):
                data_str = decoded_line[len('data:'):].strip()
                try:
                    content = json.loads(data_str)
                except json.JSONDecodeError:
                    print(f"\nError decoding JSON: {data_str}")
                    continue

                if current_event == 'token':
                    text_chunk = content.get('text', '')
                    print(text_chunk, end='', flush=True)
                    full_response += text_chunk
                elif current_event == 'done':
                    final_data = content
                    break  # Exit loop once done event is received

        print("\n--- Stream finished ---")
        if final_data:
            print(f"Finish Reason: {final_data.get('finish_reason')}")
            print(f"Usage: {final_data.get('usage')}")
        else:
            print("Final data not received.")

except requests.exceptions.RequestException as e:
    print(f"\nAPI Request failed: {e}")
APIs should define clear error responses using standard HTTP status codes:

200 OK: Successful request (non-streaming, or the final event of a stream).

400 Bad Request: Invalid input (e.g., malformed JSON, missing required parameters, invalid parameter values). Include a descriptive error message in the response body.

401 Unauthorized: Missing or invalid authentication credentials.

403 Forbidden: The authenticated user does not have permission.

429 Too Many Requests: The rate limit was exceeded. Include a Retry-After header if possible; an example 429 response is sketched below.

500 Internal Server Error: An unexpected error occurred on the server side during generation.

503 Service Unavailable: The service is temporarily overloaded or down for maintenance.

LLM APIs, like any web service, must be secured. Implement appropriate authentication (e.g., API keys, OAuth 2.0) and authorization mechanisms to control access. Use HTTPS to encrypt traffic. Be mindful of potential prompt injection attacks if prompts incorporate untrusted user input. Input validation is also a form of security, preventing malformed requests from causing issues deeper in the system.
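As an illustration of the 429 case above, a rate-limited request might look like the following on the wire. The shape of the error body is not standardized, so treat these field names as one possible convention rather than a fixed schema.

HTTP/1.1 429 Too Many Requests
Retry-After: 30
Content-Type: application/json

{
  "error": {
    "type": "rate_limit_exceeded",
    "message": "Request rate limit reached. Retry after 30 seconds."
  }
}

Clients should honor the Retry-After header and back off rather than retrying immediately.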
Designing a clean, well-documented API is fundamental to making your LLM accessible and useful. Consider the needs of your client applications, the nature of LLM generation (potentially slow, variable length), and standard web practices when defining your interface.