Now that we understand the importance of monitoring LLM-specific metrics, let's put theory into practice. This section provides a hands-on guide to setting up basic monitoring for a simulated LLM inference endpoint, focusing on two fundamental performance indicators: latency and throughput. While we won't deploy a full-scale LLM here, the principles and techniques demonstrated are directly applicable to real-world scenarios.
We'll use Python with the FastAPI framework to create a simple web server simulating an LLM endpoint, and the prometheus_client library to instrument it, exposing metrics in a format that can be scraped by monitoring systems like Prometheus.
First, ensure you have FastAPI and Uvicorn (an ASGI server) installed, along with the Prometheus client library:
pip install fastapi uvicorn prometheus-client requests
Now, let's create a simple FastAPI application (main.py) that mimics an LLM endpoint. It will include an artificial delay to simulate processing time and expose a /metrics endpoint for Prometheus.
# main.py
import asyncio
import random
import time

from fastapi import FastAPI, Request
from fastapi.responses import Response
from prometheus_client import (
    CONTENT_TYPE_LATEST,
    Counter,
    Histogram,
    Summary,
    generate_latest,
    REGISTRY,
)
# from prometheus_client import start_http_server  # For a standalone metrics server if not using framework integration

# --- Prometheus Metrics ---

# Using Histogram for request latency (allows calculating quantiles server-side)
REQUEST_LATENCY = Histogram(
    'llm_request_latency_seconds',
    'Latency of requests to the LLM endpoint',
    ['endpoint']
)

# Using Counter for total requests
REQUEST_COUNT = Counter(
    'llm_request_total',
    'Total number of requests to the LLM endpoint',
    ['endpoint', 'method', 'http_status']
)

# Using Summary for request latency (alternative, calculates quantiles client-side)
# REQUEST_LATENCY_SUMMARY = Summary(
#     'llm_request_latency_summary_seconds',
#     'Latency Summary of requests to the LLM endpoint',
#     ['endpoint']
# )

app = FastAPI()

# Middleware to capture metrics for all requests
@app.middleware("http")
async def track_metrics(request: Request, call_next):
    start_time = time.time()
    endpoint = request.url.path
    status_code = 500  # Default if call_next raises before a response exists
    try:
        response = await call_next(request)
        status_code = response.status_code
    finally:
        # The finally block runs whether or not call_next raised, so every
        # request is recorded, with the error status if no response was produced.
        latency = time.time() - start_time
        REQUEST_LATENCY.labels(endpoint=endpoint).observe(latency)
        # REQUEST_LATENCY_SUMMARY.labels(endpoint=endpoint).observe(latency)  # If using Summary
        REQUEST_COUNT.labels(
            endpoint=endpoint,
            method=request.method,
            http_status=status_code
        ).inc()
    return response

@app.post("/predict")
async def predict(payload: dict):
    """Simulates an LLM prediction endpoint."""
    # Simulate LLM processing time (e.g., 50ms to 500ms).
    # asyncio.sleep keeps the event loop free, so concurrent requests are not
    # serialized the way a blocking time.sleep call would serialize them.
    processing_time = random.uniform(0.05, 0.5)
    await asyncio.sleep(processing_time)

    # Simulate a simple response
    input_text = payload.get("text", "")
    response_text = f"Simulated response for: {input_text[:20]}..."
    return {"prediction": response_text}

@app.get("/metrics")
async def metrics():
    """Exposes Prometheus metrics."""
    return Response(generate_latest(REGISTRY), media_type=CONTENT_TYPE_LATEST)

# Optional: if running outside the FastAPI/Uvicorn context, you could start a
# separate server just for metrics. Serving them from the /metrics route is generally preferred.
# if __name__ == "__main__":
#     start_http_server(8001)  # Start Prometheus client server on port 8001
#     # Run the FastAPI app separately using: uvicorn main:app --reload --port 8000
Explanation:

Metrics: We define REQUEST_LATENCY (a Histogram) and REQUEST_COUNT (a Counter). Histograms are suitable for latency because they allow calculating quantiles (e.g., p95, p99) on the server side (Prometheus), which is often preferred for performance monitoring. Counters track cumulative totals.

Middleware: We use FastAPI middleware (@app.middleware("http")) to intercept every request. Before processing the request (call_next), we record the start time. After processing, we calculate the duration (latency) and observe it using our REQUEST_LATENCY histogram. We also increment REQUEST_COUNT, labeling it with the endpoint, HTTP method, and status code. The try/finally block ensures that requests are still counted with an error status when call_next raises an exception.

Simulated work: The /predict endpoint simulates processing time using asyncio.sleep(), which keeps the event loop free to handle concurrent requests.

Metrics endpoint: The /metrics endpoint uses generate_latest(REGISTRY) from the prometheus_client library to return all registered metrics in the text format expected by Prometheus.

Save the code above as main.py and run it using Uvicorn:
uvicorn main:app --reload --port 8000
Your simulated LLM API is now running at http://localhost:8000.
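Before generating sustained load, you can send a single request from a Python shell as a quick sanity check. This is a minimal sketch using the requests library (already installed above); it assumes the server from the previous step is listening on port 8000:

# quick_check.py (file name is arbitrary; any Python shell works)
import requests

resp = requests.post(
    "http://localhost:8000/predict",
    json={"text": "Hello, monitoring!"},
    timeout=2,
)
print(resp.status_code, resp.json())

A 200 status with a simulated prediction confirms that both the endpoint and the metrics middleware are active.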
Now, let's send some requests to our /predict endpoint. You can use a simple Python script (load_test.py) or tools like curl or hey.
# load_test.py
import concurrent.futures
import random
import time

import requests

API_URL = "http://localhost:8000/predict"
METRICS_URL = "http://localhost:8000/metrics"

def send_request(i):
    # Stagger requests slightly so they do not all arrive at the same instant
    time.sleep(random.uniform(0.01, 0.1))
    payload = {"text": f"This is test input number {i}."}
    try:
        start = time.time()
        response = requests.post(API_URL, json=payload, timeout=2)  # Add timeout
        latency = time.time() - start
        if response.status_code == 200:
            print(f"Request {i}: Status {response.status_code}, Latency: {latency:.4f}s")
        else:
            print(f"Request {i}: Status {response.status_code}")
        return True
    except requests.exceptions.RequestException as e:
        print(f"Request {i}: Failed - {e}")
        return False

if __name__ == "__main__":
    num_requests = 100
    max_workers = 10  # Simulate concurrent users

    print(f"Sending {num_requests} requests with {max_workers} workers...")
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(send_request, i) for i in range(num_requests)]
        concurrent.futures.wait(futures)

    print("\nLoad generation finished.")
    print(f"Check metrics at: {METRICS_URL}")
Run the load generator:
python load_test.py
While the load generator is running (or after it finishes), access the metrics endpoint in your browser at http://localhost:8000/metrics or with curl http://localhost:8000/metrics.
You will see output similar to this (values will differ):
# HELP llm_request_latency_seconds Latency of requests to the LLM endpoint
# TYPE llm_request_latency_seconds histogram
llm_request_latency_seconds_bucket{endpoint="/predict",le="0.005"} 0.0
llm_request_latency_seconds_bucket{endpoint="/predict",le="0.01"} 0.0
llm_request_latency_seconds_bucket{endpoint="/predict",le="0.025"} 0.0
llm_request_latency_seconds_bucket{endpoint="/predict",le="0.05"} 2.0
llm_request_latency_seconds_bucket{endpoint="/predict",le="0.075"} 10.0
llm_request_latency_seconds_bucket{endpoint="/predict",le="0.1"} 25.0
llm_request_latency_seconds_bucket{endpoint="/predict",le="0.25"} 68.0
llm_request_latency_seconds_bucket{endpoint="/predict",le="0.5"} 100.0
llm_request_latency_seconds_bucket{endpoint="/predict",le="0.75"} 100.0
# ... more buckets ...
llm_request_latency_seconds_bucket{endpoint="/predict",le="+Inf"} 100.0
llm_request_latency_seconds_sum{endpoint="/predict"} 28.7345...
llm_request_latency_seconds_count{endpoint="/predict"} 100.0
# HELP llm_request_total Total number of requests to the LLM endpoint
# TYPE llm_request_total counter
llm_request_total{endpoint="/predict",http_status="200",method="POST"} 100.0
llm_request_total{endpoint="/metrics",http_status="200",method="GET"} 5.0
# ... other metrics ...
This text-based format is designed to be scraped by Prometheus.
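The same text can also be inspected programmatically. As an illustration of how the _sum and _count series relate to average latency, here is a hedged sketch that fetches /metrics and derives the mean request latency using the parser bundled with prometheus_client; the file name is arbitrary, and the metric names match those defined in main.py:

# inspect_metrics.py (illustrative helper, not part of the required setup)
import requests
from prometheus_client.parser import text_string_to_metric_families

text = requests.get("http://localhost:8000/metrics", timeout=2).text

latency_sum = latency_count = None
for family in text_string_to_metric_families(text):
    for sample in family.samples:
        if sample.labels.get("endpoint") != "/predict":
            continue
        if sample.name == "llm_request_latency_seconds_sum":
            latency_sum = sample.value
        elif sample.name == "llm_request_latency_seconds_count":
            latency_count = sample.value

if latency_count:
    print(f"Average /predict latency: {latency_sum / latency_count:.4f}s over {int(latency_count)} requests")

This mirrors the "Average Latency" PromQL expression shown further below, computed once from a single scrape rather than as a rate over time.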
In a full setup, you would configure Prometheus to periodically "scrape" (fetch) metrics from the /metrics endpoint of your application instances. Grafana would then be configured with Prometheus as a data source to visualize these metrics.
Prometheus Configuration (Conceptual Snippet):
# prometheus.yml (example snippet)
scrape_configs:
  - job_name: 'llm-service'
    static_configs:
      - targets: ['localhost:8000']  # Replace with your actual endpoint(s)
Example PromQL Queries (for Grafana or Prometheus UI):
Request Rate (per second) for the /predict endpoint:
rate(llm_request_total{endpoint="/predict", method="POST"}[5m])
(Calculates the per-second average rate of increase of the counter over the last 5 minutes)
95th Percentile Latency (p95) for the /predict endpoint:
histogram_quantile(0.95, sum(rate(llm_request_latency_seconds_bucket{endpoint="/predict"}[5m])) by (le))
(Calculates the 95th percentile latency based on the histogram buckets over the last 5 minutes; choosing suitable bucket boundaries is discussed after these queries)
Average Latency:
sum(rate(llm_request_latency_seconds_sum{endpoint="/predict"}[5m])) / sum(rate(llm_request_latency_seconds_count{endpoint="/predict"}[5m]))
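The accuracy of histogram_quantile depends on the histogram's bucket boundaries. Our example uses the client library's default buckets, which are fairly coarse; if you want tighter percentile estimates around your expected LLM latencies, you can pass explicit boundaries when defining the metric. A small sketch, with illustrative (not prescribed) bucket values:

# Alternative metric definition for main.py with custom buckets (values are illustrative)
from prometheus_client import Histogram

REQUEST_LATENCY = Histogram(
    'llm_request_latency_seconds',
    'Latency of requests to the LLM endpoint',
    ['endpoint'],
    buckets=(0.05, 0.1, 0.2, 0.3, 0.5, 0.75, 1.0, 2.0, 5.0),  # tuned around the simulated 50-500ms range
)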
Visualization:
You could create dashboards in Grafana displaying these metrics over time. For instance, a dashboard might show p95 latency, average latency, and request throughput side by side, as in the example below.
Basic monitoring pipeline: Clients interact with the instrumented API, which exposes metrics via an endpoint scraped by Prometheus. Grafana queries Prometheus to display visualizations on a dashboard.
Example visualization showing simulated p95 latency, average latency, and request throughput over time. Note the latency spike around 10:15, corresponding with a dip in throughput.
By observing these basic metrics, you can start answering important questions:
Is request latency (e.g., p95) within acceptable bounds for your users?
Is throughput keeping up with incoming demand?
Is the number of error responses (non-200 statuses in llm_request_total) increasing? This signals backend problems.

This practical exercise demonstrated instrumenting an application for basic performance monitoring. Real-world LLMOps monitoring extends significantly further, incorporating LLM-specific signals beyond request latency and throughput.
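As one example of such an extension, the same Counter pattern can track generated tokens, which underpins token-level throughput and cost estimates. A minimal sketch, assuming it is added to main.py; the metric name, helper function, and whitespace-based token count are illustrative stand-ins:

# Illustrative addition to main.py: track simulated generated tokens
from prometheus_client import Counter

GENERATED_TOKENS = Counter(
    'llm_generated_tokens_total',
    'Total number of tokens generated by the LLM endpoint',
    ['endpoint']
)

def record_generated_tokens(endpoint: str, response_text: str) -> None:
    """Increment the token counter; whitespace splitting stands in for a real tokenizer."""
    token_count = len(response_text.split())
    GENERATED_TOKENS.labels(endpoint=endpoint).inc(token_count)

# In the /predict handler, after building response_text:
# record_generated_tokens("/predict", response_text)

With this in place, a PromQL expression such as rate(llm_generated_tokens_total[5m]) approximates tokens generated per second, a common LLM-specific throughput measure.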
Setting up foundational latency and throughput monitoring provides the initial visibility needed to operate LLMs reliably and forms the basis upon which more advanced observability can be built.