Putting theory into practice is essential for mastering operational management. This section guides you through setting up a basic monitoring system for a deployed diffusion model inference service. We will leverage Prometheus for metrics collection and Grafana for visualization, a common stack in MLOps environments. We assume you have a containerized diffusion model API service running, perhaps similar to the one developed in the previous chapter's practical, accessible within your deployment environment (like Kubernetes or a VM).
The first step is to make your application expose relevant metrics in a format Prometheus can understand. We'll use the Prometheus exposition format. Most web frameworks have corresponding Prometheus client libraries. For a Python FastAPI application, the prometheus-fastapi-instrumentator library is a convenient option, or you can use the standard prometheus_client directly.
Let's instrument a hypothetical FastAPI application to expose core metrics:
Install necessary libraries:
pip install prometheus-fastapi-instrumentator prometheus-client
Modify your FastAPI application:
import asyncio
import random
import time

import uvicorn
from fastapi import FastAPI
from fastapi.responses import JSONResponse
from prometheus_fastapi_instrumentator import Instrumentator

# Assume 'generate_image' is your core diffusion model inference function
async def generate_image(prompt: str):
    # Simulate generation time (replace with the actual model call)
    start_time = time.time()
    processing_time = random.uniform(5.0, 15.0)  # Simulate 5-15 seconds
    await asyncio.sleep(processing_time)  # Non-blocking sleep so the event loop stays responsive
    # Simulate occasional errors
    if random.random() < 0.05:  # 5% error rate
        raise ValueError("Simulated generation error")
    latency_ms = (time.time() - start_time) * 1000
    # In a real scenario, you'd also return the image data
    return {"prompt": prompt, "latency_ms": latency_ms, "status": "success"}

app = FastAPI()

# Instrument the app and expose the /metrics endpoint
Instrumentator().instrument(app).expose(app)

@app.post("/generate")
async def handle_generate(payload: dict):
    prompt = payload.get("prompt", "a default prompt")
    try:
        return await generate_image(prompt)
    except Exception as e:
        # Return a genuine 5xx response so error-rate metrics reflect the failure
        return JSONResponse(status_code=500, content={"error": str(e), "status": "failure"})

# Add custom metrics if needed (Example: GPU Utilization)
# from prometheus_client import Gauge
# GPU_UTILIZATION = Gauge('gpu_utilization_percent', 'Current GPU Utilization (%)')
# You would need a separate process/thread to monitor nvidia-smi or similar
# and update this gauge periodically.
# Example: GPU_UTILIZATION.set(get_current_gpu_utilization())

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
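If you want the GPU gauge from the commented example above, it needs its own update loop. The sketch below shows one possible approach, assuming nvidia-smi is available on the host and that GPU 0 serves the model; it is illustrative rather than part of the instrumentator library, and it relies on the default prometheus_client registry so the gauge appears on the same /metrics endpoint.

# Hypothetical sketch: poll nvidia-smi in a background thread and update a Gauge.
# Assumes nvidia-smi is installed and GPU 0 serves the model.
import subprocess
import threading
import time

from prometheus_client import Gauge

GPU_UTILIZATION = Gauge('gpu_utilization_percent', 'Current GPU Utilization (%)')

def _poll_gpu_utilization(interval_seconds: float = 5.0):
    while True:
        try:
            output = subprocess.check_output(
                ["nvidia-smi", "--query-gpu=utilization.gpu",
                 "--format=csv,noheader,nounits"],
                text=True,
            )
            # The first line of output corresponds to GPU 0.
            GPU_UTILIZATION.set(float(output.splitlines()[0]))
        except (subprocess.CalledProcessError, FileNotFoundError, ValueError, IndexError):
            # Leave the gauge unchanged if the query fails.
            pass
        time.sleep(interval_seconds)

# Start the poller as a daemon thread so it does not block application shutdown.
threading.Thread(target=_poll_gpu_utilization, daemon=True).start()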
Running this application and accessing the /metrics endpoint (e.g., http://localhost:8000/metrics) will now show Prometheus metrics, including automatically instrumented ones like http_requests_total and http_request_duration_seconds, as well as any custom metrics you added. The prometheus-fastapi-instrumentator automatically provides latency histograms, which are useful for calculating percentiles.
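Before pointing Prometheus at the service, it can be useful to confirm that the endpoint actually emits these series. Here is a minimal sketch using only the standard library, assuming the instrumented service is running locally on port 8000:

# Fetch the exposition output and print the HTTP request metrics it contains.
# Assumes the instrumented service is running on localhost:8000.
import urllib.request

with urllib.request.urlopen("http://localhost:8000/metrics") as response:
    for line in response.read().decode("utf-8").splitlines():
        if line.startswith("http_request"):
            print(line)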
Prometheus needs to be configured to periodically "scrape" (fetch) metrics from your application's /metrics endpoint. Assuming you have Prometheus running (e.g., as a Docker container or Kubernetes service), update its configuration file (prometheus.yml).
# prometheus.yml (example snippet)
global:
  scrape_interval: 15s # How frequently to scrape targets

scrape_configs:
  - job_name: 'diffusion-api'
    static_configs:
      - targets: ['<your-diffusion-api-ip-or-hostname>:8000'] # Replace with actual target address
    # If running in Kubernetes, use service discovery instead of static_configs
    # kubernetes_sd_configs:
    #   - role: endpoints
    # relabel_configs: ... (to select the correct service/pods)
After updating the configuration, restart or reload Prometheus. It will begin collecting metrics from your service.
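One way to verify that the scrape is working is to ask Prometheus itself about its active targets through its HTTP API. A small sketch, assuming Prometheus is reachable at localhost:9090; adjust the host to match your setup:

# List the health of the 'diffusion-api' scrape targets via the Prometheus HTTP API.
# Assumes Prometheus is reachable at localhost:9090.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:9090/api/v1/targets") as response:
    payload = json.load(response)

for target in payload["data"]["activeTargets"]:
    if target["labels"].get("job") == "diffusion-api":
        print(target["scrapeUrl"], target["health"], target.get("lastError", ""))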
Grafana allows you to create dashboards to visualize the data stored in Prometheus.
Add Prometheus as a Data Source: In Grafana, navigate to Configuration -> Data Sources -> Add data source. Select Prometheus and enter the URL where your Prometheus server is accessible (e.g., http://prometheus:9090).
Create a Dashboard: Create a new dashboard and add panels.
Request Rate Panel:
rate(http_requests_total{job="diffusion-api", handler="/generate"}[5m])
This shows the per-second rate of requests to the /generate endpoint over the last 5 minutes.

P95 Latency Panel:
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="diffusion-api", handler="/generate"}[5m])) by (le))
This shows the 95th percentile latency for /generate. Adjust the handler label if your endpoint path differs.

Error Rate Panel:
sum(rate(http_requests_total{job="diffusion-api", handler="/generate", status_code=~"5.."}[5m])) / sum(rate(http_requests_total{job="diffusion-api", handler="/generate"}[5m]))
This shows the fraction of 5xx responses for the /generate endpoint. Multiply by 100 and set the unit to '%' if desired.

GPU Utilization Panel (if custom metric exists):
gpu_utilization_percent{job="diffusion-api"}
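If a panel shows no data, the same expressions can be evaluated directly against Prometheus's query API to rule out a Grafana-side issue. A minimal sketch, again assuming Prometheus at localhost:9090, evaluating the P95 latency expression:

# Evaluate the P95 latency expression via the Prometheus HTTP API.
# Assumes Prometheus is reachable at localhost:9090.
import json
import urllib.parse
import urllib.request

QUERY = (
    'histogram_quantile(0.95, sum(rate('
    'http_request_duration_seconds_bucket{job="diffusion-api", handler="/generate"}[5m]'
    ')) by (le))'
)

url = "http://localhost:9090/api/v1/query?" + urllib.parse.urlencode({"query": QUERY})
with urllib.request.urlopen(url) as response:
    payload = json.load(response)

for series in payload["data"]["result"]:
    _timestamp, value = series["value"]
    print(f"P95 latency: {value} seconds")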
Here's an example visualization of P95 latency:
P95 request latency for the /generate endpoint over time, measured in seconds. Monitoring this helps identify performance degradation.
Monitoring is most effective when coupled with alerting. You can configure alerts directly in Grafana or use Prometheus's Alertmanager.
Example Prometheus alert rules (to be placed in a separate rules file):
# alert.rules.yml
groups:
  - name: DiffusionAPIRules
    rules:
      - alert: HighAPIErrorRate
        # Group by job/handler so the annotation templates below can reference these labels
        expr: sum by (job, handler) (rate(http_requests_total{job="diffusion-api", handler="/generate", status_code=~"5.."}[5m])) / sum by (job, handler) (rate(http_requests_total{job="diffusion-api", handler="/generate"}[5m])) > 0.02
        for: 5m # Alert fires if condition is true for 5 minutes
        labels:
          severity: warning
        annotations:
          summary: High error rate detected on diffusion API
          description: 'Job {{ $labels.job }} handler {{ $labels.handler }} has an error rate above 2% for the last 5 minutes.'
      - alert: HighP95Latency
        expr: histogram_quantile(0.95, sum by (job, handler, le) (rate(http_request_duration_seconds_bucket{job="diffusion-api", handler="/generate"}[5m]))) > 15
        for: 10m # Alert fires if P95 latency is above 15s for 10 minutes
        labels:
          severity: critical
        annotations:
          summary: High P95 latency detected on diffusion API
          description: 'Job {{ $labels.job }} handler {{ $labels.handler }} has a P95 latency above 15s for the last 10 minutes (current value: {{ $value }}s).'
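Once Prometheus has reloaded with this rules file (referenced via rule_files in prometheus.yml), you can confirm the rules loaded and check whether any are pending or firing through the /api/v1/rules endpoint. A brief sketch, assuming Prometheus at localhost:9090:

# Confirm the alerting rules loaded and show their current state.
# Assumes Prometheus is reachable at localhost:9090.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:9090/api/v1/rules") as response:
    payload = json.load(response)

for group in payload["data"]["groups"]:
    for rule in group["rules"]:
        if rule["type"] == "alerting":
            print(rule["name"], rule["state"])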
This practical provides a foundational monitoring setup that you can expand considerably for a production system.
By implementing even this basic monitoring, you gain significant visibility into the operational health and performance of your deployed diffusion model, allowing you to diagnose issues, optimize performance, and ensure reliability at scale.