You've now explored the architectural considerations for deploying large-scale Retrieval-Augmented Generation (RAG) systems, including workflow orchestration, microservice design, and MLOps practices. This section provides a hands-on walkthrough of deploying a simplified RAG system on Kubernetes and configuring basic monitoring. This exercise will solidify your understanding of how these components come together in a practical, operational environment.
Before you begin, ensure you have the following tools installed and configured:
- kubectl, configured to communicate with a Kubernetes cluster. You can use Minikube, Kind, k3s, or a managed Kubernetes service from a cloud provider (e.g., EKS, GKE, AKS).
- Helm, which we will use to install Qdrant and the monitoring stack.

We will deploy a RAG system consisting of:
- Qdrant as the vector database
- A Retriever API service
- A Generator API service (an LLM serving component such as TGI)
- A monitoring stack (Prometheus and Grafana) installed via kube-prometheus-stack
It's good practice to deploy your application components into a dedicated Kubernetes namespace.
kubectl create namespace rag-system
All subsequent kubectl commands in this practical should be run with the -n rag-system flag, or you can set your context's default namespace:
kubectl config set-context --current --namespace=rag-system
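If you set the default namespace, you can confirm the change took effect; grepping the minified kubeconfig is one quick way to do so:
kubectl config view --minify | grep namespace: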
We'll use Helm to deploy Qdrant. This simplifies the setup significantly.
Add the Qdrant Helm repository:
helm repo add qdrant https://qdrant.github.io/qdrant-helm
helm repo update
Install Qdrant:
helm install qdrant qdrant/qdrant -n rag-system \
--set persistence.enabled=false \
--set replicas=1 \
--set service.http.servicePort=6333 \
--set service.grpc.servicePort=6334
For this practical, we disable persistence (persistence.enabled=false) and run a single replica for simplicity. In a production environment, you would configure persistence and potentially more replicas. The service ports 6333 (HTTP) and 6334 (gRPC) are standard for Qdrant.
Verify Qdrant is running:
kubectl get pods -n rag-system -l app.kubernetes.io/name=qdrant
You should see a Qdrant pod in a Running state. The service qdrant will be available within the cluster at qdrant.rag-system.svc.cluster.local:6333.
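Optionally, confirm Qdrant is reachable before wiring up the retriever. This check assumes Qdrant's standard HTTP API; the list of collections will simply be empty at this point:
kubectl port-forward svc/qdrant 6333:6333 -n rag-system
# In a second terminal:
curl http://localhost:6333/collections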
In a real project, you would have Dockerfiles for your Retriever and Generator APIs. For instance, a Python-based Retriever API using FastAPI might have a Dockerfile like this:
# Illustrative Dockerfile for a Retriever API
FROM python:3.10-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY ./retriever_app /app/retriever_app
# QDRANT_HOST and QDRANT_PORT are supplied via environment variables at deploy time
CMD ["uvicorn", "retriever_app.main:app", "--host", "0.0.0.0", "--port", "8000"]
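As a point of reference, the application packaged by this Dockerfile could look roughly like the sketch below. It is illustrative only, not a reference implementation: the /search request shape, the documents collection name, the placeholder embed() function, and the use of prometheus-fastapi-instrumentator to serve /metrics are assumptions about your service.
# retriever_app/main.py -- illustrative sketch of the retriever service assumed above
# (fastapi, uvicorn, qdrant-client, and prometheus-fastapi-instrumentator would be
# listed in requirements.txt).
import os

from fastapi import FastAPI
from pydantic import BaseModel
from prometheus_fastapi_instrumentator import Instrumentator
from qdrant_client import QdrantClient

QDRANT_HOST = os.getenv("QDRANT_HOST", "localhost")
QDRANT_PORT = int(os.getenv("QDRANT_PORT", "6333"))
COLLECTION = os.getenv("QDRANT_COLLECTION", "documents")  # assumed collection name

app = FastAPI()
client = QdrantClient(host=QDRANT_HOST, port=QDRANT_PORT)

# Expose request count/latency metrics at /metrics for Prometheus to scrape.
Instrumentator().instrument(app).expose(app)

class SearchRequest(BaseModel):
    query: str
    top_k: int = 5

def embed(text: str) -> list[float]:
    # Placeholder: swap in your real embedding model. The vector size must
    # match the Qdrant collection configuration.
    return [0.0] * 384

@app.get("/health")
def health():
    # Target of the Kubernetes readiness and liveness probes defined later.
    return {"status": "ok"}

@app.post("/search")
def search(req: SearchRequest):
    hits = client.search(
        collection_name=COLLECTION,
        query_vector=embed(req.query),
        limit=req.top_k,
    )
    return {"results": [{"score": h.score, "payload": h.payload} for h in hits]}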
And a Generator API, perhaps using Hugging Face's Text Generation Inference (TGI):
# Illustrative Dockerfile for a Generator API (if not using a pre-built TGI image)
# This would be more complex, involving model downloading and setup.
# For this practical, we might assume a pre-built TGI image or a simpler custom LLM service.
For this hands-on, we'll focus on the Kubernetes manifests and assume you have container images for your retriever and generator services available in a registry (e.g., Docker Hub, GCR, ECR). Let's assume your-repo/retriever-api:latest and your-repo/generator-api:latest. For the generator, you could also use a public TGI image like ghcr.io/huggingface/text-generation-inference:latest if you configure it appropriately.
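If you build these images yourself, a typical flow is to build and push them to your registry. The build context paths and the your-repo prefix below are placeholders for your own setup:
docker build -t your-repo/retriever-api:latest ./retriever
docker push your-repo/retriever-api:latest
docker build -t your-repo/generator-api:latest ./generator
docker push your-repo/generator-api:latest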
Create a file named retriever-deployment.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
name: retriever-api
namespace: rag-system
labels:
app: retriever-api
spec:
replicas: 2 # Start with 2 replicas
selector:
matchLabels:
app: retriever-api
template:
metadata:
labels:
app: retriever-api
annotations:
prometheus.io/scrape: "true" # Enable Prometheus scraping
prometheus.io/port: "8000" # Port your app exposes metrics on
prometheus.io/path: "/metrics" # Path for metrics endpoint
spec:
containers:
- name: retriever-api
image: your-repo/retriever-api:latest # Replace with your actual image
ports:
- containerPort: 8000
env:
- name: QDRANT_HOST
value: "qdrant.rag-system.svc.cluster.local"
- name: QDRANT_PORT
value: "6333"
# Add readiness and liveness probes for deployment
readinessProbe:
httpGet:
path: /health # Assuming a /health endpoint
port: 8000
initialDelaySeconds: 15
periodSeconds: 10
livenessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 30
periodSeconds: 20
resources: # Define resource requests and limits
requests:
memory: "512Mi"
cpu: "250m"
limits:
memory: "1Gi"
cpu: "500m"
---
apiVersion: v1
kind: Service
metadata:
name: retriever-api-svc
namespace: rag-system
labels:
app: retriever-api
spec:
selector:
app: retriever-api
ports:
- protocol: TCP
port: 80
targetPort: 8000
type: ClusterIP # Internal service
This manifest defines:
- A Deployment for the retriever API, specifying replicas, the container image, environment variables for Qdrant's service address, and basic health probes.
- Prometheus scrape annotations on the pod template, pointing Prometheus at the /metrics endpoint.
- A Service of type ClusterIP to expose the retriever API internally within the Kubernetes cluster.

Apply it:
kubectl apply -f retriever-deployment.yaml -n rag-system
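You can watch the rollout and confirm both replicas become ready (readiness depends on your image actually serving the /health endpoint used by the probes):
kubectl rollout status deployment/retriever-api -n rag-system
kubectl get pods -n rag-system -l app=retriever-api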
Create a file named generator-deployment.yaml. This example assumes you are deploying a service like TGI, which typically requires specific arguments for model loading.
apiVersion: apps/v1
kind: Deployment
metadata:
name: generator-api
namespace: rag-system
labels:
app: generator-api
spec:
replicas: 1 # LLMs can be resource-intensive; adjust replicas based on your model and load
selector:
matchLabels:
app: generator-api
template:
metadata:
labels:
app: generator-api
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "80" # TGI default metrics port is often 80, check your LLM server
prometheus.io/path: "/metrics"
spec:
containers:
- name: generator-api
# Example using a generic TGI image. Adjust for your specific LLM serving setup.
image: ghcr.io/huggingface/text-generation-inference:latest # Replace or configure
args:
- "--model-id"
- "mistralai/Mistral-7B-v0.1" # Example model, choose a small one for testing
- "--port"
- "8080" # Application port, TGI uses 80 by default for API if not specified via port
# Add other necessary TGI arguments, e.g., sharding, quantization, if needed.
ports:
- name: http # API port
containerPort: 8080 # Ensure this matches the port TGI listens on
- name: metrics # Prometheus metrics port (TGI might use a different port or need specific config)
containerPort: 80 # TGI often exposes metrics on port 80 by default. Verify.
# Add readiness and liveness probes. For TGI, this could be the /health endpoint.
readinessProbe:
httpGet:
path: /health
port: 8080 # Port TGI uses for health checks
initialDelaySeconds: 60 # Model loading can take time
periodSeconds: 15
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 120
periodSeconds: 30
resources: # LLMs are resource-heavy. Adjust these significantly for production.
requests:
memory: "8Gi" # Example, highly model-dependent
cpu: "2" # Example
limits:
memory: "16Gi" # Example
cpu: "4" # Example
# If your LLM requires GPUs, you'll need to configure node selectors and resource requests for GPUs.
# Example:
# resources:
# limits:
# nvidia.com/gpu: 1
---
apiVersion: v1
kind: Service
metadata:
name: generator-api-svc
namespace: rag-system
labels:
app: generator-api
spec:
selector:
app: generator-api
ports:
- name: http
protocol: TCP
port: 80
targetPort: 8080 # Port where TGI is listening for API requests
type: ClusterIP
This manifest deploys the generator API. Main considerations:
- Replace ghcr.io/huggingface/text-generation-inference:latest and mistralai/Mistral-7B-v0.1 with your chosen LLM serving solution and model, and ensure the args are correct for your setup.
- The resources section needs careful tuning. If using GPUs, ensure your Kubernetes nodes are GPU-enabled and you've specified GPU resources.
- Adjust prometheus.io/port to the port where your generator actually exposes metrics. TGI serves /metrics on its main API port, so with --port 8080 as above the annotation should point at 8080 rather than 80.
- The /health endpoint and port should match your LLM server's configuration. Model loading can take time, so initialDelaySeconds might need to be generous.

Apply it:
kubectl apply -f generator-deployment.yaml -n rag-system
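Model download and loading can take several minutes, so give the rollout a generous timeout and follow the logs to track progress:
kubectl rollout status deployment/generator-api -n rag-system --timeout=15m
kubectl logs deployment/generator-api -n rag-system -f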
A diagram representing the deployed RAG components within Kubernetes:
High-level architecture of the RAG system components deployed on Kubernetes, including interaction with the vector database and potential connections for user requests and monitoring.
We'll use the kube-prometheus-stack Helm chart, which conveniently bundles Prometheus, Grafana, Alertmanager, and various exporters.
Add the Prometheus Community Helm repository:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
Install kube-prometheus-stack:
helm install prometheus prometheus-community/kube-prometheus-stack -n rag-system \
--set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
--set prometheus.prometheusSpec.podMonitorSelectorNilUsesHelmValues=false
The serviceMonitorSelectorNilUsesHelmValues=false and podMonitorSelectorNilUsesHelmValues=false settings allow Prometheus to discover ServiceMonitor and PodMonitor resources even if they are not labeled as part of this Helm release. For finer control in production, you might want to restrict this.
This installation can take a few minutes. Verify the pods:
kubectl get pods -n rag-system -l "release=prometheus"
You should see pods for Prometheus, Grafana, Alertmanager, and node-exporter running.
The annotations we added to our retriever-api and generator-api pods (prometheus.io/scrape: "true", etc.) are a common convention for annotation-based scrape discovery. However, the Prometheus instance deployed by kube-prometheus-stack does not act on these annotations by default; the more robust approach with this stack is to create ServiceMonitor resources.
Example ServiceMonitor for the retriever API (save as retriever-servicemonitor.yaml):
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: retriever-api-sm
namespace: rag-system
labels:
release: prometheus # Matches the Helm release name of kube-prometheus-stack
spec:
selector:
matchLabels:
app: retriever-api # Selects the retriever-api-svc
namespaceSelector:
matchNames:
- rag-system
endpoints:
- port: http # Name of the port in the Service definition (retriever-api-svc)
path: /metrics # Path where metrics are exposed
interval: 15s
Apply it: kubectl apply -f retriever-servicemonitor.yaml -n rag-system. You would create a similar ServiceMonitor for the generator-api-svc, as sketched below. These resources tell the Prometheus deployed by kube-prometheus-stack to scrape metrics from services matching the specified labels.
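A matching ServiceMonitor for the generator might look like the following sketch. It assumes the generator serves /metrics on the same port as its API (as TGI does); if your LLM server uses a separate metrics port, expose that port in the Service and reference it here instead. Save it as generator-servicemonitor.yaml:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: generator-api-sm
  namespace: rag-system
  labels:
    release: prometheus # Matches the Helm release name of kube-prometheus-stack
spec:
  selector:
    matchLabels:
      app: generator-api # Selects the generator-api-svc
  namespaceSelector:
    matchNames:
      - rag-system
  endpoints:
    - port: http # Named port on generator-api-svc (80 -> 8080)
      path: /metrics
      interval: 15s
Apply it with kubectl apply -f generator-servicemonitor.yaml -n rag-system. To confirm both targets are being scraped, you can port-forward the Prometheus UI (with this release name, the service is typically called prometheus-kube-prometheus-prometheus) and check Status -> Targets:
kubectl port-forward svc/prometheus-kube-prometheus-prometheus 9090:9090 -n rag-system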
Access Grafana. The kube-prometheus-stack chart typically creates a Grafana service. Find its type and access method:
kubectl get svc -n rag-system prometheus-grafana
If it's ClusterIP, you can port-forward:
kubectl port-forward svc/prometheus-grafana 3000:80 -n rag-system
Then access Grafana at http://localhost:3000. The default login for Grafana deployed by this chart is often admin / prom-operator. Check the chart's documentation for current defaults if these don't work.
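If the defaults have changed, the admin password is stored in the Grafana secret created by the chart; the secret name follows the Helm release name, so here it is typically prometheus-grafana:
kubectl get secret prometheus-grafana -n rag-system -o jsonpath="{.data.admin-password}" | base64 -d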
Create a new dashboard: click the + icon on the left sidebar, then Dashboard, and click Add new panel. In the Query tab, select Prometheus as the data source (it should be pre-configured). Enter a PromQL query. For example, to see the rate of HTTP requests to your retriever API (assuming your metrics are named appropriately, e.g., http_requests_total):
sum(rate(http_requests_total{job="rag-system/retriever-api-svc", handler!="/metrics"}[5m])) by (handler, method, status_code)
Adjust the job label based on how Prometheus discovers your service. When scraping via the ServiceMonitor above, the job label typically defaults to the Kubernetes service name (retriever-api-svc). Use the "Metrics browser" in Grafana to find available metrics and labels.
A simpler query for retriever pod CPU usage:
sum(rate(container_cpu_usage_seconds_total{namespace="rag-system", pod=~"retriever-api-.*", container="retriever-api"}[5m])) by (pod)
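If your services export a request-duration histogram (for example, http_request_duration_seconds_bucket, the series emitted by the FastAPI instrumentation sketched earlier; the exact metric name depends on your instrumentation), you can also chart P95 latency:
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{namespace="rag-system"}[5m])) by (le, handler))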
Go to the Visualization settings on the right and choose a graph type (e.g., Time series). Give your panel a title (e.g., "Retriever API Request Rate" or "Retriever CPU Usage"), then save the panel and the dashboard.
A panel built from such queries might, for example, chart P95 latency for the Retriever and Generator APIs over a short time period.
To test the end-to-end system, you would typically have an entry point, like an API gateway or a simple frontend application, that orchestrates calls to the retriever-api-svc and generator-api-svc. For this practical, we haven't deployed such a component to keep the exercise focused.
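To make the orchestration concrete, here is a hypothetical sketch of what such an entry point would do when running inside the cluster. The service URLs follow from the Services defined above; the /search and /generate payloads and response shapes mirror the curl examples below and are assumptions about your APIs:
# gateway.py -- illustrative in-cluster orchestration of retriever + generator
import requests

RETRIEVER_URL = "http://retriever-api-svc.rag-system.svc.cluster.local/search"
GENERATOR_URL = "http://generator-api-svc.rag-system.svc.cluster.local/generate"

def answer(query: str) -> str:
    # 1. Fetch supporting passages from the retriever service.
    results = requests.post(RETRIEVER_URL, json={"query": query}, timeout=10).json()["results"]
    context = "\n".join(str(r["payload"]) for r in results)

    # 2. Build a prompt and call the generator (TGI-style request/response).
    prompt = f"Query: {query}\nContext: {context}"
    resp = requests.post(
        GENERATOR_URL,
        json={"inputs": prompt, "parameters": {"max_new_tokens": 100}},
        timeout=60,
    )
    return resp.json()["generated_text"]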
However, you can test individual services using port-forwarding:
Port-forward the retriever-api-svc:
kubectl port-forward svc/retriever-api-svc 8081:80 -n rag-system
Now you can send requests to http://localhost:8081. For example, if your retriever API has a /search endpoint:
# Assuming the retriever expects a JSON payload with a "query" field
curl -X POST http://localhost:8081/search \
-H "Content-Type: application/json" \
-d '{"query": "What are the principles of distributed RAG?"}'
Similarly, port-forward and test the generator-api-svc:
kubectl port-forward svc/generator-api-svc 8082:80 -n rag-system
Then, send a request to the generator's endpoint (e.g., /generate for TGI, or whatever route your LLM server exposes):
# Example for a TGI-like endpoint (adjust payload and endpoint as needed)
curl -X POST http://localhost:8082/generate \
-H "Content-Type: application/json" \
-d '{"inputs": "Query: What are distributed RAG principles?\nContext: Distributed RAG involves...", "parameters": {"max_new_tokens": 100}}'
After sending some test requests, go back to your Grafana dashboard. You should start seeing metrics populate the panels you created, reflecting the activity. For instance, request counts should increase, and latency graphs should show data points.
This practical provides a foundational deployment. For a production-grade system, you would expand on this by:
- Enabling persistence and replication for Qdrant instead of the single, ephemeral replica used here.
- Right-sizing resources for the generator, including GPU scheduling where your model requires it.
- Adding an entry point such as an API gateway or frontend to orchestrate the retrieval and generation calls.
- Restricting Prometheus ServiceMonitor discovery and defining alerting rules via the bundled Alertmanager.
- Managing manifests and releases through the MLOps and orchestration practices covered earlier in the chapter.
This hands-on exercise demonstrates the core mechanics of deploying and monitoring a RAG system on Kubernetes. By building upon these principles, you can operationalize complex, large-scale RAG solutions effectively.