You've now explored the architectural considerations for deploying large-scale Retrieval-Augmented Generation (RAG) systems, including workflow orchestration, microservice design, and MLOps practices. This section provides a hands-on walkthrough of deploying a simplified RAG system on Kubernetes and configuring basic monitoring. This exercise will solidify your understanding of how these components come together in a practical, operational environment.
Before you begin, ensure you have the following tools installed and configured:
- kubectl, configured to communicate with a Kubernetes cluster. You can use Minikube, Kind, k3s, or a managed Kubernetes service from a cloud provider (e.g., EKS, GKE, AKS).
- Helm, which we will use to install Qdrant and the monitoring stack.

We will deploy a RAG system consisting of:
- Qdrant as the vector database
- A Retriever API service
- A Generator API service (an LLM serving component such as TGI)
- A monitoring stack (Prometheus and Grafana) installed via kube-prometheus-stack
It's good practice to deploy your application components into a dedicated Kubernetes namespace.
kubectl create namespace rag-system
All subsequent kubectl commands in this practical should be run with the -n rag-system flag, or you can set your context's default namespace:
kubectl config set-context --current --namespace=rag-system
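If you set the default namespace, you can confirm the change took effect; grepping the minified kubeconfig is one quick way to do so:
kubectl config view --minify | grep namespace: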
We'll use Helm to deploy Qdrant. This simplifies the setup significantly.
Add the Qdrant Helm repository:
helm repo add qdrant https://qdrant.github.io/qdrant-helm
helm repo update
Install Qdrant:
helm install qdrant qdrant/qdrant -n rag-system \
--set persistence.enabled=false \
--set replicas=1 \
--set service.http.servicePort=6333 \
--set service.grpc.servicePort=6334
For this practical, we disable persistence (persistence.enabled=false) and run a single replica for simplicity. In a production environment, you would configure persistence and potentially more replicas. The service ports 6333 (HTTP) and 6334 (gRPC) are standard for Qdrant.
Verify Qdrant is running:
kubectl get pods -n rag-system -l app.kubernetes.io/name=qdrant
You should see a Qdrant pod in a Running state. The service qdrant will be available within the cluster at qdrant.rag-system.svc.cluster.local:6333.
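Optionally, confirm Qdrant is reachable before wiring up the retriever. This check assumes Qdrant's standard HTTP API; the list of collections will simply be empty at this point:
kubectl port-forward svc/qdrant 6333:6333 -n rag-system
# In a second terminal:
curl http://localhost:6333/collections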
In a real project, you would have Dockerfiles for your Retriever and Generator APIs. For instance, a Python-based Retriever API using FastAPI might have a Dockerfile like this:
# Illustrative Dockerfile for a Retriever API
FROM python:3.10-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY ./retriever_app /app/retriever_app
# QDRANT_HOST and QDRANT_PORT are supplied via environment variables at deploy time
CMD ["uvicorn", "retriever_app.main:app", "--host", "0.0.0.0", "--port", "8000"]
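As a point of reference, the application packaged by this Dockerfile could look roughly like the sketch below. It is illustrative only, not a reference implementation: the /search request shape, the documents collection name, the placeholder embed() function, and the use of prometheus-fastapi-instrumentator to serve /metrics are assumptions about your service.
# retriever_app/main.py -- illustrative sketch of the retriever service assumed above
# (fastapi, uvicorn, qdrant-client, and prometheus-fastapi-instrumentator would be
# listed in requirements.txt).
import os

from fastapi import FastAPI
from pydantic import BaseModel
from prometheus_fastapi_instrumentator import Instrumentator
from qdrant_client import QdrantClient

QDRANT_HOST = os.getenv("QDRANT_HOST", "localhost")
QDRANT_PORT = int(os.getenv("QDRANT_PORT", "6333"))
COLLECTION = os.getenv("QDRANT_COLLECTION", "documents")  # assumed collection name

app = FastAPI()
client = QdrantClient(host=QDRANT_HOST, port=QDRANT_PORT)

# Expose request count/latency metrics at /metrics for Prometheus to scrape.
Instrumentator().instrument(app).expose(app)

class SearchRequest(BaseModel):
    query: str
    top_k: int = 5

def embed(text: str) -> list[float]:
    # Placeholder: swap in your real embedding model. The vector size must
    # match the Qdrant collection configuration.
    return [0.0] * 384

@app.get("/health")
def health():
    # Target of the Kubernetes readiness and liveness probes defined later.
    return {"status": "ok"}

@app.post("/search")
def search(req: SearchRequest):
    hits = client.search(
        collection_name=COLLECTION,
        query_vector=embed(req.query),
        limit=req.top_k,
    )
    return {"results": [{"score": h.score, "payload": h.payload} for h in hits]}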
And a Generator API, perhaps using Hugging Face's Text Generation Inference (TGI):
# Illustrative Dockerfile for a Generator API (if not using a pre-built TGI image)
# This would be more complex, involving model downloading and setup.
# For this practical, we might assume a pre-built TGI image or a simpler custom LLM service.
For this hands-on, we'll focus on the Kubernetes manifests and assume you have container images for your retriever and generator services available in a registry (e.g., Docker Hub, GCR, ECR). Let's assume your-repo/retriever-api:latest and your-repo/generator-api:latest. For the generator, you could also use a public TGI image like ghcr.io/huggingface/text-generation-inference:latest if you configure it appropriately.
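If you build these images yourself, a typical flow is to build and push them to your registry. The build context paths and the your-repo prefix below are placeholders for your own setup:
docker build -t your-repo/retriever-api:latest ./retriever
docker push your-repo/retriever-api:latest
docker build -t your-repo/generator-api:latest ./generator
docker push your-repo/generator-api:latest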
Create a file named retriever-deployment.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
name: retriever-api
namespace: rag-system
labels:
app: retriever-api
spec:
replicas: 2 # Start with 2 replicas
selector:
matchLabels:
app: retriever-api
template:
metadata:
labels:
app: retriever-api
annotations:
prometheus.io/scrape: "true" # Enable Prometheus scraping
prometheus.io/port: "8000" # Port your app exposes metrics on
prometheus.io/path: "/metrics" # Path for metrics endpoint
spec:
containers:
- name: retriever-api
image: your-repo/retriever-api:latest # Replace with your actual image
ports:
- containerPort: 8000
env:
- name: QDRANT_HOST
value: "qdrant.rag-system.svc.cluster.local"
- name: QDRANT_PORT
value: "6333"
# Add readiness and liveness probes for deployment
readinessProbe:
httpGet:
path: /health # Assuming a /health endpoint
port: 8000
initialDelaySeconds: 15
periodSeconds: 10
livenessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 30
periodSeconds: 20
resources: # Define resource requests and limits
requests:
memory: "512Mi"
cpu: "250m"
limits:
memory: "1Gi"
cpu: "500m"
---
apiVersion: v1
kind: Service
metadata:
name: retriever-api-svc
namespace: rag-system
labels:
app: retriever-api
spec:
selector:
app: retriever-api
ports:
- protocol: TCP
port: 80
targetPort: 8000
type: ClusterIP # Internal service
This manifest defines:
- A Deployment for the retriever API, specifying replicas, the container image, environment variables for Qdrant's service address, and basic health probes.
- Prometheus scrape annotations on the pod template, pointing Prometheus at the /metrics endpoint.
- A Service of type ClusterIP to expose the retriever API internally within the Kubernetes cluster.

Apply it:
kubectl apply -f retriever-deployment.yaml -n rag-system
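You can watch the rollout and confirm both replicas become ready (readiness depends on your image actually serving the /health endpoint used by the probes):
kubectl rollout status deployment/retriever-api -n rag-system
kubectl get pods -n rag-system -l app=retriever-api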
Create a file named generator-deployment.yaml. This example assumes you are deploying a service like TGI, which typically requires specific arguments for model loading.
apiVersion: apps/v1
kind: Deployment
metadata:
name: generator-api
namespace: rag-system
labels:
app: generator-api
spec:
replicas: 1 # LLMs can be resource-intensive; adjust replicas based on your model and load
selector:
matchLabels:
app: generator-api
template:
metadata:
labels:
app: generator-api
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "80" # TGI default metrics port is often 80, check your LLM server
prometheus.io/path: "/metrics"
spec:
containers:
- name: generator-api
# Example using a generic TGI image. Adjust for your specific LLM serving setup.
image: ghcr.io/huggingface/text-generation-inference:latest # Replace or configure
args:
- "--model-id"
- "mistralai/Mistral-7B-v0.1" # Example model, choose a small one for testing
- "--port"
- "8080" # Application port, TGI uses 80 by default for API if not specified via port
# Add other necessary TGI arguments, e.g., sharding, quantization, if needed.
ports:
- name: http # API port
containerPort: 8080 # Ensure this matches the port TGI listens on
- name: metrics # Prometheus metrics port (TGI might use a different port or need specific config)
containerPort: 80 # TGI often exposes metrics on port 80 by default. Verify.
# Add readiness and liveness probes. For TGI, this could be the /health endpoint.
readinessProbe:
httpGet:
path: /health
port: 8080 # Port TGI uses for health checks
initialDelaySeconds: 60 # Model loading can take time
periodSeconds: 15
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 120
periodSeconds: 30
resources: # LLMs are resource-heavy. Adjust these significantly for production.
requests:
memory: "8Gi" # Example, highly model-dependent
cpu: "2" # Example
limits:
memory: "16Gi" # Example
cpu: "4" # Example
# If your LLM requires GPUs, you'll need to configure node selectors and resource requests for GPUs.
# Example:
# resources:
# limits:
# nvidia.com/gpu: 1
---
apiVersion: v1
kind: Service
metadata:
name: generator-api-svc
namespace: rag-system
labels:
app: generator-api
spec:
selector:
app: generator-api
ports:
- name: http
protocol: TCP
port: 80
targetPort: 8080 # Port where TGI is listening for API requests
type: ClusterIP
This manifest deploys the generator API. Main considerations:
- Replace ghcr.io/huggingface/text-generation-inference:latest and mistralai/Mistral-7B-v0.1 with your chosen LLM serving solution and model, and ensure the args are correct for your setup.
- The resources section needs careful tuning. If using GPUs, ensure your Kubernetes nodes are GPU-enabled and you've specified GPU resources.
- Adjust prometheus.io/port to the port where your generator actually exposes metrics. TGI serves /metrics on its main API port, so with --port 8080 as above the annotation should point at 8080 rather than 80.
- The /health endpoint and port should match your LLM server's configuration. Model loading can take time, so initialDelaySeconds might need to be generous.

Apply it:
kubectl apply -f generator-deployment.yaml -n rag-system
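Model download and loading can take several minutes, so give the rollout a generous timeout and follow the logs to track progress:
kubectl rollout status deployment/generator-api -n rag-system --timeout=15m
kubectl logs deployment/generator-api -n rag-system -f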
A diagram representing the deployed RAG components within Kubernetes:
High-level architecture of the RAG system components deployed on Kubernetes, including interaction with the vector database and potential connections for user requests and monitoring.
We'll use the kube-prometheus-stack Helm chart, which conveniently bundles Prometheus, Grafana, Alertmanager, and various exporters.
Add the Prometheus Community Helm repository:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
Install kube-prometheus-stack:
helm install prometheus prometheus-community/kube-prometheus-stack -n rag-system \
--set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
--set prometheus.prometheusSpec.podMonitorSelectorNilUsesHelmValues=false
The serviceMonitorSelectorNilUsesHelmValues=false and podMonitorSelectorNilUsesHelmValues=false settings allow Prometheus to discover ServiceMonitor and PodMonitor resources even if they are not labeled as part of this Helm release. For finer control in production, you might want to restrict this.
This installation can take a few minutes. Verify the pods:
kubectl get pods -n rag-system -l "release=prometheus"
You should see pods for Prometheus, Grafana, Alertmanager, and node-exporter running.
The annotations we added to our retriever-api and generator-api pods (prometheus.io/scrape: "true", etc.) are a common convention for annotation-based scrape discovery. However, the Prometheus instance deployed by kube-prometheus-stack does not act on these annotations by default; the more robust approach with this stack is to create ServiceMonitor resources.
Example ServiceMonitor for the retriever API (save as retriever-servicemonitor.yaml):
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: retriever-api-sm
namespace: rag-system
labels:
release: prometheus # Matches the Helm release name of kube-prometheus-stack
spec:
selector:
matchLabels:
app: retriever-api # Selects the retriever-api-svc
namespaceSelector:
matchNames:
- rag-system
endpoints:
- port: http # Name of the port in the Service definition (retriever-api-svc)
path: /metrics # Path where metrics are exposed
interval: 15s
Apply it: kubectl apply -f retriever-servicemonitor.yaml -n rag-system. You would create a similar ServiceMonitor for the generator-api-svc, as sketched below. These resources tell the Prometheus deployed by kube-prometheus-stack to scrape metrics from services matching the specified labels.
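A matching ServiceMonitor for the generator might look like the following sketch. It assumes the generator serves /metrics on the same port as its API (as TGI does); if your LLM server uses a separate metrics port, expose that port in the Service and reference it here instead. Save it as generator-servicemonitor.yaml:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: generator-api-sm
  namespace: rag-system
  labels:
    release: prometheus # Matches the Helm release name of kube-prometheus-stack
spec:
  selector:
    matchLabels:
      app: generator-api # Selects the generator-api-svc
  namespaceSelector:
    matchNames:
      - rag-system
  endpoints:
    - port: http # Named port on generator-api-svc (80 -> 8080)
      path: /metrics
      interval: 15s
Apply it with kubectl apply -f generator-servicemonitor.yaml -n rag-system. To confirm both targets are being scraped, you can port-forward the Prometheus UI (with this release name, the service is typically called prometheus-kube-prometheus-prometheus) and check Status -> Targets:
kubectl port-forward svc/prometheus-kube-prometheus-prometheus 9090:9090 -n rag-system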
Access Grafana. The kube-prometheus-stack chart typically creates a Grafana service. Find its type and access method:
kubectl get svc -n rag-system prometheus-grafana
If it's ClusterIP, you can port-forward:
kubectl port-forward svc/prometheus-grafana 3000:80 -n rag-system
Then access Grafana at http://localhost:3000. The default login for Grafana deployed by this chart is often admin / prom-operator. Check the chart's documentation for current defaults if these don't work.
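If the defaults have changed, the admin password is stored in the Grafana secret created by the chart; the secret name follows the Helm release name, so here it is typically prometheus-grafana:
kubectl get secret prometheus-grafana -n rag-system -o jsonpath="{.data.admin-password}" | base64 -d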
Create a new dashboard: click the + icon on the left sidebar, then Dashboard, and click Add new panel. In the Query tab, select Prometheus as the data source (it should be pre-configured). Enter a PromQL query. For example, to see the rate of HTTP requests to your retriever API (assuming your metrics are named appropriately, e.g., http_requests_total):
sum(rate(http_requests_total{job="rag-system/retriever-api-svc", handler!="/metrics"}[5m])) by (handler, method, status_code)
Adjust the job label based on how Prometheus discovers your service. When scraping via the ServiceMonitor above, the job label typically defaults to the Kubernetes service name (retriever-api-svc). Use the "Metrics browser" in Grafana to find available metrics and labels.
A simpler query for retriever pod CPU usage:
sum(rate(container_cpu_usage_seconds_total{namespace="rag-system", pod=~"retriever-api-.*", container="retriever-api"}[5m])) by (pod)
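If your services export a request-duration histogram (for example, http_request_duration_seconds_bucket, the series emitted by the FastAPI instrumentation sketched earlier; the exact metric name depends on your instrumentation), you can also chart P95 latency:
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{namespace="rag-system"}[5m])) by (le, handler))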
Go to the Visualization settings on the right and choose a graph type (e.g., Time series). Give your panel a title (e.g., "Retriever API Request Rate" or "Retriever CPU Usage"), then save the panel and the dashboard.
A panel built from such queries might, for example, chart P95 latency for the Retriever and Generator APIs over a short time period.
To test the end-to-end system, you would typically have an entry point, like an API gateway or a simple frontend application, that orchestrates calls to the retriever-api-svc and generator-api-svc. For this practical, we haven't deployed such a component to keep the exercise focused.
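To make the orchestration concrete, here is a hypothetical sketch of what such an entry point would do when running inside the cluster. The service URLs follow from the Services defined above; the /search and /generate payloads and response shapes mirror the curl examples below and are assumptions about your APIs:
# gateway.py -- illustrative in-cluster orchestration of retriever + generator
import requests

RETRIEVER_URL = "http://retriever-api-svc.rag-system.svc.cluster.local/search"
GENERATOR_URL = "http://generator-api-svc.rag-system.svc.cluster.local/generate"

def answer(query: str) -> str:
    # 1. Fetch supporting passages from the retriever service.
    results = requests.post(RETRIEVER_URL, json={"query": query}, timeout=10).json()["results"]
    context = "\n".join(str(r["payload"]) for r in results)

    # 2. Build a prompt and call the generator (TGI-style request/response).
    prompt = f"Query: {query}\nContext: {context}"
    resp = requests.post(
        GENERATOR_URL,
        json={"inputs": prompt, "parameters": {"max_new_tokens": 100}},
        timeout=60,
    )
    return resp.json()["generated_text"]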
However, you can test individual services using port-forwarding:
Port-forward the retriever-api-svc:
kubectl port-forward svc/retriever-api-svc 8081:80 -n rag-system
Now you can send requests to http://localhost:8081. For example, if your retriever API has a /search endpoint:
# Assuming the retriever expects a JSON payload with a "query" field
curl -X POST http://localhost:8081/search \
-H "Content-Type: application/json" \
-d '{"query": "What are the principles of distributed RAG?"}'
Similarly, port-forward and test the generator-api-svc:
kubectl port-forward svc/generator-api-svc 8082:80 -n rag-system
Then, send a request to the generator's endpoint (e.g., /generate for TGI, or whatever route your LLM server exposes):
# Example for a TGI-like endpoint (adjust payload and endpoint as needed)
curl -X POST http://localhost:8082/generate \
-H "Content-Type: application/json" \
-d '{"inputs": "Query: What are distributed RAG principles?\nContext: Distributed RAG involves...", "parameters": {"max_new_tokens": 100}}'
After sending some test requests, go back to your Grafana dashboard. You should start seeing metrics populate the panels you created, reflecting the activity. For instance, request counts should increase, and latency graphs should show data points.
This practical provides a foundational deployment. For a production-grade system, you would expand on this by:
- Enabling persistence and replication for Qdrant instead of the single, ephemeral replica used here.
- Right-sizing resources for the generator, including GPU scheduling where your model requires it.
- Adding an entry point such as an API gateway or frontend to orchestrate the retrieval and generation calls.
- Restricting Prometheus ServiceMonitor discovery and defining alerting rules via the bundled Alertmanager.
- Managing manifests and releases through the MLOps and orchestration practices covered earlier in the chapter.
This hands-on exercise demonstrates the core mechanics of deploying and monitoring a RAG system on Kubernetes. By building upon these principles, you can operationalize complex, large-scale RAG solutions effectively.