Now that we have explored the components of a scalable infrastructure, let's put theory into practice. This section guides you through deploying the containerized diffusion model application (which we assume you have built and pushed to a container registry) onto a Kubernetes cluster configured with GPU support. This exercise simulates a real-world deployment scenario, bringing together containerization, orchestration, and hardware acceleration.
Before proceeding, ensure you have the following ready:
- A Kubernetes cluster with GPU-equipped nodes and the NVIDIA device plugin installed (the plugin is typically deployed in the kube-system namespace).
- kubectl: The Kubernetes command-line tool, configured to communicate with your cluster.
- Your containerized diffusion model application image, built and pushed to a container registry (e.g., your-registry/your-diffusion-app:v1.0).
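If you want to confirm that the device plugin has registered GPUs with the cluster before deploying, one quick check (assuming the standard nvidia.com/gpu resource name) is to list each node's allocatable GPU count:
# List each node and its allocatable NVIDIA GPU count
# (the GPU column shows <none> until the device plugin is running on that node)
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'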
A Kubernetes Deployment manages stateless applications, ensuring a specified number of replicas (Pods) are running. We need to define a Deployment that tells Kubernetes how to run our diffusion model container, specifically requesting GPU resources.
Create a file named diffusion-deployment.yaml with the following content. Remember to replace your-registry/your-diffusion-app:v1.0 with the actual URL of your container image.
# diffusion-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: diffusion-deployment
  labels:
    app: diffusion-service
spec:
  replicas: 1 # Start with one replica
  selector:
    matchLabels:
      app: diffusion-service
  template:
    metadata:
      labels:
        app: diffusion-service
    spec:
      containers:
      - name: diffusion-container
        image: your-registry/your-diffusion-app:v1.0 # <-- Replace with your image URL
        ports:
        - containerPort: 8080 # Port your application listens on inside the container
        resources:
          limits:
            nvidia.com/gpu: 1 # Request exactly 1 GPU
          requests:
            nvidia.com/gpu: 1 # Request exactly 1 GPU
        # Optional: Add environment variables if your app needs them
        # env:
        # - name: MODEL_PATH
        #   value: "/models/stable-diffusion-v1-5"
      # Optional: Add tolerations if your GPU nodes have specific taints
      # tolerations:
      # - key: "nvidia.com/gpu"
      #   operator: "Exists"
      #   effect: "NoSchedule"
Key points in this manifest:
- kind: Deployment: Specifies the object type.
- metadata.name: Names the Deployment resource.
- spec.replicas: Defines the desired number of running Pods. We start with 1; autoscaling (discussed previously) can adjust this dynamically later.
- spec.selector.matchLabels: Connects the Deployment to the Pods it manages using labels.
- spec.template.metadata.labels: Assigns labels to the Pods created by this Deployment.
- spec.template.spec.containers: Defines the container(s) to run within the Pod.
- image: The container image to pull. Ensure this is correct.
- ports.containerPort: The port your application listens on inside the container. This doesn't expose it externally yet.
- resources.limits and resources.requests: This is where we request GPU resources. nvidia.com/gpu: 1 tells Kubernetes to schedule this Pod only on nodes with at least one available NVIDIA GPU and to allocate one GPU to this container. The exact resource name (nvidia.com/gpu) depends on the device plugin installation.
- tolerations (optional): If your GPU nodes are "tainted" to prevent non-GPU workloads from being scheduled on them, you might need to add matching tolerations to the Pod spec so the Pods can be scheduled there; a quick way to check for such taints is shown after this list.
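To see whether your GPU nodes carry such a taint, you can inspect one of them directly (the node name here is a placeholder; substitute one of your own):
# Show any taints on a GPU node; a taint like nvidia.com/gpu=present:NoSchedule
# would require a matching toleration in the Pod spec
kubectl describe node <gpu-node-name> | grep -i taints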
A Deployment runs your Pods, but we need a way to access them reliably, especially if Pods are recreated or scaled. A Kubernetes Service provides a stable network endpoint (IP address and DNS name) for accessing the Pods managed by a Deployment.
Create a file named diffusion-service.yaml:
# diffusion-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: diffusion-loadbalancer
spec:
  selector:
    app: diffusion-service # Must match the Pod labels defined in the Deployment
  ports:
  - protocol: TCP
    port: 80 # Port the Service will be accessible on externally (or internally)
    targetPort: 8080 # Port the container listens on (from deployment.yaml)
  type: LoadBalancer # Use LoadBalancer for cloud providers (or NodePort for local/manual setup)
Key points in this manifest:
- kind: Service: Specifies the object type.
- metadata.name: Names the Service resource.
- spec.selector: Selects the Pods this Service routes traffic to based on their labels. It must match the app: diffusion-service label in our Deployment's Pod template.
- spec.ports: Defines the port mapping. Traffic arriving at the Service on port: 80 will be forwarded to the container's targetPort: 8080.
- spec.type: LoadBalancer: This type is common for cloud providers (AWS, GCP, Azure). It automatically provisions a cloud load balancer that directs external traffic to the Service. If you're running locally (like Minikube) or need manual external configuration, you might use type: NodePort instead, which exposes the Service on a specific port on each node.
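Before creating anything, you can optionally ask kubectl to validate both manifests with a client-side dry run, which checks the YAML without persisting any resources:
# Validate the manifests without creating any resources
kubectl apply --dry-run=client -f diffusion-deployment.yaml -f diffusion-service.yaml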
Now, apply these configuration files to your cluster using kubectl:
# Apply the Deployment configuration
kubectl apply -f diffusion-deployment.yaml
# Apply the Service configuration
kubectl apply -f diffusion-service.yaml
Kubernetes will start creating the resources defined in the YAML files.
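If you prefer a single command that blocks until the Deployment's Pods are ready (or fails after a timeout), kubectl can watch the rollout for you:
# Wait for the Deployment to finish rolling out (up to 5 minutes)
kubectl rollout status deployment/diffusion-deployment --timeout=300s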
Check the status of your Deployment and Pods:
# Check if the deployment is progressing
kubectl get deployments diffusion-deployment
# Expected output (might take a moment):
# NAME READY UP-TO-DATE AVAILABLE AGE
# diffusion-deployment 1/1 1 1 30s
# Check the Pod status (it needs to pull the image and start)
kubectl get pods -l app=diffusion-service
# Expected output (STATUS should become Running):
# NAME READY STATUS RESTARTS AGE
# diffusion-deployment-xxxxxxxxxx-yyyyy 1/1 Running 0 1m
# If the Pod is stuck in Pending or has errors, investigate:
# kubectl describe pod <pod-name-from-above>
# Look for Events related to scheduling (GPU availability) or image pulling.
Check the status of your Service and find its external IP (if using LoadBalancer):
# Check the service status
kubectl get services diffusion-loadbalancer
# Expected output (EXTERNAL-IP might be <pending> initially):
# NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
# diffusion-loadbalancer LoadBalancer 10.x.x.x <EXTERNAL_IP> 80:3xxxx/TCP 2m
It might take a few minutes for the cloud provider to provision the load balancer and assign an external IP address. Keep checking until the EXTERNAL-IP column is populated. If you used type: NodePort, the PORT(S) column will show a mapping like 80:<NodePort>/TCP.
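Rather than re-reading the table by hand, you can also extract the external IP with a JSONPath query once it has been assigned (some providers, such as AWS, report a hostname field instead of ip):
# Print just the external IP of the LoadBalancer Service
kubectl get service diffusion-loadbalancer -o jsonpath='{.status.loadBalancer.ingress[0].ip}'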
At this point, the Kubernetes Deployment manages a Pod, which runs the containerized diffusion model application utilizing a GPU, while a LoadBalancer Service exposes the application externally, routing user requests to the Pod.
Once the EXTERNAL-IP is available for your diffusion-loadbalancer Service (and the Pod is Running), you can send an inference request. Assuming your application inside the container exposes a /generate endpoint on port 8080 that accepts JSON prompts via POST requests, you can test it using curl (replace <EXTERNAL_IP> with the actual IP address from kubectl get services):
# Replace <EXTERNAL_IP> with the actual External IP of your service
export SERVICE_IP=<EXTERNAL_IP>
# Send a sample request (adjust endpoint and JSON payload as needed)
curl -X POST http://${SERVICE_IP}:80/generate \
-H "Content-Type: application/json" \
-d '{"prompt": "a serene landscape with a flowing river", "steps": 30}' \
--output generated_image.png # Save the output if it returns an image directly
# Or if it returns JSON with image URL/data:
# curl -X POST http://${SERVICE_IP}:80/generate -H "Content-Type: application/json" -d '{"prompt": "..."}'
Monitor the response. A successful request should return the generated image or relevant data, depending on your API design. If you encounter errors, check the Pod logs:
# Get the pod name first
kubectl get pods -l app=diffusion-service
# Tail the logs of the pod
kubectl logs -f <pod-name-from-above>
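If the container crashed and was restarted, the live logs may not show the original failure; the logs of the previous container instance can be retrieved with the --previous flag:
# Show logs from the previously terminated container in the Pod
kubectl logs <pod-name-from-above> --previous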
To confirm the application is using the GPU inside the Pod, you can execute nvidia-smi within the container:
# Get the pod name
POD_NAME=$(kubectl get pods -l app=diffusion-service -o jsonpath='{.items[0].metadata.name}')
# Execute nvidia-smi inside the pod's container
kubectl exec $POD_NAME -- nvidia-smi
If the command runs successfully and shows GPU utilization details (especially when processing a request), your deployment is correctly configured to use hardware acceleration.
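If your application image is built on PyTorch (an assumption about your stack; adapt the check for other frameworks), you can also confirm GPU visibility from inside the Python environment:
# Assumes the image ships Python with PyTorch installed
kubectl exec $POD_NAME -- python -c 'import torch; print(torch.cuda.is_available(), torch.cuda.device_count())'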
To avoid incurring costs, delete the Kubernetes resources when you are finished experimenting:
# Delete the Service (this will deprovision the cloud load balancer)
kubectl delete service diffusion-loadbalancer
# Delete the Deployment (this will terminate the Pods)
kubectl delete deployment diffusion-deployment
# Verify deletion
kubectl get deployments,services,pods -l app=diffusion-service
# Should return "No resources found"
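Equivalently, because both resources were created from manifest files, you can delete them by pointing kubectl at the same files:
# Delete the Deployment and Service using the original manifests
kubectl delete -f diffusion-deployment.yaml -f diffusion-service.yaml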
This practical exercise demonstrated the core steps of deploying a GPU-accelerated diffusion model service on Kubernetes. You created Deployment and Service manifests, requested specific GPU resources, applied the configurations, verified the setup, and tested the endpoint. This forms the foundation for building more complex, scalable, and resilient deployment architectures, which we will explore further. Remember that production systems often require additional configurations for monitoring, logging, autoscaling, and robust error handling.