Kubernetes is a system for managing containerized applications. To define and run machine learning workloads on it, you need to understand its fundamental objects. Although Kubernetes includes many components, your interaction will primarily involve three objects: Pods, Deployments, and Services. These three objects provide the foundation for building scalable and resilient machine learning systems on the platform.
In the Kubernetes environment, the smallest and most basic deployable object is not a container, but a Pod. A Pod represents a single instance of a running process in your cluster and encapsulates one or more containers. It also includes shared storage and network resources for those containers.
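To make this concrete, here is a minimal sketch of a standalone Pod manifest. The file name and Pod name are illustrative placeholders; the container image reuses the example from the Deployment manifest later in this section.

# pod.yaml (illustrative sketch; in practice you rarely create bare Pods)
apiVersion: v1
kind: Pod
metadata:
  name: model-server-pod
  labels:
    app: inference-api
spec:
  containers:
  - name: model-server
    image: your-repo/your-model-server:v1.2
    ports:
    - containerPort: 8000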
While you can run multiple containers inside a single Pod, the common practice is to have one main application container per Pod. This "one-process-per-container" approach keeps your architecture clean. So, when would you use multiple containers? A typical use case in ML is the "sidecar" pattern. Imagine you have a container serving your model. You can add a second, "sidecar" container to the same Pod that handles a secondary task, like fetching model updates from cloud storage or shipping logs to a central collector. Because containers in a Pod share the same network namespace and can share storage volumes, they can communicate with each other efficiently.
A Pod containing a primary application container and a sidecar for a supporting task.
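As a rough sketch of this pattern, the manifest below pairs a model-serving container with a log-shipping sidecar that share an emptyDir volume. The file name, Pod name, and sidecar image are hypothetical and only meant to show the shape of a multi-container Pod.

# pod-with-sidecar.yaml (illustrative sketch, not a production manifest)
apiVersion: v1
kind: Pod
metadata:
  name: inference-with-sidecar
spec:
  volumes:
  - name: shared-logs              # volume shared by both containers
    emptyDir: {}
  containers:
  - name: model-server             # primary container serving the model
    image: your-repo/your-model-server:v1.2
    ports:
    - containerPort: 8000
    volumeMounts:
    - name: shared-logs
      mountPath: /var/log/app
  - name: log-shipper              # sidecar that ships logs to a central collector
    image: your-repo/log-shipper:latest
    volumeMounts:
    - name: shared-logs
      mountPath: /var/log/app
      readOnly: true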
An important characteristic of Pods is that they are considered ephemeral. They can be terminated due to a node failure, resource constraints, or during an application update. When a Pod is destroyed, it is gone for good, along with its unique IP address. This transient nature means you should never create and manage individual Pods directly for any application that needs to be reliable. Instead, you use a higher-level controller, like a Deployment, to manage them for you.
A Deployment is a controller that provides declarative updates for Pods. You describe the desired state in a Deployment object, and the Deployment controller works to change the actual state to the desired state at a controlled rate.
Its primary functions are:

- Replication: keeping a specified number of identical Pod replicas running at all times.
- Self-healing: replacing Pods that crash or are lost when a node fails.
- Rolling updates: gradually moving Pods to a new container image version without downtime.
- Rollback: reverting to a previous revision if an update misbehaves.
Under the hood, a Deployment manages a ReplicaSet, which is the object responsible for maintaining the specified number of replicas. You almost never interact with ReplicaSets directly, as Deployments orchestrate them to handle versioning and updates.
Here is a simplified example of a Deployment manifest in YAML. This file declaratively defines a desired state: three replicas of a model-serving application.
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-api-deployment
spec:
  replicas: 3                  # 1. We want 3 identical Pods running
  selector:
    matchLabels:
      app: inference-api       # 2. The Deployment finds Pods with this label
  template:                    # 3. This is the blueprint for the Pods it creates
    metadata:
      labels:
        app: inference-api     # 4. The Pods get this label
    spec:
      containers:
      - name: model-server
        image: your-repo/your-model-server:v1.2   # 5. The container image to run
        ports:
        - containerPort: 8000
- replicas: Tells the Deployment to maintain three running instances of our Pod.
- selector: Defines how the Deployment finds which Pods to manage. It looks for Pods with a label that matches app: inference-api.
- template: A blueprint for the Pods that the Deployment will create. It contains its own metadata and spec.
- template.metadata.labels: The labels applied to each Pod created by this Deployment. The selector must match these labels.
- template.spec.containers: The list of containers to run inside the Pod. Here, we define a single container named model-server using a specific image version.

We've established that Deployments manage a set of identical, yet ephemeral, Pods. Each Pod has its own internal IP address that can change whenever it is recreated. This presents a problem: how can other applications, or external users, reliably connect to your model-serving Pods if their addresses are constantly changing?
This is where the Service object comes in. A Service provides a stable, abstract endpoint for a set of Pods. It gets a persistent IP address and a DNS name within the cluster. When traffic is sent to the Service, it acts as an internal load balancer, automatically routing the request to one of the healthy Pods that it targets. A Service finds its target Pods using the same label and selector mechanism as a Deployment.
Kubernetes offers several types of Services, but for ML applications, you will most often use these two:
- ClusterIP: The default Service type. It exposes the Service on an internal IP that is only reachable from within the cluster, which makes it the right choice for internal traffic, such as another application or a training job calling your inference Pods.
- LoadBalancer: When you create a LoadBalancer Service on a cloud platform like AWS, GCP, or Azure, Kubernetes will automatically provision an external load balancer and configure it to route external internet traffic to your Service's Pods. This is the standard way to expose a production inference API to the public.

The diagram below illustrates how these components work together. A Deployment ensures three Pods are running. A LoadBalancer Service provides a single, stable entry point for external traffic and distributes requests among the Pods. A sketch of a matching Service manifest follows the diagram.
An external request is sent to a stable LoadBalancer IP, which the Service routes to one of the available Pods matching its label selector. The Deployment ensures the desired number of Pods are always running.
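As a rough sketch under the assumptions of the Deployment example above, a matching LoadBalancer Service manifest could look like the following. The Service name and port numbers are illustrative; the selector must match the app: inference-api label from the Pod template.

# service.yaml (illustrative sketch)
apiVersion: v1
kind: Service
metadata:
  name: inference-api-service
spec:
  type: LoadBalancer             # ask the cloud provider for an external load balancer
  selector:
    app: inference-api           # targets the Pods created by the Deployment above
  ports:
  - protocol: TCP
    port: 80                     # port exposed by the Service
    targetPort: 8000             # containerPort on the model-server Pods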
Together, Deployments and Services form the backbone of a scalable application on Kubernetes. The Deployment handles the lifecycle and availability of your model-serving Pods, while the Service provides a stable network endpoint to make them accessible.