The immense computational demand of training and large-scale inference often makes cloud compute the largest line item in an AI budget. Spot and preemptible instances offer a direct way to reduce it, providing access to unused cloud capacity at discounts of up to 90% compared to on-demand pricing. This cost advantage, however, comes with a significant operational caveat: the cloud provider can reclaim these resources with very little warning. A platform must not only tolerate these interruptions but be architected to handle them gracefully, turning a potential failure into a routine operational event.
Preemption is the forceful termination of a spot instance when the cloud provider needs the capacity back. For AWS, this is a Spot Instance interruption; for GCP, it's a preemptible VM termination; and for Azure, it's an eviction. The instance receives a short notification, typically between 30 seconds and two minutes, before it is shut down. For a long-running training job, an unexpected termination can mean hours of lost work and wasted spend if the job's progress is not preserved.
The key to using spot instances effectively is to build systems that can withstand this unpredictability. The goal is to make job interruption and rescheduling a standard, automated procedure rather than a catastrophic failure.
The lifecycle of a pod on a spot instance, showing the graceful shutdown process triggered by a preemption notice.
Not all workloads are created equal in their tolerance for interruption. The decision to use spot instances must be based on the application's fault-tolerance characteristics.
Ideal Candidates for Spot Instances:
- Training jobs that checkpoint regularly and can resume from the last checkpoint after an interruption.
- Batch and offline work such as data preprocessing, batch inference, and hyperparameter sweeps, where individual tasks can simply be retried.
- Stateless, horizontally scaled workers managed by a queue or job controller.
Poor Candidates for Spot Instances:
- Latency-sensitive online inference endpoints with strict availability requirements.
- Stateful services such as databases, model registries, and experiment trackers.
- Cluster control-plane components and other critical add-ons.
- Long-running jobs that cannot checkpoint or resume.
To reliably run workloads on spot instances, you must implement specific architectural patterns within Kubernetes.
Kubernetes provides a preStop lifecycle hook that runs just before a container is terminated. This is the primary mechanism for reacting to a preemption notice. Depending on the platform, the notice is translated into a node drain either by the kubelet's graceful node shutdown handling or by a dedicated component such as a node termination handler or the autoscaler's interruption handling. Either way, the pods on that node are then terminated, which in turn triggers the preStop hook in each container.
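On AWS, for example, the open source aws-node-termination-handler can run as a DaemonSet that watches the instance metadata service for interruption notices and cordons and drains the affected node, which is what causes the pods, and their preStop hooks, to run through a graceful shutdown. The sketch below is illustrative only: it omits the ServiceAccount and RBAC the handler needs, and the image tag and setting names should be verified against the project's documentation (production installs typically use the official Helm chart).

# Illustrative sketch of aws-node-termination-handler in IMDS mode.
# ServiceAccount/RBAC omitted; verify the image tag and settings against
# the project documentation before use.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: aws-node-termination-handler
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: aws-node-termination-handler
  template:
    metadata:
      labels:
        app: aws-node-termination-handler
    spec:
      serviceAccountName: aws-node-termination-handler
      containers:
      - name: handler
        image: public.ecr.aws/aws-ec2/aws-node-termination-handler:v1.22.0  # illustrative tag
        env:
        # Name of the node this pod runs on, so the handler knows what to drain.
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        # Cordon and drain the node when a spot interruption notice appears.
        - name: ENABLE_SPOT_INTERRUPTION_DRAINING
          value: "true"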
Your training application's container should use this hook to trigger a final, rapid checkpoint to persistent storage (like S3 or a PVC) and exit cleanly.
apiVersion: v1
kind: Pod
metadata:
  name: resilient-training-pod
spec:
  # Give the preStop checkpoint time to finish before the kubelet force-kills
  # the container; keep this within the provider's preemption notice window.
  terminationGracePeriodSeconds: 60
  containers:
  - name: trainer
    image: my-pytorch-trainer:1.0
    command: ["python", "train.py", "--resume-from-checkpoint"]
    lifecycle:
      preStop:
        exec:
          # This script saves the model state to persistent storage
          # before the container is terminated.
          command:
          - "/bin/sh"
          - "-c"
          - "python /app/save_checkpoint.py --path /mnt/checkpoints/final_checkpoint.pt"
    volumeMounts:
    - name: checkpoint-storage
      mountPath: /mnt/checkpoints
  volumes:
  - name: checkpoint-storage
    persistentVolumeClaim:
      claimName: training-pvc
A common and effective strategy is to run a hybrid cluster composed of both on-demand and spot instance node groups:
- On-demand node group: runs critical services that must not be interrupted. These nodes carry a taint (for example, CriticalAddonsOnly=true:NoSchedule) to prevent regular workloads from being scheduled on them.
- Spot node group: runs interruptible workloads such as training jobs. These nodes carry a taint (for example, spot=true:NoExecute) to signal their transient nature.
Workload pods must then have the corresponding toleration to be scheduled on the spot nodes. This ensures that only fault-tolerant applications run on the volatile, low-cost compute.
A Kubernetes cluster architecture isolating critical services on stable on-demand nodes while running interruptible training jobs on a cost-effective spot node group using taints and tolerations.
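The following sketch shows how these taints and labels might appear on the node objects themselves. In practice they are applied by your node group or autoscaler configuration rather than written by hand; the node names are hypothetical, and the lifecycle=spot label is an assumed convention that the affinity rules in the next example match on.

# Illustrative node objects: taints and labels are normally set by the
# node group or autoscaler configuration, not applied by hand.
apiVersion: v1
kind: Node
metadata:
  name: on-demand-node-1   # hypothetical node name
spec:
  taints:
  # Keeps regular workloads off the stable, on-demand nodes.
  - key: CriticalAddonsOnly
    value: "true"
    effect: NoSchedule
---
apiVersion: v1
kind: Node
metadata:
  name: spot-node-1        # hypothetical node name
  labels:
    # Label matched by the node affinity rules in the pod spec below.
    lifecycle: spot
spec:
  taints:
  # Blocks (and evicts) any pod that does not explicitly tolerate spot nodes.
  - key: spot
    value: "true"
    effect: NoExecute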
To implement this, you combine taints, tolerations, and node affinity in your pod specifications.
# Part of a Pod or Deployment specification
spec:
  # This toleration lets the pod be scheduled on, and remain on,
  # nodes carrying the spot=true:NoExecute taint.
  tolerations:
  - key: "spot"
    operator: "Equal"
    value: "true"
    effect: "NoExecute"
  affinity:
    nodeAffinity:
      # This prefers scheduling on nodes labeled as spot instances.
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: lifecycle
            operator: In
            values:
            - spot
      # This prevents scheduling on nodes reserved for critical pods
      # (assumes the on-demand nodes also carry a CriticalAddonsOnly label).
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: CriticalAddonsOnly
            operator: DoesNotExist
The probability of a spot instance being preempted varies by instance type, region, and availability zone. Relying on a single instance type makes your workloads vulnerable to price spikes or capacity shortages for that specific type.
Modern autoscalers like Karpenter or cloud-native configurations like AWS EC2 Fleet should be configured to select from a diverse list of instance types. For example, if your job needs a GPU with 16GB of memory, you can specify a range of suitable instance types (e.g., g4dn.xlarge, g5.xlarge, p3.2xlarge). The autoscaler will then provision whichever type is available at the lowest price, dramatically increasing the likelihood of acquiring capacity and improving resilience.
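As a sketch of what this looks like with Karpenter, the NodePool below restricts provisioning to spot capacity, allows any of the three GPU instance types mentioned above, and applies the spot taint from the hybrid-cluster pattern. The exact schema depends on your Karpenter version (this assumes the karpenter.sh/v1 API), and the referenced EC2NodeClass named default is assumed to exist.

# Sketch of a Karpenter NodePool that diversifies across GPU instance types
# while restricting itself to spot capacity. Verify field names against the
# Karpenter version you run; the "default" EC2NodeClass is assumed to exist.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-spot
spec:
  template:
    metadata:
      labels:
        lifecycle: spot          # matches the node affinity rule above
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      taints:
      - key: spot
        value: "true"
        effect: NoExecute
      requirements:
      # Only request spot capacity.
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["spot"]
      # Let Karpenter pick whichever of these GPU types is available and cheapest.
      - key: node.kubernetes.io/instance-type
        operator: In
        values: ["g4dn.xlarge", "g5.xlarge", "p3.2xlarge"]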
By combining graceful shutdown mechanisms, a hybrid cluster design, and intelligent instance selection, you can use the significant cost savings of spot instances without compromising the reliability of your MLOps platform. This operational discipline is a defining characteristic of a mature and cost-efficient AI infrastructure.