The immense computational demand of training and large-scale inference often makes cloud compute the largest line item in an AI budget. Spot and preemptible instances offer a direct way to reduce it, providing access to unused cloud capacity at discounts of up to 90% compared to on-demand pricing. This cost advantage, however, comes with a significant operational caveat: the cloud provider can reclaim these resources with very little warning. A platform must not only tolerate these interruptions but be architected to handle them gracefully, turning a potential failure into a routine operational event.
Preemption is the forceful termination of a spot instance when the cloud provider needs the capacity back. For AWS, this is a Spot Instance interruption; for GCP, it's a preemptible VM termination; and for Azure, it's an eviction. The instance receives a short notification, typically between 30 seconds and two minutes, before it is shut down. For a long-running training job, an unexpected termination can mean hours of lost work and wasted spend if the job's progress is not preserved.
The key to using spot instances effectively is to build systems that can withstand this unpredictability. The goal is to make job interruption and rescheduling a standard, automated procedure rather than a catastrophic failure.
The lifecycle of a pod on a spot instance, showing the graceful shutdown process triggered by a preemption notice.
Not all workloads are created equal in their tolerance for interruption. The decision to use spot instances must be based on the application's fault-tolerance characteristics.
Ideal Candidates for Spot Instances:
- Training jobs that checkpoint regularly and can resume from the last checkpoint after an interruption.
- Batch and offline work such as data preprocessing, batch inference, and hyperparameter sweeps, where individual tasks can simply be retried.
- Stateless, horizontally scaled workers managed by a queue or job controller.
Poor Candidates for Spot Instances:
- Latency-sensitive online inference endpoints with strict availability requirements.
- Stateful services such as databases, model registries, and experiment trackers.
- Cluster control-plane components and other critical add-ons.
- Long-running jobs that cannot checkpoint or resume.
To reliably run workloads on spot instances, you must implement specific architectural patterns within Kubernetes.
Kubernetes provides a preStop lifecycle hook that runs just before a container is terminated. This is the primary mechanism for reacting to a preemption notice. Depending on the platform, the notice is translated into a node drain either by the kubelet's graceful node shutdown handling or by a dedicated component such as a node termination handler or the autoscaler's interruption handling. Either way, the pods on that node are then terminated, which in turn triggers the preStop hook in each container.
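On AWS, for example, the open source aws-node-termination-handler can run as a DaemonSet that watches the instance metadata service for interruption notices and cordons and drains the affected node, which is what causes the pods, and their preStop hooks, to run through a graceful shutdown. The sketch below is illustrative only: it omits the ServiceAccount and RBAC the handler needs, and the image tag and setting names should be verified against the project's documentation (production installs typically use the official Helm chart).

# Illustrative sketch of aws-node-termination-handler in IMDS mode.
# ServiceAccount/RBAC omitted; verify the image tag and settings against
# the project documentation before use.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: aws-node-termination-handler
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: aws-node-termination-handler
  template:
    metadata:
      labels:
        app: aws-node-termination-handler
    spec:
      serviceAccountName: aws-node-termination-handler
      containers:
      - name: handler
        image: public.ecr.aws/aws-ec2/aws-node-termination-handler:v1.22.0  # illustrative tag
        env:
        # Name of the node this pod runs on, so the handler knows what to drain.
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        # Cordon and drain the node when a spot interruption notice appears.
        - name: ENABLE_SPOT_INTERRUPTION_DRAINING
          value: "true"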
Your training application's container should use this hook to trigger a final, rapid checkpoint to persistent storage (like S3 or a PVC) and exit cleanly.
apiVersion: v1
kind: Pod
metadata:
  name: resilient-training-pod
spec:
  # Give the preStop checkpoint time to finish before the kubelet force-kills
  # the container; keep this within the provider's preemption notice window.
  terminationGracePeriodSeconds: 60
  containers:
  - name: trainer
    image: my-pytorch-trainer:1.0
    command: ["python", "train.py", "--resume-from-checkpoint"]
    lifecycle:
      preStop:
        exec:
          # This script saves the model state to persistent storage
          # before the container is terminated.
          command:
          - "/bin/sh"
          - "-c"
          - "python /app/save_checkpoint.py --path /mnt/checkpoints/final_checkpoint.pt"
    volumeMounts:
    - name: checkpoint-storage
      mountPath: /mnt/checkpoints
  volumes:
  - name: checkpoint-storage
    persistentVolumeClaim:
      claimName: training-pvc
A common and effective strategy is to run a hybrid cluster composed of both on-demand and spot instance node groups:
- On-demand node group: runs critical services that must not be interrupted. These nodes carry a taint (for example, CriticalAddonsOnly=true:NoSchedule) to prevent regular workloads from being scheduled on them.
- Spot node group: runs interruptible workloads such as training jobs. These nodes carry a taint (for example, spot=true:NoExecute) to signal their transient nature.
Workload pods must then have the corresponding toleration to be scheduled on the spot nodes. This ensures that only fault-tolerant applications run on the volatile, low-cost compute.
A Kubernetes cluster architecture isolating critical services on stable on-demand nodes while running interruptible training jobs on a cost-effective spot node group using taints and tolerations.
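The following sketch shows how these taints and labels might appear on the node objects themselves. In practice they are applied by your node group or autoscaler configuration rather than written by hand; the node names are hypothetical, and the lifecycle=spot label is an assumed convention that the affinity rules in the next example match on.

# Illustrative node objects: taints and labels are normally set by the
# node group or autoscaler configuration, not applied by hand.
apiVersion: v1
kind: Node
metadata:
  name: on-demand-node-1   # hypothetical node name
spec:
  taints:
  # Keeps regular workloads off the stable, on-demand nodes.
  - key: CriticalAddonsOnly
    value: "true"
    effect: NoSchedule
---
apiVersion: v1
kind: Node
metadata:
  name: spot-node-1        # hypothetical node name
  labels:
    # Label matched by the node affinity rules in the pod spec below.
    lifecycle: spot
spec:
  taints:
  # Blocks (and evicts) any pod that does not explicitly tolerate spot nodes.
  - key: spot
    value: "true"
    effect: NoExecute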
To implement this, you combine taints, tolerations, and node affinity in your pod specifications.
# Part of a Pod or Deployment specification
spec:
  # This toleration lets the pod be scheduled on, and remain on,
  # nodes carrying the spot=true:NoExecute taint.
  tolerations:
  - key: "spot"
    operator: "Equal"
    value: "true"
    effect: "NoExecute"
  affinity:
    nodeAffinity:
      # This prefers scheduling on nodes labeled as spot instances.
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: lifecycle
            operator: In
            values:
            - spot
      # This prevents scheduling on nodes reserved for critical pods
      # (assumes the on-demand nodes also carry a CriticalAddonsOnly label).
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: CriticalAddonsOnly
            operator: DoesNotExist
The probability of a spot instance being preempted varies by instance type, region, and availability zone. Relying on a single instance type makes your workloads vulnerable to price spikes or capacity shortages for that specific type.
Modern autoscalers like Karpenter or cloud-native configurations like AWS EC2 Fleet should be configured to select from a diverse list of instance types. For example, if your job needs a GPU with 16GB of memory, you can specify a range of suitable instance types (e.g., g4dn.xlarge, g5.xlarge, p3.2xlarge). The autoscaler will then provision whichever type is available at the lowest price, dramatically increasing the likelihood of acquiring capacity and improving resilience.
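As a sketch of what this looks like with Karpenter, the NodePool below restricts provisioning to spot capacity, allows any of the three GPU instance types mentioned above, and applies the spot taint from the hybrid-cluster pattern. The exact schema depends on your Karpenter version (this assumes the karpenter.sh/v1 API), and the referenced EC2NodeClass named default is assumed to exist.

# Sketch of a Karpenter NodePool that diversifies across GPU instance types
# while restricting itself to spot capacity. Verify field names against the
# Karpenter version you run; the "default" EC2NodeClass is assumed to exist.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-spot
spec:
  template:
    metadata:
      labels:
        lifecycle: spot          # matches the node affinity rule above
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      taints:
      - key: spot
        value: "true"
        effect: NoExecute
      requirements:
      # Only request spot capacity.
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["spot"]
      # Let Karpenter pick whichever of these GPU types is available and cheapest.
      - key: node.kubernetes.io/instance-type
        operator: In
        values: ["g4dn.xlarge", "g5.xlarge", "p3.2xlarge"]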
By combining graceful shutdown mechanisms, a hybrid cluster design, and intelligent instance selection, you can use the significant cost savings of spot instances without compromising the reliability of your MLOps platform. This operational discipline is a defining characteristic of a mature and cost-efficient AI infrastructure.