The Spot Instance Opportunity
AWS Spot Instances are spare EC2 capacity sold at discounts of up to 90% off the on-demand price. The catch: AWS can reclaim them with a 2-minute warning when it needs the capacity back. That constraint rules out some workloads, but a surprisingly large portion of a typical production system is spot-compatible with the right architecture.
A $50,000/month EC2 bill can realistically drop to $20,000-25,000 by shifting 50-60% of compute to spot. The arithmetic is compelling; the challenge is designing systems that handle interruptions gracefully.
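A quick sanity check on that arithmetic (the helper function and the 90% discount figure are illustrative; your real blend depends on the instance mix):

```python
def blended_bill(monthly_total: float, spot_share: float, spot_discount: float) -> float:
    """Monthly bill after moving `spot_share` of compute to spot.

    Assumes the shifted portion costs (1 - spot_discount) of its
    on-demand price; the rest stays on-demand.
    """
    on_demand_part = monthly_total * (1 - spot_share)
    spot_part = monthly_total * spot_share * (1 - spot_discount)
    return on_demand_part + spot_part

# Shifting 60% of a $50,000 bill to spot at a 90% discount:
# blended_bill(50_000, 0.6, 0.9) -> ~$23,000/month
```

Shallower discounts land at the top of the quoted range, which is why pool selection (covered below) matters as much as the spot/on-demand split.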
Spot-Safe vs Spot-Unsafe Workloads
SPOT-SAFE (ideal for spot):
✓ Stateless web/API servers (redirect traffic on interruption)
✓ Batch processing (checkpoint state, restart from checkpoint)
✓ CI/CD build agents (rebuild on interruption, no data loss)
✓ ML training (checkpoint model weights every N steps)
✓ Data processing jobs (idempotent with resumable stages)
✓ Development environments (acceptable to restart)
✓ Log aggregation and stream processing (with proper buffering)
✓ Rendering and transcoding (chunk-based work)
SPOT-UNSAFE (keep on on-demand):
✗ Databases (Postgres, MySQL, MongoDB, Redis)
✗ Message broker primaries (Kafka, RabbitMQ)
✗ Stateful services with session state in memory
✗ Single-instance critical services with no HA
✗ Long-running transactions that cannot be interrupted
✗ Services with strict latency SLAs (spot startup latency varies)
Instance Diversification: The Core Strategy
# The biggest spot mistake: using a single instance type
# If that type runs out of spot capacity, your entire fleet goes down
# Instead: specify 10-15 instance types of similar size
# When one pool gets interrupted, others absorb the load
# Equivalent instance types for a "4 vCPU, 16GB RAM" workload:
# m7g.xlarge, m7i.xlarge, m6g.xlarge, m6i.xlarge,
# m5.xlarge, m5a.xlarge, m4.xlarge,
# c7g.2xlarge, c7i.2xlarge (c-family has twice the vCPU per GiB, so the 2xlarge matches on memory)
# AWS recommends: diversify across families AND availability zones
# The more diversity, the more likely you find available capacity
# EKS Managed Node Group with aggressive spot diversification
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: production
  region: us-east-1
managedNodeGroups:
  - name: spot-workers
    spot: true
    instanceTypes:
      - m7g.xlarge   # Graviton 4 vCPU 16 GiB
      - m7g.2xlarge  # Graviton 8 vCPU 32 GiB
      - m7i.xlarge   # Intel 4 vCPU 16 GiB
      - m6g.xlarge   # Graviton previous gen
      - m6i.xlarge   # Intel previous gen
      - m5.xlarge    # Intel two gens old
      - c7g.2xlarge  # Compute Graviton (similar resources)
      - c6g.2xlarge  # Compute Graviton prev gen
      - r7g.large    # Memory Graviton (less CPU)
    minSize: 2
    maxSize: 100
    desiredCapacity: 10
    availabilityZones:
      - us-east-1a
      - us-east-1b
      - us-east-1c
    labels:
      lifecycle: spot
    taints:
      - key: lifecycle
        value: spot
        effect: PreferNoSchedule  # Prefer spot-tolerant placement but allow fallback
Handling Spot Interruptions
# AWS sends an interruption notice 2 minutes before reclaiming a Spot instance
# The notice appears at: http://169.254.169.254/latest/meta-data/spot/instance-action
# and as an EC2 Spot Instance interruption EventBridge event

# Python health check that polls for the spot interruption notice
import logging

import requests

logger = logging.getLogger(__name__)

def check_spot_interruption():
    """Poll instance metadata for a spot interruption notice."""
    try:
        # IMDSv2 session token
        token_response = requests.put(
            'http://169.254.169.254/latest/api/token',
            headers={'X-aws-ec2-metadata-token-ttl-seconds': '21600'},
            timeout=1,
        )
        token = token_response.text
        response = requests.get(
            'http://169.254.169.254/latest/meta-data/spot/instance-action',
            headers={'X-aws-ec2-metadata-token': token},
            timeout=1,
        )
        # 200 means a notice is pending; 404 (no notice) falls through
        if response.status_code == 200:
            logger.warning("Spot interruption notice received! Action: %s", response.text)
            return True
    except requests.exceptions.RequestException:
        # Treat timeouts/network errors as "no notice"
        pass
    return False
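For a standalone worker outside Kubernetes, a small polling loop can wrap check_spot_interruption. This is a sketch: on_interrupt stands in for your own drain logic (deregister from the load balancer, flush buffers, exit), and max_polls exists mainly to make the loop testable.

```python
import time

def watch_for_interruption(check_fn, on_interrupt, poll_seconds=5, max_polls=None):
    """Run until check_fn() returns True, then fire on_interrupt once.

    check_fn: e.g. check_spot_interruption from above.
    on_interrupt: your shutdown hook (deregister, flush, checkpoint).
    max_polls: optional iteration cap, handy for testing.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        if check_fn():
            on_interrupt()  # roughly 2 minutes remain from this point
            return True
        polls += 1
        time.sleep(poll_seconds)
    return False
```

A 5-second poll interval leaves most of the 2-minute window for the drain itself.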
# Kubernetes node termination handler (use AWS Node Termination Handler instead)
# helm install aws-node-termination-handler \
#   eks/aws-node-termination-handler \
#   --namespace kube-system \
#   --set enableSpotInterruptionDraining=true \
#   --set enableRebalanceMonitoring=true \
#   --set enableScheduledEventDraining=true
#
# This automatically:
# 1. Detects 2-min interruption notice
# 2. Cordons the node (no new pods scheduled)
# 3. Drains pods gracefully (respects PodDisruptionBudgets)
# 4. Pods reschedule to healthy nodes before termination
# PodDisruptionBudget: ensures minimum availability during drains
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
  namespace: production
spec:
  minAvailable: 2  # Always keep at least 2 replicas running
  # Or use: maxUnavailable: 1
  selector:
    matchLabels:
      app: api
---
# Deployment with proper termination handling
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 5
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      # Give pods 30s to finish in-flight requests
      terminationGracePeriodSeconds: 30
      containers:
        - name: api
          lifecycle:
            preStop:
              exec:
                # Sleep briefly so the load balancer can drain connections
                command: ["/bin/sh", "-c", "sleep 5"]
          # Readiness probe: pod is removed from the LB when not ready
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
            failureThreshold: 2
Spot Fleet for Batch Workloads
{
  "SpotFleetRequestConfig": {
    "IamFleetRole": "arn:aws:iam::123456789012:role/spot-fleet-role",
    "TargetCapacity": 100,
    "TargetCapacityUnitType": "vcpu",
    "AllocationStrategy": "priceCapacityOptimized",
    "InstanceInterruptionBehavior": "terminate",
    "LaunchTemplateConfigs": [
      {
        "LaunchTemplateSpecification": {
          "LaunchTemplateId": "lt-0123456789abcdef0",
          "Version": "$Latest"
        },
        "Overrides": [
          {"InstanceType": "c7g.2xlarge", "WeightedCapacity": 8},
          {"InstanceType": "c6g.2xlarge", "WeightedCapacity": 8},
          {"InstanceType": "c5.2xlarge", "WeightedCapacity": 8},
          {"InstanceType": "m7g.xlarge", "WeightedCapacity": 4},
          {"InstanceType": "m6g.xlarge", "WeightedCapacity": 4},
          {"InstanceType": "m5.xlarge", "WeightedCapacity": 4}
        ]
      }
    ]
  }
}
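To see how weighted capacity fills a vCPU-denominated target, here is a rough sketch. The helper is illustrative (the weights in the config above mirror each type's vCPU count), and the fleet's real mix also depends on price and pool availability:

```python
def instances_needed(target_capacity: int, weighted_capacity: int) -> int:
    """How many instances of one pool would satisfy the target on their own.

    AWS rounds up per pool so the target capacity is never undershot;
    this helper uses ceiling division to mirror that.
    """
    return -(-target_capacity // weighted_capacity)  # ceiling division

# TargetCapacity of 100 vCPUs:
# instances_needed(100, 8) -> 13 c7g.2xlarge (8 vCPU each)
# instances_needed(100, 4) -> 25 m7g.xlarge (4 vCPU each)
```

In practice the fleet blends pools, so the real allocation is a mix of these counts rather than any single pool filling the target alone.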
Auto Scaling Group Mixed Instances Policy
resource "aws_autoscaling_group" "workers" {
  name                = "spot-workers"
  min_size            = 2
  max_size            = 50
  desired_capacity    = 10
  vpc_zone_identifier = module.vpc.private_subnets

  mixed_instances_policy {
    instances_distribution {
      on_demand_base_capacity                  = 2  # Always keep 2 on-demand
      on_demand_percentage_above_base_capacity = 0  # 0% on-demand above base = all spot
      spot_allocation_strategy                 = "price-capacity-optimized"
      # price-capacity-optimized: best balance of lowest price + highest availability
      # capacity-optimized: prioritize availability over price
      # lowest-price: cheapest but highest interruption risk
    }

    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.worker.id
        version            = "$Latest"
      }

      override {
        instance_type     = "m7g.xlarge"
        weighted_capacity = "1"
      }
      override {
        instance_type     = "m6g.xlarge"
        weighted_capacity = "1"
      }
      override {
        instance_type     = "m7i.xlarge"
        weighted_capacity = "1"
      }
      override {
        instance_type     = "m6i.xlarge"
        weighted_capacity = "1"
      }
      override {
        instance_type     = "m5.xlarge"
        weighted_capacity = "1"
      }
      override {
        instance_type     = "c7g.2xlarge"
        weighted_capacity = "2"  # 2x CPU weight
      }
    }
  }

  tag {
    key                 = "k8s.io/cluster-autoscaler/enabled"
    value               = "true"
    propagate_at_launch = true
  }
}
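The base/percentage split in instances_distribution can be sanity-checked with a small helper. This is a sketch of the documented behavior; AWS's exact rounding at fractional splits may differ:

```python
def capacity_split(desired: int, on_demand_base: int, on_demand_pct_above_base: int):
    """Mirror the ASG mixed-instances on-demand/spot split.

    on_demand_base instances are always on-demand; of the capacity above
    that base, on_demand_pct_above_base percent stays on-demand and the
    rest goes to spot.
    """
    above_base = max(desired - on_demand_base, 0)
    on_demand_above = round(above_base * on_demand_pct_above_base / 100)
    on_demand = min(desired, on_demand_base) + on_demand_above
    spot = desired - on_demand
    return on_demand, spot

# With the Terraform values above (base=2, 0% above base):
# capacity_split(10, 2, 0) -> (2, 8): 2 on-demand, 8 spot
```

Raising on_demand_percentage_above_base_capacity is the simplest dial for trading cost against interruption exposure without restructuring the group.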
ML Training with Spot Checkpointing
# PyTorch training with spot-aware checkpointing
import os
import signal

import torch

class SpotAwareTrainer:
    def __init__(self, model, optimizer, checkpoint_dir, checkpoint_interval=100):
        self.model = model
        self.optimizer = optimizer
        self.checkpoint_dir = checkpoint_dir
        self.checkpoint_interval = checkpoint_interval
        self.interrupted = False
        self.current_epoch = 0
        self.current_step = 0
        self.current_loss = None
        # Register SIGTERM handler (sent ahead of spot termination
        # by the node drain / termination handler)
        signal.signal(signal.SIGTERM, self._handle_sigterm)

    def _handle_sigterm(self, signum, frame):
        """Save a checkpoint immediately on the spot interruption signal."""
        print("SIGTERM received - saving emergency checkpoint")
        self.save_checkpoint("emergency")
        self.interrupted = True

    def save_checkpoint(self, tag=""):
        path = os.path.join(self.checkpoint_dir, f"checkpoint_{tag}.pt")
        torch.save({
            'epoch': self.current_epoch,
            'step': self.current_step,
            'model_state_dict': self.model.state_dict(),
            'optimizer_state_dict': self.optimizer.state_dict(),
            'loss': self.current_loss,
        }, path)
        # Also upload to S3 immediately (user-provided helper;
        # local disk dies with the instance)
        self._upload_to_s3(path)

    def load_checkpoint(self):
        """Resume from the latest checkpoint on restart."""
        checkpoints = sorted(
            f for f in os.listdir(self.checkpoint_dir)
            if f.startswith('checkpoint_')
        )
        if checkpoints:
            path = os.path.join(self.checkpoint_dir, checkpoints[-1])
            checkpoint = torch.load(path)
            self.model.load_state_dict(checkpoint['model_state_dict'])
            self.optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
            return checkpoint['epoch'], checkpoint['step']
        return 0, 0

    def train(self, dataloader, epochs):
        start_epoch, start_step = self.load_checkpoint()
        for epoch in range(start_epoch, epochs):
            self.current_epoch = epoch
            for step, batch in enumerate(dataloader):
                if epoch == start_epoch and step < start_step:
                    continue  # Skip already-processed steps
                self.current_step = step
                loss = self.training_step(batch)  # user-provided
                self.current_loss = loss
                # Regular checkpoint
                if step % self.checkpoint_interval == 0:
                    self.save_checkpoint(f"{epoch}_{step}")
                if self.interrupted:
                    print("Training interrupted - resumable from checkpoint")
                    return
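One caveat in load_checkpoint above: a plain string sort puts checkpoint_10_0.pt before checkpoint_2_900.pt. A numeric sort key avoids that. The sketch below assumes the checkpoint_{epoch}_{step}.pt naming scheme used above and deliberately ranks the emergency checkpoint (no numeric suffix) as newest, so delete it after a successful resume:

```python
import re

def latest_checkpoint(filenames):
    """Pick the newest checkpoint by (epoch, step) instead of string order."""
    def key(name):
        m = re.match(r"checkpoint_(\d+)_(\d+)\.pt$", name)
        if m:
            return (int(m.group(1)), int(m.group(2)))
        # Non-numeric names (e.g. checkpoint_emergency.pt) rank newest
        return (float("inf"), float("inf"))
    names = [n for n in filenames if n.startswith("checkpoint_")]
    return max(names, key=key, default=None)
```

Swapping this into load_checkpoint is a one-line change and makes resume order robust past epoch 9.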
Real-World Savings Reference
Current spot prices (us-east-1, March 2026 approximate):
Instance Type    On-Demand    Spot        Savings
-------------------------------------------------
m7g.xlarge       $0.163/hr    $0.049/hr   70%
c7g.2xlarge      $0.290/hr    $0.073/hr   75%
r7g.xlarge       $0.254/hr    $0.063/hr   75%
g5.xlarge        $1.006/hr    $0.302/hr   70%
p4d.24xlarge     $32.77/hr    $9.83/hr    70%
Spot prices fluctuate. Use EC2 Spot Price History to pick pools with
consistent low pricing and rare interruptions.
For 50 spot workers (c7g.2xlarge) running 24/7:
On-demand: $0.290 × 50 × 720 hr = $10,440/month
Spot:      $0.073 × 50 × 720 hr =  $2,628/month
Saving:    $7,812/month
Even accounting for 5% interruption overhead (duplicate work, restarts):
Net saving: $7,400/month
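That overhead adjustment can be sketched as follows. It is a simplification, modeling interruption overhead as a flat cut on the gross saving; real overhead depends on checkpoint granularity and restart times:

```python
def net_monthly_saving(on_demand_rate, spot_rate, instances, hours=720,
                       interruption_overhead=0.05):
    """Monthly spot saving, discounted for duplicated work and restarts."""
    gross = (on_demand_rate - spot_rate) * instances * hours
    return gross * (1 - interruption_overhead)

# 50 c7g.2xlarge workers, 5% overhead:
# net_monthly_saving(0.290, 0.073, 50) -> ~$7,421/month
```

If your checkpoints are coarse (say, every 30 minutes of GPU work), raise the overhead fraction before trusting the projection.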
Conclusion
Spot instances are not a niche cost-cutting trick; they are a fundamental part of cost-efficient cloud architecture. The key is accepting that interruptions will happen, designing stateless or checkpointing systems, and diversifying across many instance pools so interruptions rarely hit the whole fleet at once. Done right, you achieve 70-90% compute cost reduction on the portions of your infrastructure that make up the bulk of the bill.
Marcus Rodriguez
Lead DevOps Engineer specializing in CI/CD pipelines, container orchestration, and infrastructure automation.