The Spot Instance Opportunity
AWS Spot Instances are spare EC2 capacity sold at discounts of up to 90% off the on-demand price. The catch: AWS can reclaim them with a 2-minute warning when it needs the capacity back. That constraint rules out some workloads, but a surprisingly large portion of a typical production system is spot-compatible with the right architecture.
A $50,000/month EC2 bill can realistically drop to $20,000-25,000 by shifting 50-60% of compute to spot. The arithmetic is compelling; the challenge is designing systems that handle interruptions gracefully.
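A quick sanity check on that arithmetic (the helper function and the 90% discount figure are illustrative; your real blend depends on the instance mix):

```python
def blended_bill(monthly_total: float, spot_share: float, spot_discount: float) -> float:
    """Monthly bill after moving `spot_share` of compute to spot.

    Assumes the shifted portion costs (1 - spot_discount) of its
    on-demand price; the rest stays on-demand.
    """
    on_demand_part = monthly_total * (1 - spot_share)
    spot_part = monthly_total * spot_share * (1 - spot_discount)
    return on_demand_part + spot_part

# Shifting 60% of a $50,000 bill to spot at a 90% discount:
# blended_bill(50_000, 0.6, 0.9) -> ~$23,000/month
```

Shallower discounts land at the top of the quoted range, which is why pool selection (covered below) matters as much as the spot/on-demand split.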
Spot-Safe vs Spot-Unsafe Workloads
SPOT-SAFE (ideal for spot):
✓ Stateless web/API servers (redirect traffic on interruption)
✓ Batch processing (checkpoint state, restart from checkpoint)
✓ CI/CD build agents (rebuild on interruption, no data loss)
✓ ML training (checkpoint model weights every N steps)
✓ Data processing jobs (idempotent with resumable stages)
✓ Development environments (acceptable to restart)
✓ Log aggregation and stream processing (with proper buffering)
✓ Rendering and transcoding (chunk-based work)
SPOT-UNSAFE (keep on on-demand):
✗ Databases (Postgres, MySQL, MongoDB, Redis)
✗ Message broker primaries (Kafka, RabbitMQ)
✗ Stateful services with session state in memory
✗ Single-instance critical services with no HA
✗ Long-running transactions that cannot be interrupted
✗ Services with strict latency SLAs (spot startup latency varies)
Instance Diversification: The Core Strategy
# The biggest spot mistake: using a single instance type
# If that type runs out of spot capacity, your entire fleet goes down
# Instead: specify 10-15 instance types of similar size
# When one pool gets interrupted, others absorb the load
# Equivalent instance types for a "4 vCPU, 16GB RAM" workload:
# m7g.xlarge, m7i.xlarge, m6g.xlarge, m6i.xlarge,
# m5.xlarge, m5a.xlarge, m4.xlarge,
# c7g.2xlarge, c7i.2xlarge (c-family has twice the vCPU per GiB, so the 2xlarge matches on memory)
# AWS recommends: diversify across families AND availability zones
# The more diversity, the more likely you find available capacity
# EKS Managed Node Group with aggressive spot diversification
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: production
  region: us-east-1
managedNodeGroups:
  - name: spot-workers
    spot: true
    instanceTypes:
      - m7g.xlarge   # Graviton 4 vCPU 16 GiB
      - m7g.2xlarge  # Graviton 8 vCPU 32 GiB
      - m7i.xlarge   # Intel 4 vCPU 16 GiB
      - m6g.xlarge   # Graviton previous gen
      - m6i.xlarge   # Intel previous gen
      - m5.xlarge    # Intel two gens old
      - c7g.2xlarge  # Compute Graviton (similar resources)
      - c6g.2xlarge  # Compute Graviton prev gen
      - r7g.large    # Memory Graviton (less CPU)
    minSize: 2
    maxSize: 100
    desiredCapacity: 10
    availabilityZones:
      - us-east-1a
      - us-east-1b
      - us-east-1c
    labels:
      lifecycle: spot
    taints:
      - key: lifecycle
        value: spot
        effect: PreferNoSchedule  # Prefer spot-tolerant placement but allow fallback
Handling Spot Interruptions
# AWS sends an interruption notice 2 minutes before reclaiming a Spot instance
# The notice appears at: http://169.254.169.254/latest/meta-data/spot/instance-action
# and as an EC2 Spot Instance interruption EventBridge event

# Python health check that polls for the spot interruption notice
import logging

import requests

logger = logging.getLogger(__name__)

def check_spot_interruption():
    """Poll instance metadata for a spot interruption notice."""
    try:
        # IMDSv2 session token
        token_response = requests.put(
            'http://169.254.169.254/latest/api/token',
            headers={'X-aws-ec2-metadata-token-ttl-seconds': '21600'},
            timeout=1,
        )
        token = token_response.text
        response = requests.get(
            'http://169.254.169.254/latest/meta-data/spot/instance-action',
            headers={'X-aws-ec2-metadata-token': token},
            timeout=1,
        )
        # 200 means a notice is pending; 404 (no notice) falls through
        if response.status_code == 200:
            logger.warning("Spot interruption notice received! Action: %s", response.text)
            return True
    except requests.exceptions.RequestException:
        # Treat timeouts/network errors as "no notice"
        pass
    return False
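For a standalone worker outside Kubernetes, a small polling loop can wrap check_spot_interruption. This is a sketch: on_interrupt stands in for your own drain logic (deregister from the load balancer, flush buffers, exit), and max_polls exists mainly to make the loop testable.

```python
import time

def watch_for_interruption(check_fn, on_interrupt, poll_seconds=5, max_polls=None):
    """Run until check_fn() returns True, then fire on_interrupt once.

    check_fn: e.g. check_spot_interruption from above.
    on_interrupt: your shutdown hook (deregister, flush, checkpoint).
    max_polls: optional iteration cap, handy for testing.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        if check_fn():
            on_interrupt()  # roughly 2 minutes remain from this point
            return True
        polls += 1
        time.sleep(poll_seconds)
    return False
```

A 5-second poll interval leaves most of the 2-minute window for the drain itself.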
# Kubernetes node termination handler (use AWS Node Termination Handler instead)
# helm install aws-node-termination-handler \
#   eks/aws-node-termination-handler \
#   --namespace kube-system \
#   --set enableSpotInterruptionDraining=true \
#   --set enableRebalanceMonitoring=true \
#   --set enableScheduledEventDraining=true
#
# This automatically:
# 1. Detects 2-min interruption notice
# 2. Cordons the node (no new pods scheduled)
# 3. Drains pods gracefully (respects PodDisruptionBudgets)
# 4. Pods reschedule to healthy nodes before termination
# PodDisruptionBudget: ensures minimum availability during drains
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
  namespace: production
spec:
  minAvailable: 2  # Always keep at least 2 replicas running
  # Or use: maxUnavailable: 1
  selector:
    matchLabels:
      app: api
---
# Deployment with proper termination handling
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 5
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      # Give pods 30s to finish in-flight requests
      terminationGracePeriodSeconds: 30
      containers:
        - name: api
          lifecycle:
            preStop:
              exec:
                # Sleep briefly so the load balancer can drain connections
                command: ["/bin/sh", "-c", "sleep 5"]
          # Readiness probe: pod is removed from the LB when not ready
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
            failureThreshold: 2
Spot Fleet for Batch Workloads
{
  "SpotFleetRequestConfig": {
    "IamFleetRole": "arn:aws:iam::123456789012:role/spot-fleet-role",
    "TargetCapacity": 100,
    "TargetCapacityUnitType": "vcpu",
    "AllocationStrategy": "priceCapacityOptimized",
    "InstanceInterruptionBehavior": "terminate",
    "LaunchTemplateConfigs": [
      {
        "LaunchTemplateSpecification": {
          "LaunchTemplateId": "lt-0123456789abcdef0",
          "Version": "$Latest"
        },
        "Overrides": [
          {"InstanceType": "c7g.2xlarge", "WeightedCapacity": 8},
          {"InstanceType": "c6g.2xlarge", "WeightedCapacity": 8},
          {"InstanceType": "c5.2xlarge", "WeightedCapacity": 8},
          {"InstanceType": "m7g.xlarge", "WeightedCapacity": 4},
          {"InstanceType": "m6g.xlarge", "WeightedCapacity": 4},
          {"InstanceType": "m5.xlarge", "WeightedCapacity": 4}
        ]
      }
    ]
  }
}
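To see how weighted capacity fills a vCPU-denominated target, here is a rough sketch. The helper is illustrative (the weights in the config above mirror each type's vCPU count), and the fleet's real mix also depends on price and pool availability:

```python
def instances_needed(target_capacity: int, weighted_capacity: int) -> int:
    """How many instances of one pool would satisfy the target on their own.

    AWS rounds up per pool so the target capacity is never undershot;
    this helper uses ceiling division to mirror that.
    """
    return -(-target_capacity // weighted_capacity)  # ceiling division

# TargetCapacity of 100 vCPUs:
# instances_needed(100, 8) -> 13 c7g.2xlarge (8 vCPU each)
# instances_needed(100, 4) -> 25 m7g.xlarge (4 vCPU each)
```

In practice the fleet blends pools, so the real allocation is a mix of these counts rather than any single pool filling the target alone.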
Auto Scaling Group Mixed Instances Policy
resource "aws_autoscaling_group" "workers" {
  name                = "spot-workers"
  min_size            = 2
  max_size            = 50
  desired_capacity    = 10
  vpc_zone_identifier = module.vpc.private_subnets

  mixed_instances_policy {
    instances_distribution {
      on_demand_base_capacity                  = 2  # Always keep 2 on-demand
      on_demand_percentage_above_base_capacity = 0  # 0% on-demand above base = all spot
      spot_allocation_strategy                 = "price-capacity-optimized"
      # price-capacity-optimized: best balance of lowest price + highest availability
      # capacity-optimized: prioritize availability over price
      # lowest-price: cheapest but highest interruption risk
    }

    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.worker.id
        version            = "$Latest"
      }

      override {
        instance_type     = "m7g.xlarge"
        weighted_capacity = "1"
      }
      override {
        instance_type     = "m6g.xlarge"
        weighted_capacity = "1"
      }
      override {
        instance_type     = "m7i.xlarge"
        weighted_capacity = "1"
      }
      override {
        instance_type     = "m6i.xlarge"
        weighted_capacity = "1"
      }
      override {
        instance_type     = "m5.xlarge"
        weighted_capacity = "1"
      }
      override {
        instance_type     = "c7g.2xlarge"
        weighted_capacity = "2"  # 2x CPU weight
      }
    }
  }

  tag {
    key                 = "k8s.io/cluster-autoscaler/enabled"
    value               = "true"
    propagate_at_launch = true
  }
}
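The base/percentage split in instances_distribution can be sanity-checked with a small helper. This is a sketch of the documented behavior; AWS's exact rounding at fractional splits may differ:

```python
def capacity_split(desired: int, on_demand_base: int, on_demand_pct_above_base: int):
    """Mirror the ASG mixed-instances on-demand/spot split.

    on_demand_base instances are always on-demand; of the capacity above
    that base, on_demand_pct_above_base percent stays on-demand and the
    rest goes to spot.
    """
    above_base = max(desired - on_demand_base, 0)
    on_demand_above = round(above_base * on_demand_pct_above_base / 100)
    on_demand = min(desired, on_demand_base) + on_demand_above
    spot = desired - on_demand
    return on_demand, spot

# With the Terraform values above (base=2, 0% above base):
# capacity_split(10, 2, 0) -> (2, 8): 2 on-demand, 8 spot
```

Raising on_demand_percentage_above_base_capacity is the simplest dial for trading cost against interruption exposure without restructuring the group.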
ML Training with Spot Checkpointing
# PyTorch training with spot-aware checkpointing
import os
import signal

import torch

class SpotAwareTrainer:
    def __init__(self, model, optimizer, checkpoint_dir, checkpoint_interval=100):
        self.model = model
        self.optimizer = optimizer
        self.checkpoint_dir = checkpoint_dir
        self.checkpoint_interval = checkpoint_interval
        self.interrupted = False
        self.current_epoch = 0
        self.current_step = 0
        self.current_loss = None
        # Register SIGTERM handler (sent ahead of spot termination
        # by the node drain / termination handler)
        signal.signal(signal.SIGTERM, self._handle_sigterm)

    def _handle_sigterm(self, signum, frame):
        """Save a checkpoint immediately on the spot interruption signal."""
        print("SIGTERM received - saving emergency checkpoint")
        self.save_checkpoint("emergency")
        self.interrupted = True

    def save_checkpoint(self, tag=""):
        path = os.path.join(self.checkpoint_dir, f"checkpoint_{tag}.pt")
        torch.save({
            'epoch': self.current_epoch,
            'step': self.current_step,
            'model_state_dict': self.model.state_dict(),
            'optimizer_state_dict': self.optimizer.state_dict(),
            'loss': self.current_loss,
        }, path)
        # Also upload to S3 immediately (user-provided helper;
        # local disk dies with the instance)
        self._upload_to_s3(path)

    def load_checkpoint(self):
        """Resume from the latest checkpoint on restart."""
        checkpoints = sorted(
            f for f in os.listdir(self.checkpoint_dir)
            if f.startswith('checkpoint_')
        )
        if checkpoints:
            path = os.path.join(self.checkpoint_dir, checkpoints[-1])
            checkpoint = torch.load(path)
            self.model.load_state_dict(checkpoint['model_state_dict'])
            self.optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
            return checkpoint['epoch'], checkpoint['step']
        return 0, 0

    def train(self, dataloader, epochs):
        start_epoch, start_step = self.load_checkpoint()
        for epoch in range(start_epoch, epochs):
            self.current_epoch = epoch
            for step, batch in enumerate(dataloader):
                if epoch == start_epoch and step < start_step:
                    continue  # Skip already-processed steps
                self.current_step = step
                loss = self.training_step(batch)  # user-provided
                self.current_loss = loss
                # Regular checkpoint
                if step % self.checkpoint_interval == 0:
                    self.save_checkpoint(f"{epoch}_{step}")
                if self.interrupted:
                    print("Training interrupted - resumable from checkpoint")
                    return
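One caveat in load_checkpoint above: a plain string sort puts checkpoint_10_0.pt before checkpoint_2_900.pt. A numeric sort key avoids that. The sketch below assumes the checkpoint_{epoch}_{step}.pt naming scheme used above and deliberately ranks the emergency checkpoint (no numeric suffix) as newest, so delete it after a successful resume:

```python
import re

def latest_checkpoint(filenames):
    """Pick the newest checkpoint by (epoch, step) instead of string order."""
    def key(name):
        m = re.match(r"checkpoint_(\d+)_(\d+)\.pt$", name)
        if m:
            return (int(m.group(1)), int(m.group(2)))
        # Non-numeric names (e.g. checkpoint_emergency.pt) rank newest
        return (float("inf"), float("inf"))
    names = [n for n in filenames if n.startswith("checkpoint_")]
    return max(names, key=key, default=None)
```

Swapping this into load_checkpoint is a one-line change and makes resume order robust past epoch 9.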
Real-World Savings Reference
Current spot prices (us-east-1, March 2026 approximate):
Instance Type    On-Demand    Spot        Savings
-------------------------------------------------
m7g.xlarge       $0.163/hr    $0.049/hr   70%
c7g.2xlarge      $0.290/hr    $0.073/hr   75%
r7g.xlarge       $0.254/hr    $0.063/hr   75%
g5.xlarge        $1.006/hr    $0.302/hr   70%
p4d.24xlarge     $32.77/hr    $9.83/hr    70%
Spot prices fluctuate. Use EC2 Spot Price History to pick pools with
consistent low pricing and rare interruptions.
For 50 spot workers (c7g.2xlarge) running 24/7:
On-demand: $0.290 × 50 × 720 hr = $10,440/month
Spot:      $0.073 × 50 × 720 hr =  $2,628/month
Saving:    $7,812/month
Even accounting for 5% interruption overhead (duplicate work, restarts):
Net saving: $7,400/month
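That overhead adjustment can be sketched as follows. It is a simplification, modeling interruption overhead as a flat cut on the gross saving; real overhead depends on checkpoint granularity and restart times:

```python
def net_monthly_saving(on_demand_rate, spot_rate, instances, hours=720,
                       interruption_overhead=0.05):
    """Monthly spot saving, discounted for duplicated work and restarts."""
    gross = (on_demand_rate - spot_rate) * instances * hours
    return gross * (1 - interruption_overhead)

# 50 c7g.2xlarge workers, 5% overhead:
# net_monthly_saving(0.290, 0.073, 50) -> ~$7,421/month
```

If your checkpoints are coarse (say, every 30 minutes of GPU work), raise the overhead fraction before trusting the projection.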
Conclusion
Spot instances are not a niche cost-cutting trick; they are a fundamental part of cost-efficient cloud architecture. The key is accepting that interruptions will happen, designing stateless or checkpointing systems, and diversifying across many instance pools so interruptions rarely hit the whole fleet at once. Done right, you achieve 70-90% compute cost reduction on the portions of your infrastructure that make up the bulk of the bill.
Marcus Rodriguez
Lead DevOps Engineer specializing in CI/CD pipelines, container orchestration, and infrastructure automation.