The Over-Provisioning Tax
Most cloud infrastructure is sized for peak load and left running 24/7. A typical web application has a traffic pattern that varies 10:1 between peak and off-peak. If you provision for peak, you are paying for 100% of capacity while using only a small fraction of it for most of the hours in a day. That is an enormous waste.
Auto-scaling eliminates this waste by matching capacity to demand in real time. The goal is not the lowest possible capacity — it is right-sizing across time: full capacity during peak, minimal capacity at 3am, and fast scale-up when a viral moment hits.
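Using the article's own rough figures (10% of peak load for 80% of the time, full load otherwise; both are illustrative assumptions, not measurements), the average utilization of a peak-sized fleet works out like this:

```python
# Average utilization of a fleet sized for peak, under a 10:1 demand curve:
# assume 10% of peak load for 80% of the hours, full load for the remaining 20%.
avg_util = 0.10 * 0.80 + 1.00 * 0.20
print(f"average utilization: {avg_util:.0%}")             # 28%
print(f"capacity paid for but idle: {1 - avg_util:.0%}")  # 72%
```

Even with this generous model, nearly three quarters of what you pay for sits idle.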
Kubernetes HPA: Beyond CPU Scaling
# Basic HPA (everyone has this)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2
  maxReplicas: 50
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60  # Scale at 60% CPU average
---
# Advanced HPA: multiple metrics + custom metrics
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa-advanced
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 100
  metrics:
  # CPU: primary signal
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
  # Memory: secondary signal
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 70
  # Custom metric from Prometheus: HTTP request queue depth
  - type: Pods
    pods:
      metric:
        name: http_requests_in_flight
      target:
        type: AverageValue
        averageValue: "10"  # Scale if avg > 10 in-flight requests per pod
  # Tuned behavior: scale up fast, scale down slow
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30   # Act on the highest recommendation of the last 30s
      policies:
      - type: Percent
        value: 100                     # Can double replicas in one step
        periodSeconds: 60
      - type: Pods
        value: 10                      # Or add up to 10 pods
        periodSeconds: 60
      selectPolicy: Max                # Use the policy that allows scaling the most
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 min before scale-down (avoids flapping)
      policies:
      - type: Percent
        value: 20                      # Remove at most 20% of replicas per 2 minutes
        periodSeconds: 120
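Both manifests drive the same control loop: for each metric the HPA proposes desiredReplicas = ceil(currentReplicas × currentValue / targetValue), acts on the largest proposal, and clamps it to the min/max bounds. A minimal sketch of that calculation (the function name is mine, not a Kubernetes API):

```python
from math import ceil

def hpa_desired(current_replicas: int, metrics: list[tuple[float, float]],
                min_replicas: int, max_replicas: int) -> int:
    """metrics: (current_value, target_value) pairs, e.g. CPU% or in-flight requests."""
    # The HPA evaluates every metric and uses the highest proposal.
    desired = max(ceil(current_replicas * cur / target) for cur, target in metrics)
    return max(min_replicas, min(max_replicas, desired))

# 10 pods at 90% CPU (target 60%) and 12 in-flight requests/pod (target 10):
print(hpa_desired(10, [(90, 60), (12, 10)], min_replicas=3, max_replicas=100))  # 15
```

Here the CPU metric wins: it proposes 15 replicas while the queue-depth metric proposes only 12. The behavior policies above then rate-limit how fast the controller is allowed to move toward that target.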
KEDA: Event-Driven Scaling to Zero
# KEDA (Kubernetes Event-Driven Autoscaling)
# - Scales from 0 to N based on external event sources
# - Supports 60+ scalers: Kafka, SQS, Redis, Prometheus, Cron, etc.
# - Scale to ZERO = pay nothing when idle

# Install KEDA
helm repo add kedacore https://kedacore.github.io/charts
helm install keda kedacore/keda --namespace keda --create-namespace --version 2.15.0

# Scale workers based on SQS queue depth
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor-scaler
spec:
  scaleTargetRef:
    name: order-processor
  minReplicaCount: 0   # Scale to ZERO when the queue is empty!
  maxReplicaCount: 50
  pollingInterval: 15  # Check queue depth every 15 seconds
  cooldownPeriod: 60   # Wait 60s after the last active trigger before scaling to zero
  triggers:
  - type: aws-sqs-queue
    metadata:
      queueURL: https://sqs.us-east-1.amazonaws.com/123456789/orders
      queueLength: "5"         # 1 pod per 5 messages in queue
      awsRegion: us-east-1
      identityOwner: operator  # Use the KEDA operator's pod identity (IRSA)
---
# Scale based on Kafka consumer lag
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: event-consumer-scaler
spec:
  scaleTargetRef:
    name: event-consumer
  minReplicaCount: 0
  maxReplicaCount: 30
  triggers:
  - type: kafka
    metadata:
      bootstrapServers: kafka-headless.kafka:9092
      consumerGroup: event-processors
      topic: user-events
      lagThreshold: "100"  # 1 pod per 100 messages of lag
      offsetResetPolicy: latest
---
# Scale Celery workers based on Redis queue length
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: celery-worker-scaler
spec:
  scaleTargetRef:
    name: celery-worker
  minReplicaCount: 0
  maxReplicaCount: 20
  triggers:
  - type: redis
    metadata:
      address: redis:6379
      listName: celery
      listLength: "10"  # 1 pod per 10 queued tasks
    authenticationRef:
      name: redis-auth
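Under the hood, KEDA feeds these triggers into the same HPA machinery: each trigger proposes roughly ceil(metricValue / threshold) replicas, the largest proposal wins, and the 0-to-1 activation step is handled by KEDA itself. A simplified single-trigger sketch (the helper name is illustrative, not KEDA's API):

```python
from math import ceil

def keda_replicas(queue_depth: int, per_pod_target: int,
                  min_replicas: int = 0, max_replicas: int = 50) -> int:
    # No pending work: KEDA deactivates the workload (scale to zero).
    if queue_depth == 0:
        return min_replicas
    # Otherwise the usual HPA math applies, clamped to maxReplicaCount.
    return max(1, min(max_replicas, ceil(queue_depth / per_pod_target)))

print(keda_replicas(0, 5))     # 0: queue empty, no pods
print(keda_replicas(37, 5))    # 8: 37 SQS messages at 5 per pod
print(keda_replicas(1000, 5))  # 50: capped at maxReplicaCount
```

The cap matters during backlog spikes: a burst of 1,000 messages does not spawn 200 pods, it saturates at the ceiling you set and drains the queue at that rate.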
Scheduled Scaling for Predictable Patterns
# Use cron-based scaling for known patterns (business hours, batch windows).
# KEDA cron scaler for business hours:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: api-cron-scaler
spec:
  scaleTargetRef:
    name: api
  minReplicaCount: 2
  maxReplicaCount: 50
  triggers:
  # Business hours: more replicas
  - type: cron
    metadata:
      timezone: America/New_York
      start: "0 8 * * 1-5"   # Mon-Fri 8am
      end: "0 20 * * 1-5"    # Mon-Fri 8pm
      desiredReplicas: "20"
  # Batch processing window: scale up for ETL
  - type: cron
    metadata:
      timezone: UTC
      start: "0 2 * * *"     # 2am UTC every day
      end: "0 4 * * *"       # 4am UTC every day
      desiredReplicas: "30"
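When windows overlap, KEDA scales to the largest desiredReplicas among the currently active windows, falling back to minReplicaCount outside all of them. A simplified sketch of the manifest above (hour granularity, one timezone, so illustrative only; the real triggers use two different timezones):

```python
def cron_replicas(weekday: int, hour: int) -> int:
    """weekday 0=Mon .. 6=Sun; hours collapsed into a single timezone."""
    proposals = [2]  # minReplicaCount applies outside every window
    if weekday < 5 and 8 <= hour < 20:  # business hours, Mon-Fri
        proposals.append(20)
    if 2 <= hour < 4:                   # nightly ETL window
        proposals.append(30)
    return max(proposals)

print(cron_replicas(weekday=1, hour=10))  # 20: Tuesday mid-morning
print(cron_replicas(weekday=5, hour=3))   # 30: Saturday during the ETL window
print(cron_replicas(weekday=5, hour=12))  # 2: Saturday afternoon, baseline only
```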
# AWS EC2 Auto Scaling scheduled actions for non-Kubernetes workloads (Terraform)
resource "aws_autoscaling_schedule" "scale_up_morning" {
  scheduled_action_name  = "scale-up-morning"
  autoscaling_group_name = aws_autoscaling_group.workers.name
  recurrence             = "0 8 * * 1-5"  # Mon-Fri 8am UTC
  min_size               = 10
  max_size               = 100
  desired_capacity       = 20
}

resource "aws_autoscaling_schedule" "scale_down_night" {
  scheduled_action_name  = "scale-down-night"
  autoscaling_group_name = aws_autoscaling_group.workers.name
  recurrence             = "0 22 * * 1-5"  # Mon-Fri 10pm UTC
  min_size               = 2
  max_size               = 100
  desired_capacity       = 4
}

resource "aws_autoscaling_schedule" "scale_down_weekend" {
  scheduled_action_name  = "scale-down-weekend"
  autoscaling_group_name = aws_autoscaling_group.workers.name
  recurrence             = "0 0 * * 6"  # Saturday midnight
  min_size               = 1
  max_size               = 100
  desired_capacity       = 2
}
AWS Predictive Scaling
# AWS Predictive Scaling uses ML to forecast demand and pre-scale,
# eliminating the lag between a traffic spike and scale-up.
# Requires 14+ days of CloudWatch metrics history.
resource "aws_autoscaling_policy" "predictive" {
  name                   = "predictive-scaling"
  autoscaling_group_name = aws_autoscaling_group.web.name
  policy_type            = "PredictiveScaling"

  predictive_scaling_configuration {
    metric_specification {
      target_value = 70  # Target 70% CPU
      predefined_scaling_metric_specification {
        predefined_metric_type = "ASGAverageCPUUtilization"
      }
      predefined_load_metric_specification {
        predefined_metric_type = "ASGTotalCPUUtilization"
      }
    }
    scheduling_buffer_time       = 300  # Pre-scale 5 minutes early
    max_capacity_breach_behavior = "IncreaseMaxCapacity"
    max_capacity_buffer          = 10   # Allow 10% above max if needed
    mode                         = "ForecastAndScale"  # vs ForecastOnly
  }
}
Scale-to-Zero for Development Environments
# Use a Kubernetes CronJob to scale dev namespaces to zero at night
cat > scale-down.yaml << 'EOF'
apiVersion: batch/v1
kind: CronJob
metadata:
  name: dev-scale-down
  namespace: kube-system
spec:
  schedule: "0 19 * * 1-5"  # 7pm weekdays
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: scaler
          restartPolicy: OnFailure
          containers:
          - name: kubectl
            image: bitnami/kubectl:latest
            command:
            - /bin/sh
            - -c
            - |
              for ns in dev staging preview; do
                kubectl scale deployment --all --replicas=0 -n $ns
                kubectl scale statefulset --all --replicas=0 -n $ns
              done
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: dev-scale-up
  namespace: kube-system
spec:
  schedule: "0 8 * * 1-5"  # 8am weekdays
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: scaler
          restartPolicy: OnFailure
          containers:
          - name: kubectl
            image: bitnami/kubectl:latest
            command:
            - /bin/sh
            - -c
            - |
              for ns in dev staging; do
                kubectl scale deployment --all --replicas=1 -n $ns
              done
EOF
kubectl apply -f scale-down.yaml
# Expected savings for a 20-node dev cluster needed ~10hrs/day, 22 workdays/month:
# Without scaling:  20 nodes × 24hr × 30 days × $0.163/hr = $2,347/month
# With scale-down:  20 nodes × 10hr × 22 days × $0.163/hr = $717/month
# Saving: $1,630/month (69%)
Cost Impact of Auto-Scaling
Before auto-scaling (static fleet sized for peak):
  24 nodes running 24/7
  Monthly cost: 24 × $0.163/hr × 720hr = $2,817/month

After implementing HPA + KEDA + scheduled scaling, with a typical web-app traffic pattern:
  8am-8pm weekdays (12hr): 24 nodes needed
  8pm-8am weekdays (12hr): 8 nodes needed
  Weekends (48hr/week): 6 nodes needed

Actual cost calculation:
  Peak (12hr × 22 workdays): 24 nodes × $0.163 × 264hr = $1,033
  Off-peak weekdays: 8 nodes × $0.163 × 264hr = $344
  Weekends: 6 nodes × $0.163 × 192hr = $188
  Total: $1,565/month

Savings: $1,252/month (44%), without any application changes.

With scale-to-zero for dev/staging (additional):
  Dev cluster: $2,347 → $717 = -$1,630/month

Combined total saving: $2,882/month (56% of the $5,164 combined baseline)
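These figures are easy to verify; the script below reproduces them from the per-node rate (all inputs are the article's own assumptions, not billing data):

```python
RATE = 0.163  # $/node-hour, the illustrative on-demand rate used throughout

# Production fleet: static peak sizing vs. autoscaled
static = 24 * RATE * 720                                    # 24 nodes, 24/7
scaled = 24 * RATE * 264 + 8 * RATE * 264 + 6 * RATE * 192  # peak/off-peak/weekend
prod_saving = static - scaled

# Dev cluster with scheduled scale-to-zero
dev_static = 20 * RATE * 24 * 30
dev_scaled = 20 * RATE * 10 * 22
dev_saving = dev_static - dev_scaled

total_saving = prod_saving + dev_saving
total_base = static + dev_static
print(f"production: ${prod_saving:,.0f}/month saved")
print(f"dev:        ${dev_saving:,.0f}/month saved")
print(f"combined:   ${total_saving:,.0f}/month ({total_saving / total_base:.0%})")
```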
Conclusion
Auto-scaling is the infrastructure equivalent of just-in-time manufacturing: maintain only the capacity you need, when you need it. The combination of HPA for real-time CPU/memory scaling, KEDA for event-driven and queue-based scaling, cron-based schedules for predictable patterns, and scale-to-zero for non-production environments addresses every dimension of the over-provisioning problem.
The configurations above are solid starting points, but tune the thresholds, windows, and replica bounds to your own workload before relying on them. Implement in order of ROI: scale-to-zero for dev environments first (immediate savings, minimal risk), then scheduled scaling, then KEDA for queue workers, and finally advanced HPA tuning for production services.
Marcus Rodriguez
Lead DevOps Engineer specializing in CI/CD pipelines, container orchestration, and infrastructure automation.