The Over-Provisioning Tax
Most cloud infrastructure is sized for peak load and left running 24/7. A typical web application has a traffic pattern that varies 10:1 between peak and off-peak. If you provision for peak, you are paying for 100% of capacity while using only a small fraction of it for most of the hours in a day. That is an enormous waste.
Auto-scaling eliminates this waste by matching capacity to demand in real time. The goal is not the lowest possible capacity — it is right-sizing across time: full capacity during peak, minimal capacity at 3am, and fast scale-up when a viral moment hits.
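Using the article's own rough figures (10% of peak load for 80% of the time, full load otherwise; both are illustrative assumptions, not measurements), the average utilization of a peak-sized fleet works out like this:

```python
# Average utilization of a fleet sized for peak, under a 10:1 demand curve:
# assume 10% of peak load for 80% of the hours, full load for the remaining 20%.
avg_util = 0.10 * 0.80 + 1.00 * 0.20
print(f"average utilization: {avg_util:.0%}")             # 28%
print(f"capacity paid for but idle: {1 - avg_util:.0%}")  # 72%
```

Even with this generous model, nearly three quarters of what you pay for sits idle.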
Kubernetes HPA: Beyond CPU Scaling
# Basic HPA (everyone has this)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2
  maxReplicas: 50
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60  # Scale at 60% CPU average
---
# Advanced HPA: multiple metrics + custom metrics
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa-advanced
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 100
  metrics:
  # CPU: primary signal
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
  # Memory: secondary signal
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 70
  # Custom metric from Prometheus: HTTP request queue depth
  - type: Pods
    pods:
      metric:
        name: http_requests_in_flight
      target:
        type: AverageValue
        averageValue: "10"  # Scale if avg > 10 in-flight requests per pod
  # Tuned behavior: scale up fast, scale down slow
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30   # Act on the highest recommendation of the last 30s
      policies:
      - type: Percent
        value: 100                     # Can double replicas in one step
        periodSeconds: 60
      - type: Pods
        value: 10                      # Or add up to 10 pods
        periodSeconds: 60
      selectPolicy: Max                # Use the policy that allows scaling the most
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 min before scale-down (avoids flapping)
      policies:
      - type: Percent
        value: 20                      # Remove at most 20% of replicas per 2 minutes
        periodSeconds: 120
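Both manifests drive the same control loop: for each metric the HPA proposes desiredReplicas = ceil(currentReplicas × currentValue / targetValue), acts on the largest proposal, and clamps it to the min/max bounds. A minimal sketch of that calculation (the function name is mine, not a Kubernetes API):

```python
from math import ceil

def hpa_desired(current_replicas: int, metrics: list[tuple[float, float]],
                min_replicas: int, max_replicas: int) -> int:
    """metrics: (current_value, target_value) pairs, e.g. CPU% or in-flight requests."""
    # The HPA evaluates every metric and uses the highest proposal.
    desired = max(ceil(current_replicas * cur / target) for cur, target in metrics)
    return max(min_replicas, min(max_replicas, desired))

# 10 pods at 90% CPU (target 60%) and 12 in-flight requests/pod (target 10):
print(hpa_desired(10, [(90, 60), (12, 10)], min_replicas=3, max_replicas=100))  # 15
```

Here the CPU metric wins: it proposes 15 replicas while the queue-depth metric proposes only 12. The behavior policies above then rate-limit how fast the controller is allowed to move toward that target.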
KEDA: Event-Driven Scaling to Zero
# KEDA (Kubernetes Event-Driven Autoscaling)
# - Scales from 0 to N based on external event sources
# - Supports 60+ scalers: Kafka, SQS, Redis, Prometheus, Cron, etc.
# - Scale to ZERO = pay nothing when idle

# Install KEDA
helm repo add kedacore https://kedacore.github.io/charts
helm install keda kedacore/keda --namespace keda --create-namespace --version 2.15.0

# Scale workers based on SQS queue depth
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor-scaler
spec:
  scaleTargetRef:
    name: order-processor
  minReplicaCount: 0   # Scale to ZERO when the queue is empty!
  maxReplicaCount: 50
  pollingInterval: 15  # Check queue depth every 15 seconds
  cooldownPeriod: 60   # Wait 60s after the last active trigger before scaling to zero
  triggers:
  - type: aws-sqs-queue
    metadata:
      queueURL: https://sqs.us-east-1.amazonaws.com/123456789/orders
      queueLength: "5"         # 1 pod per 5 messages in queue
      awsRegion: us-east-1
      identityOwner: operator  # Use the KEDA operator's pod identity (IRSA)
---
# Scale based on Kafka consumer lag
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: event-consumer-scaler
spec:
  scaleTargetRef:
    name: event-consumer
  minReplicaCount: 0
  maxReplicaCount: 30
  triggers:
  - type: kafka
    metadata:
      bootstrapServers: kafka-headless.kafka:9092
      consumerGroup: event-processors
      topic: user-events
      lagThreshold: "100"  # 1 pod per 100 messages of lag
      offsetResetPolicy: latest
---
# Scale Celery workers based on Redis queue length
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: celery-worker-scaler
spec:
  scaleTargetRef:
    name: celery-worker
  minReplicaCount: 0
  maxReplicaCount: 20
  triggers:
  - type: redis
    metadata:
      address: redis:6379
      listName: celery
      listLength: "10"  # 1 pod per 10 queued tasks
    authenticationRef:
      name: redis-auth
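Under the hood, KEDA feeds these triggers into the same HPA machinery: each trigger proposes roughly ceil(metricValue / threshold) replicas, the largest proposal wins, and the 0-to-1 activation step is handled by KEDA itself. A simplified single-trigger sketch (the helper name is illustrative, not KEDA's API):

```python
from math import ceil

def keda_replicas(queue_depth: int, per_pod_target: int,
                  min_replicas: int = 0, max_replicas: int = 50) -> int:
    # No pending work: KEDA deactivates the workload (scale to zero).
    if queue_depth == 0:
        return min_replicas
    # Otherwise the usual HPA math applies, clamped to maxReplicaCount.
    return max(1, min(max_replicas, ceil(queue_depth / per_pod_target)))

print(keda_replicas(0, 5))     # 0: queue empty, no pods
print(keda_replicas(37, 5))    # 8: 37 SQS messages at 5 per pod
print(keda_replicas(1000, 5))  # 50: capped at maxReplicaCount
```

The cap matters during backlog spikes: a burst of 1,000 messages does not spawn 200 pods, it saturates at the ceiling you set and drains the queue at that rate.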
Scheduled Scaling for Predictable Patterns
# Use cron-based scaling for known patterns (business hours, batch windows).
# KEDA cron scaler for business hours:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: api-cron-scaler
spec:
  scaleTargetRef:
    name: api
  minReplicaCount: 2
  maxReplicaCount: 50
  triggers:
  # Business hours: more replicas
  - type: cron
    metadata:
      timezone: America/New_York
      start: "0 8 * * 1-5"   # Mon-Fri 8am
      end: "0 20 * * 1-5"    # Mon-Fri 8pm
      desiredReplicas: "20"
  # Batch processing window: scale up for ETL
  - type: cron
    metadata:
      timezone: UTC
      start: "0 2 * * *"     # 2am UTC every day
      end: "0 4 * * *"       # 4am UTC every day
      desiredReplicas: "30"
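When windows overlap, KEDA scales to the largest desiredReplicas among the currently active windows, falling back to minReplicaCount outside all of them. A simplified sketch of the manifest above (hour granularity, one timezone, so illustrative only; the real triggers use two different timezones):

```python
def cron_replicas(weekday: int, hour: int) -> int:
    """weekday 0=Mon .. 6=Sun; hours collapsed into a single timezone."""
    proposals = [2]  # minReplicaCount applies outside every window
    if weekday < 5 and 8 <= hour < 20:  # business hours, Mon-Fri
        proposals.append(20)
    if 2 <= hour < 4:                   # nightly ETL window
        proposals.append(30)
    return max(proposals)

print(cron_replicas(weekday=1, hour=10))  # 20: Tuesday mid-morning
print(cron_replicas(weekday=5, hour=3))   # 30: Saturday during the ETL window
print(cron_replicas(weekday=5, hour=12))  # 2: Saturday afternoon, baseline only
```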
# AWS EC2 Auto Scaling scheduled actions for non-Kubernetes workloads (Terraform)
resource "aws_autoscaling_schedule" "scale_up_morning" {
  scheduled_action_name  = "scale-up-morning"
  autoscaling_group_name = aws_autoscaling_group.workers.name
  recurrence             = "0 8 * * 1-5"  # Mon-Fri 8am UTC
  min_size               = 10
  max_size               = 100
  desired_capacity       = 20
}

resource "aws_autoscaling_schedule" "scale_down_night" {
  scheduled_action_name  = "scale-down-night"
  autoscaling_group_name = aws_autoscaling_group.workers.name
  recurrence             = "0 22 * * 1-5"  # Mon-Fri 10pm UTC
  min_size               = 2
  max_size               = 100
  desired_capacity       = 4
}

resource "aws_autoscaling_schedule" "scale_down_weekend" {
  scheduled_action_name  = "scale-down-weekend"
  autoscaling_group_name = aws_autoscaling_group.workers.name
  recurrence             = "0 0 * * 6"  # Saturday midnight
  min_size               = 1
  max_size               = 100
  desired_capacity       = 2
}
AWS Predictive Scaling
# AWS Predictive Scaling uses ML to forecast demand and pre-scale,
# eliminating the lag between a traffic spike and scale-up.
# Requires 14+ days of CloudWatch metrics history.
resource "aws_autoscaling_policy" "predictive" {
  name                   = "predictive-scaling"
  autoscaling_group_name = aws_autoscaling_group.web.name
  policy_type            = "PredictiveScaling"

  predictive_scaling_configuration {
    metric_specification {
      target_value = 70  # Target 70% CPU
      predefined_scaling_metric_specification {
        predefined_metric_type = "ASGAverageCPUUtilization"
      }
      predefined_load_metric_specification {
        predefined_metric_type = "ASGTotalCPUUtilization"
      }
    }
    scheduling_buffer_time       = 300  # Pre-scale 5 minutes early
    max_capacity_breach_behavior = "IncreaseMaxCapacity"
    max_capacity_buffer          = 10   # Allow 10% above max if needed
    mode                         = "ForecastAndScale"  # vs ForecastOnly
  }
}
Scale-to-Zero for Development Environments
# Use a Kubernetes CronJob to scale dev namespaces to zero at night
cat > scale-down.yaml << 'EOF'
apiVersion: batch/v1
kind: CronJob
metadata:
  name: dev-scale-down
  namespace: kube-system
spec:
  schedule: "0 19 * * 1-5"  # 7pm weekdays
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: scaler
          restartPolicy: OnFailure
          containers:
          - name: kubectl
            image: bitnami/kubectl:latest
            command:
            - /bin/sh
            - -c
            - |
              for ns in dev staging preview; do
                kubectl scale deployment --all --replicas=0 -n $ns
                kubectl scale statefulset --all --replicas=0 -n $ns
              done
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: dev-scale-up
  namespace: kube-system
spec:
  schedule: "0 8 * * 1-5"  # 8am weekdays
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: scaler
          restartPolicy: OnFailure
          containers:
          - name: kubectl
            image: bitnami/kubectl:latest
            command:
            - /bin/sh
            - -c
            - |
              for ns in dev staging; do
                kubectl scale deployment --all --replicas=1 -n $ns
              done
EOF
kubectl apply -f scale-down.yaml
# Expected savings for a 20-node dev cluster needed ~10hrs/day, 22 workdays/month:
# Without scaling:  20 nodes × 24hr × 30 days × $0.163/hr = $2,347/month
# With scale-down:  20 nodes × 10hr × 22 days × $0.163/hr = $717/month
# Saving: $1,630/month (69%)
Cost Impact of Auto-Scaling
Before auto-scaling (static fleet sized for peak):
  24 nodes running 24/7
  Monthly cost: 24 × $0.163/hr × 720hr = $2,817/month

After implementing HPA + KEDA + scheduled scaling, with a typical web-app traffic pattern:
  8am-8pm weekdays (12hr): 24 nodes needed
  8pm-8am weekdays (12hr): 8 nodes needed
  Weekends (48hr/week): 6 nodes needed

Actual cost calculation:
  Peak (12hr × 22 workdays): 24 nodes × $0.163 × 264hr = $1,033
  Off-peak weekdays: 8 nodes × $0.163 × 264hr = $344
  Weekends: 6 nodes × $0.163 × 192hr = $188
  Total: $1,565/month

Savings: $1,252/month (44%), without any application changes.

With scale-to-zero for dev/staging (additional):
  Dev cluster: $2,347 → $717 = -$1,630/month

Combined total saving: $2,882/month (56% of the $5,164 combined baseline)
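These figures are easy to verify; the script below reproduces them from the per-node rate (all inputs are the article's own assumptions, not billing data):

```python
RATE = 0.163  # $/node-hour, the illustrative on-demand rate used throughout

# Production fleet: static peak sizing vs. autoscaled
static = 24 * RATE * 720                                    # 24 nodes, 24/7
scaled = 24 * RATE * 264 + 8 * RATE * 264 + 6 * RATE * 192  # peak/off-peak/weekend
prod_saving = static - scaled

# Dev cluster with scheduled scale-to-zero
dev_static = 20 * RATE * 24 * 30
dev_scaled = 20 * RATE * 10 * 22
dev_saving = dev_static - dev_scaled

total_saving = prod_saving + dev_saving
total_base = static + dev_static
print(f"production: ${prod_saving:,.0f}/month saved")
print(f"dev:        ${dev_saving:,.0f}/month saved")
print(f"combined:   ${total_saving:,.0f}/month ({total_saving / total_base:.0%})")
```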
Conclusion
Auto-scaling is the infrastructure equivalent of just-in-time manufacturing: maintain only the capacity you need, when you need it. The combination of HPA for real-time CPU/memory scaling, KEDA for event-driven and queue-based scaling, cron-based schedules for predictable patterns, and scale-to-zero for non-production environments addresses every dimension of the over-provisioning problem.
The configurations above are solid starting points, but tune the thresholds, windows, and replica bounds to your own workload before relying on them. Implement in order of ROI: scale-to-zero for dev environments first (immediate savings, minimal risk), then scheduled scaling, then KEDA for queue workers, and finally advanced HPA tuning for production services.
Marcus Rodriguez
Lead DevOps Engineer specializing in CI/CD pipelines, container orchestration, and infrastructure automation.