The AWS Bill Problem
The average AWS customer wastes 32% of their cloud spend according to Flexera's 2025 State of the Cloud report. That is not a rounding error — it is one-third of your bill going to resources nobody is using, instances sized for peak load that run at 8% utilization, and data transfer fees that quietly compound every month.
The good news: cloud waste is highly recoverable. Unlike most cost problems, cloud overspend is largely technical, not political. The right tooling and the right architecture decisions can cut a $50,000/month AWS bill to $25,000 without any visible performance degradation.
Category 1: Compute Optimization
Strategy 1: Right-Size EC2 Instances
# Use AWS Compute Optimizer to identify right-sizing opportunities
aws compute-optimizer get-ec2-instance-recommendations --region us-east-1 --query 'instanceRecommendations[?finding==`OVER_PROVISIONED`].[instanceArn,currentInstanceType,recommendationOptions[0].instanceType,recommendationOptions[0].savingsOpportunity.estimatedMonthlySavings.value]' --output table
# Common pattern: r5.2xlarge (64GB RAM) running at 15% memory utilization
# Right-size to: r5.large (16GB RAM) — save $200-400/month per instance
# EC2 utilization benchmarks for right-sizing decisions:
# CPU < 40% average → downsize CPU
# Memory < 50% → downsize memory
# Network < 20% → check if network-optimized instance needed
# Disk IOPS < 30% → switch from io1 to gp3
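The thresholds above can be encoded as a small helper for scripting your own audits. This is an illustrative sketch, not an AWS API: the function name and action strings are made up, and the inputs are whatever average utilization percentages you pull from CloudWatch.

```python
def rightsizing_actions(cpu_pct, mem_pct, net_pct, iops_pct):
    """Map average utilization (0-100) to suggested right-sizing actions,
    using the benchmark thresholds above."""
    actions = []
    if cpu_pct < 40:
        actions.append("downsize CPU")
    if mem_pct < 50:
        actions.append("downsize memory")
    if net_pct < 20:
        actions.append("check if network-optimized instance is needed")
    if iops_pct < 30:
        actions.append("switch io1 to gp3")
    return actions
```

Feed it 14-day averages rather than point-in-time samples so a quiet weekend does not trigger a downsize of a weekday-heavy workload.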
Strategy 2: Graviton (ARM) Instances — Up to 40% Better Price-Performance
# Graviton3 (c7g, m7g, r7g) vs equivalent x86 Intel
# m7g.xlarge: $0.1632/hr (4 vCPU, 16GB)
# m6i.xlarge: $0.192/hr (4 vCPU, 16GB)
# Savings: 15% — but Graviton often outperforms on compute
# Graviton4 (c8g, m8g) released 2025 — 30% faster than Graviton3
# r8g.2xlarge: $0.4032/hr vs r6i.2xlarge: $0.504/hr
# Savings: 20% + better performance
# Test your workload on Graviton with minimal risk:
# 1. Spin up a Graviton instance
# 2. Deploy your Docker image (requires a multi-arch build — see the buildx command below)
# 3. Run load test and compare performance
# 4. If equal or better: migrate
# Multi-arch Docker build (required for Graviton):
docker buildx build --platform linux/amd64,linux/arm64 -t myapp:latest --push .
Strategy 3: Spot Instances for Fault-Tolerant Workloads
# Spot instances: up to 90% cheaper than on-demand
# Use for: batch jobs, CI/CD, ML training, dev environments, stateless services
# EKS Node Group with mixed On-Demand + Spot
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: production
  region: us-east-1
managedNodeGroups:
  - name: critical-ondemand
    instanceType: m7g.xlarge # Graviton
    minSize: 3
    maxSize: 3
    # On-demand for stateful/critical workloads
  - name: burst-spot
    instanceTypes: # Multiple types for better availability
      - m7g.xlarge
      - m7g.2xlarge
      - c7g.xlarge
      - c7g.2xlarge
      - m6g.xlarge
    spot: true
    minSize: 0
    maxSize: 50
    labels:
      workload-type: spot-tolerant
    taints:
      - key: spot
        value: "true"
        effect: NoSchedule
# Kubernetes deployment for spot-tolerant workload
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-workers
spec:
  replicas: 10
  selector:
    matchLabels:
      app: api-workers
  template:
    metadata:
      labels:
        app: api-workers
    spec:
      tolerations:
        - key: spot
          operator: Equal
          value: "true"
          effect: NoSchedule
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              preference:
                matchExpressions:
                  - key: workload-type
                    operator: In
                    values: [spot-tolerant]
      # containers omitted for brevity
Strategy 4: Savings Plans and Reserved Instances
Commitment discounts — the biggest single lever for steady-state workloads:
Compute Savings Plans (most flexible):
1-year, no upfront: 40% off on-demand
1-year, partial upfront: 43% off
1-year, all upfront: 45% off
3-year, all upfront: 66% off
EC2 Instance Savings Plans (less flexible, specific family):
Additional 5-10% vs Compute Savings Plans
Reserved Instances (specific AZ/region):
Standard RI (1-yr all upfront): 40-60% off depending on instance type
Best practice - "Right-sizing before reserving":
1. Run on-demand for 3 months
2. Analyze with Cost Explorer + Compute Optimizer
3. Right-size instances (Graviton if possible)
4. THEN purchase Savings Plans for the stable baseline
Never reserve before right-sizing! A 3-year RI on an oversized instance
is expensive insurance for a problem you should have fixed first.
Typical enterprise result:
On-demand EC2: $40,000/month
After right-sizing: $28,000/month
After 1-year Compute Savings Plan: $16,800/month
Total savings: 58%
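The stacking math is worth making explicit: the Savings Plan discount applies to the already right-sized bill, which is why the two steps compound to 58% rather than adding to 70%. A small sketch (the function name is illustrative):

```python
def stacked_savings(on_demand_monthly, rightsize_cut, sp_discount):
    """Right-size first, then apply the Savings Plan discount to the remainder."""
    after_rightsize = round(on_demand_monthly * (1 - rightsize_cut), 2)
    after_sp = round(after_rightsize * (1 - sp_discount), 2)
    total_pct = round(100 * (1 - after_sp / on_demand_monthly))
    return after_rightsize, after_sp, total_pct

# The enterprise example above: $40k on-demand, 30% cut from right-sizing,
# then a 1-year Compute Savings Plan at 40% off the remaining baseline
# stacked_savings(40000, 0.30, 0.40) -> (28000.0, 16800.0, 58)
```

This is also why reserving before right-sizing is so costly: the 40% discount would lock in the inflated $40k baseline instead of the $28k one.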
Category 2: Storage Optimization
Strategy 5: EBS Volume Audit — Stop Paying for Detached Volumes
# Find all unattached EBS volumes
aws ec2 describe-volumes --filters Name=status,Values=available --query 'Volumes[*].[VolumeId,Size,VolumeType,CreateTime]' --output table
# Find EBS snapshots taken before a cutoff date (e.g. ~90 days ago; adjust the date literal)
aws ec2 describe-snapshots --owner-ids self --query 'Snapshots[?StartTime<=`2025-12-01`].[SnapshotId,VolumeId,VolumeSize,StartTime]' --output table
# Delete unattached volumes (after confirming not needed!)
aws ec2 delete-volume --volume-id vol-xxxxxxxxx
# Automate cleanup with a Lambda. The sketch below only flags unattached
# volumes by creation age; review each candidate before actually deleting.
import boto3
from datetime import datetime, timezone

ec2 = boto3.client('ec2', region_name='us-east-1')
volumes = ec2.describe_volumes(Filters=[{'Name': 'status', 'Values': ['available']}])
for vol in volumes['Volumes']:
    age_days = (datetime.now(timezone.utc) - vol['CreateTime']).days
    if age_days > 30:
        print(f"Candidate for deletion: {vol['VolumeId']} ({vol['Size']}GB, {age_days} days old)")
Strategy 6: S3 Intelligent-Tiering for Unknown Access Patterns
# S3 storage class costs (per GB/month):
# Standard: $0.023
# Intelligent-Tiering: $0.023 (frequent) / $0.0125 (infrequent) / $0.004 (archive instant)
# Standard-IA: $0.0125 (+ $0.01/GB retrieval)
# Glacier Instant: $0.004 (+ $0.03/GB retrieval)
# Glacier Flexible: $0.0036 (3-5 hour retrieval)
# Glacier Deep Archive: $0.00099 (12 hour retrieval)
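A quick way to compare classes at your own access pattern, using the per-GB prices listed above. The function and tier tuples are illustrative names, not an AWS API, and monitoring fees plus Standard-IA's 30-day minimum storage duration are omitted for simplicity:

```python
def s3_monthly_cost(stored_gb, retrieved_gb, storage_price, retrieval_price):
    """Monthly cost for one storage class: storage plus retrieval."""
    return stored_gb * storage_price + retrieved_gb * retrieval_price

# (storage $/GB/month, retrieval $/GB) from the table above
STANDARD    = (0.023, 0.0)
STANDARD_IA = (0.0125, 0.01)

# 1TB stored, 100GB read back per month:
#   Standard:    1000*0.023              = $23.00
#   Standard-IA: 1000*0.0125 + 100*0.01  = $13.50
# Break-even: IA only loses if you re-read more than
# (0.023 - 0.0125) / 0.01 = 105% of the stored data each month.
```

That break-even is why Intelligent-Tiering is the safe default for unknown access patterns: it captures most of the IA discount without the retrieval-fee downside.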
# Enable Intelligent-Tiering on existing bucket
aws s3api put-bucket-intelligent-tiering-configuration --bucket my-data-bucket --id main-config --intelligent-tiering-configuration '{
"Id": "main-config",
"Status": "Enabled",
"Tierings": [
{"Days": 90, "AccessTier": "ARCHIVE_ACCESS"},
{"Days": 180, "AccessTier": "DEEP_ARCHIVE_ACCESS"}
]
}'
# S3 Lifecycle policy for log data (common savings: 70-80%)
aws s3api put-bucket-lifecycle-configuration --bucket application-logs --lifecycle-configuration '{
"Rules": [{
"ID": "log-tiering",
"Status": "Enabled",
"Filter": {"Prefix": "logs/"},
"Transitions": [
{"Days": 30, "StorageClass": "STANDARD_IA"},
{"Days": 90, "StorageClass": "GLACIER_IR"},
{"Days": 365, "StorageClass": "DEEP_ARCHIVE"}
],
"Expiration": {"Days": 730}
}]
}'
Strategy 7: EBS gp2 to gp3 Migration — Free Performance Upgrade
# gp3 is 20% cheaper than gp2 AND includes 3000 IOPS baseline (vs gp2's 100-3000 variable)
# gp2: $0.10/GB/month
# gp3: $0.08/GB/month + $0.005/IOPS above 3000 + $0.04/MB/s above 125MB/s
# Migrate all gp2 volumes to gp3
for vid in $(aws ec2 describe-volumes --filters Name=volume-type,Values=gp2 --query 'Volumes[*].VolumeId' --output text); do
  echo "Migrating $vid to gp3..."
  aws ec2 modify-volume --volume-id "$vid" --volume-type gp3
done
# No downtime required — modification happens live
# For 1TB gp2: $100/month → $80/month (save $20/month, $240/year per volume)
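The gp2 vs gp3 comparison above generalizes to any volume size and any provisioned performance. A sketch using the list prices quoted in this section (function name is illustrative):

```python
def ebs_monthly_cost(size_gb, volume_type, iops=3000, throughput_mbps=125):
    """Rough monthly EBS cost from the us-east-1 list prices above."""
    if volume_type == "gp2":
        return size_gb * 0.10
    if volume_type == "gp3":
        return (size_gb * 0.08
                + max(0, iops - 3000) * 0.005          # extra provisioned IOPS
                + max(0, throughput_mbps - 125) * 0.04)  # extra MB/s
    raise ValueError(f"unsupported type: {volume_type}")

# 1TB at baseline performance: gp2 = $100/month, gp3 = $80/month
# 1TB gp3 with 6000 IOPS: 80 + 3000*0.005 = $95/month (still under gp2)
```

Even a gp3 volume with double the IOPS of gp2's burst ceiling usually stays cheaper than the gp2 it replaces.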
Category 3: Database Cost Optimization
Strategy 8: RDS Multi-AZ Only When You Need It
RDS Multi-AZ costs exactly 2x the single-AZ price.
When Multi-AZ is worth it:
✓ Production databases serving customers
✓ RTO requirement under 2 minutes
✓ Regulatory requirement for HA
When Multi-AZ is NOT worth it (and where to save):
✗ Development databases → use single-AZ + snapshot backup
✗ Staging databases → single-AZ, restore from prod snapshot
✗ Analytics databases with offline ETL → snapshot-based recovery fine
✗ Read replicas → already provide read HA; primary can be single-AZ
Example savings:
db.r6g.2xlarge Multi-AZ: $1,540/month
db.r6g.2xlarge Single-AZ: $770/month
Save: $770/month for non-production workloads
Best practice for dev/staging:
- Stop instances on a schedule (AWS Instance Scheduler, or EventBridge + Lambda) at night (save ~60% of compute)
- Aurora Serverless v2 for dev: scales to 0 ACUs when idle, pay only for use
Strategy 9: Aurora Serverless v2 for Variable Workloads
resource "aws_rds_cluster" "api_db" {
  cluster_identifier = "api-db"
  engine             = "aurora-postgresql"
  engine_mode        = "provisioned"
  engine_version     = "16.4"

  serverlessv2_scaling_configuration {
    min_capacity = 0.5 # 0.5 ACUs = ~1GB RAM (minimum cost when idle)
    max_capacity = 32  # 32 ACUs = ~64GB RAM (max burst)
  }

  database_name               = "appdb"
  master_username             = "admin"
  manage_master_user_password = true
}

resource "aws_rds_cluster_instance" "api_db" {
  identifier         = "api-db-instance"
  cluster_identifier = aws_rds_cluster.api_db.id
  instance_class     = "db.serverless" # Required for Serverless v2
  engine             = aws_rds_cluster.api_db.engine
}
# Aurora Serverless v2 pricing:
# $0.12 per ACU-hour (us-east-1)
# At 0.5 ACU (idle): $0.06/hr = $1.44/day = $43/month minimum
# At 4 ACU (normal): $0.48/hr = $11.52/day = $346/month
# At 16 ACU (peak): $1.92/hr (only for short bursts)
#
# vs Provisioned db.r6g.2xlarge: $0.77/hr = $556/month constant
# Savings for variable workloads: 30-70%
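The per-hour figures above can be rolled into a monthly estimate for any usage profile. A sketch assuming a 720-hour month and the us-east-1 ACU price quoted above (the function name and profile format are illustrative):

```python
ACU_HOUR = 0.12  # Aurora Serverless v2, us-east-1

def aurora_sv2_monthly(profile):
    """profile: list of (acus, hours) pairs covering the month."""
    return sum(acus * hours * ACU_HOUR for acus, hours in profile)

# Idle all month at the 0.5 ACU floor (720h): ~$43
# Mixed month, 600h at 1 ACU + 120h of 8 ACU bursts: ~$187
provisioned_r6g_2xlarge = 0.77 * 720  # ~$554 constant, for comparison
```

If your computed monthly cost approaches the provisioned figure, the workload is not variable enough for Serverless v2 and a provisioned instance plus a Savings Plan wins.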
Category 4: Network and Data Transfer
Strategy 10: Data Transfer Cost Audit
# Data transfer is the sneakiest AWS cost — check your bill
# Common egress costs:
# EC2 to Internet: $0.09/GB (first 10TB/month)
# EC2 cross-AZ: $0.01/GB each direction (= $0.02/GB round trip)
# EC2 to S3 same region: FREE
# EC2 to CloudFront: FREE (origin fetch is free; CloudFront charges for edge delivery)
# NAT Gateway: $0.045/GB processed + $0.045/hr gateway charge
# Find your top data transfer costs
aws ce get-cost-and-usage --time-period Start=2026-02-01,End=2026-03-01 --granularity MONTHLY --filter '{"Dimensions": {"Key": "SERVICE", "Values": ["Amazon Elastic Compute Cloud - Compute"]}}' --group-by '[{"Type": "DIMENSION", "Key": "USAGE_TYPE"}]' --metrics BlendedCost --query "ResultsByTime[0].Groups[?contains(Keys[0], 'DataTransfer')].[Keys[0],Metrics.BlendedCost.Amount]" --output table
# NAT Gateway is frequently the #1 surprise cost
# Each AZ needs a NAT Gateway: $0.045/hr × 3 AZs = $97/month just for the gateways
# Plus $0.045/GB of traffic processed
# Solutions:
# 1. Use VPC Endpoints for S3 and DynamoDB (free, eliminates NAT for those services)
# 2. Use PrivateLink for other AWS services
# 3. Ensure instances download from S3 directly (not via NAT)
Strategy 11: VPC Endpoints to Eliminate NAT Gateway Costs
# S3 Gateway endpoint — FREE, eliminates $0.045/GB NAT processing for S3
resource "aws_vpc_endpoint" "s3" {
  vpc_id       = aws_vpc.main.id
  service_name = "com.amazonaws.us-east-1.s3"
  route_table_ids = [
    aws_route_table.private_a.id,
    aws_route_table.private_b.id,
    aws_route_table.private_c.id,
  ]
}

# DynamoDB Gateway endpoint — FREE
resource "aws_vpc_endpoint" "dynamodb" {
  vpc_id          = aws_vpc.main.id
  service_name    = "com.amazonaws.us-east-1.dynamodb"
  route_table_ids = [aws_route_table.private_a.id]
}

# ECR Interface endpoints — reduce NAT Gateway traffic for Docker pulls
resource "aws_vpc_endpoint" "ecr_api" {
  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.us-east-1.ecr.api"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = aws_subnet.private[*].id
  security_group_ids  = [aws_security_group.vpc_endpoints.id]
  private_dns_enabled = true
}

resource "aws_vpc_endpoint" "ecr_dkr" {
  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.us-east-1.ecr.dkr"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = aws_subnet.private[*].id
  security_group_ids  = [aws_security_group.vpc_endpoints.id]
  private_dns_enabled = true
}
# Typical result: teams pulling Docker images from ECR via NAT Gateway
# were paying $200-500/month in NAT Gateway processing fees alone.
# Interface endpoints cost $0.01/hr + $0.01/GB = much cheaper than NAT.
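The NAT-vs-endpoint trade can be estimated from the rates above. The sketch below compares only the NAT data-processing fee against interface endpoint charges (function names, AZ count, and the 720-hour month are assumptions; it ignores the NAT hourly charge you may be paying regardless):

```python
def nat_processing(gb_per_month):
    """NAT Gateway data-processing fee only, at $0.045/GB."""
    return gb_per_month * 0.045

def interface_endpoint(gb_per_month, azs=3, hours=720):
    """$0.01/hr per AZ for the endpoint ENIs + $0.01/GB processed."""
    return azs * hours * 0.01 + gb_per_month * 0.01

# 2TB/month of ECR pulls through 3 AZs:
#   via NAT:      2000 * 0.045            = $90.00
#   via endpoint: 3*720*0.01 + 2000*0.01  = $41.60
# Break-even: endpoints win above 21.6 / 0.035 ~ 617 GB/month
```

Below roughly 600GB/month the fixed ENI-hours dominate and a 3-AZ interface endpoint is not worth it; gateway endpoints (S3, DynamoDB) are free, so deploy those unconditionally.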
Category 5: Lambda and Serverless
Strategy 12: Lambda Memory Right-Sizing
# Lambda pricing: $0.0000166667 per GB-second
# More memory = more cost PER INVOCATION but faster execution
# Optimal memory = sweet spot of lowest GB-seconds consumed
# Use AWS Lambda Power Tuning (open source tool)
# It runs your function at multiple memory sizes and finds the optimum
# Manual approach: CloudWatch Insights query
# fields @timestamp, @billedDuration, @memorySize, @maxMemoryUsed
# | stats avg(@billedDuration) as avgDuration,
# avg(@maxMemoryUsed) as avgMemUsed,
# avg(@memorySize) as allocatedMem
# | filter @type = "REPORT"
# Cost formula: (memoryGB × durationSeconds × invocations) × $0.0000166667
# Example: Function at 512MB averaging 800ms, 1M invocations/month
# Cost: 0.5GB × 0.8s × 1,000,000 × $0.0000166667 = $6.67/month
# Same function at 1024MB averaging 350ms:
# Cost: 1.0GB × 0.35s × 1,000,000 × $0.0000166667 = $5.83/month
# Faster AND cheaper! More memory = faster cold starts + faster execution.
# Rule of thumb: double memory, check if duration halves
# If it does: same cost. If duration drops 60%+: cheaper AND faster.
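The cost formula above is simple enough to script, which makes the rule of thumb easy to check for any candidate memory setting. A sketch (the function name is illustrative; request charges of $0.20 per million invocations are excluded):

```python
LAMBDA_GB_SECOND = 0.0000166667  # us-east-1 price quoted above

def lambda_monthly_cost(memory_gb, avg_duration_s, invocations):
    """Compute-duration cost only; per-request charges excluded."""
    return memory_gb * avg_duration_s * invocations * LAMBDA_GB_SECOND

# The worked example above, 1M invocations/month:
#   512MB  @ 800ms: lambda_monthly_cost(0.5, 0.80, 1_000_000) ~ $6.67
#   1024MB @ 350ms: lambda_monthly_cost(1.0, 0.35, 1_000_000) ~ $5.83
```

Run it against the `avgDuration` and `allocatedMem` values from the CloudWatch Insights query above to compare your current setting with a doubled one.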
Strategy 13: Lambda Graviton2 for Free 20% Savings
resource "aws_lambda_function" "api" {
  filename      = "function.zip"
  function_name = "api-handler"
  role          = aws_iam_role.lambda.arn
  handler       = "main.handler"
  runtime       = "python3.12"

  # Graviton2 = 20% cheaper than x86 + better performance
  architectures = ["arm64"] # Change from default ["x86_64"]

  memory_size = 512
  timeout     = 30
}
# Python, Node.js, Java, Go, Ruby all support arm64 on Lambda
# Only requirement: your dependencies must be arm64 compatible
# For Python/Node: virtually all packages work without changes
Category 6: Kubernetes Cost Optimization
Strategy 14: Karpenter for Node Autoscaling
# Karpenter vs Cluster Autoscaler:
# - Karpenter provisions nodes in seconds (vs minutes for CA)
# - Karpenter consolidates nodes automatically (bin-packing)
# - Karpenter selects cheapest instance type for each workload
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: "karpenter.k8s.aws/instance-category"
          operator: In
          values: ["c", "m", "r"]
        - key: "karpenter.k8s.aws/instance-generation"
          operator: Gt
          values: ["5"]
        - key: "kubernetes.io/arch"
          operator: In
          values: ["arm64"] # Prefer Graviton
        - key: "karpenter.sh/capacity-type"
          operator: In
          values: ["spot", "on-demand"] # Try spot first
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      expireAfter: 720h # Recycle nodes every 30 days (lives under template.spec in the v1 API)
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized # v1 name; bin-pack and remove empty nodes
    consolidateAfter: 30s # Aggressive consolidation
Strategy 15: Vertical Pod Autoscaler (VPA) for Right-Sized Requests
# VPA analyzes actual resource usage and recommends/applies optimal requests
# Correct resource requests = correct node scheduling = no wasted node capacity
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: "Auto" # Automatically apply recommendations
    # "Off" = only recommend, "Initial" = apply only on new pods
  resourcePolicy:
    containerPolicies:
      - containerName: api
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: 4
          memory: 8Gi
        controlledResources: ["cpu", "memory"]
# Check VPA recommendations:
# kubectl describe vpa api-vpa
# Look for "Recommendation" section showing target requests/limits
Category 7: Monitoring and FinOps Tooling
Strategy 16: AWS Cost Anomaly Detection
resource "aws_ce_anomaly_monitor" "main" {
  name              = "aws-cost-monitor"
  monitor_type      = "DIMENSIONAL"
  monitor_dimension = "SERVICE"
}

resource "aws_ce_anomaly_subscription" "main" {
  name = "cost-anomaly-alerts"

  threshold_expression {
    dimension {
      key           = "ANOMALY_TOTAL_IMPACT_ABSOLUTE"
      values        = ["100"] # Alert if anomaly > $100
      match_options = ["GREATER_THAN_OR_EQUAL"]
    }
  }

  frequency        = "DAILY" # DAILY/WEEKLY summaries deliver to email
  monitor_arn_list = [aws_ce_anomaly_monitor.main.arn]

  subscriber {
    address = "devops@company.com"
    type    = "EMAIL"
  }

  # SNS delivery requires a separate subscription with frequency = "IMMEDIATE";
  # it cannot share a DAILY email subscription:
  # subscriber {
  #   address = aws_sns_topic.cost_alerts.arn
  #   type    = "SNS"
  # }
}
# AWS Budgets: hard limits with alerts
resource "aws_budgets_budget" "monthly" {
  name         = "monthly-cost-budget"
  budget_type  = "COST"
  limit_amount = "50000"
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = ["devops@company.com"]
  }

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 100
    threshold_type             = "PERCENTAGE"
    notification_type          = "FORECASTED"
    subscriber_email_addresses = ["cto@company.com"]
  }
}
Strategies 17-25: Quick Wins Reference
Strategy 17 — Stop idle resources at night
Lambda + EventBridge to stop dev/staging EC2 and RDS at 7pm, start at 7am
Save: 65% of dev/staging compute (16 hours off per day)
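The scheduler Lambda for Strategy 17 could look like the sketch below. Everything here is an assumption for illustration: the `Environment` tag key, the dev/staging values, and the function names; only standard EC2 describe/stop calls are used, and the filtering logic is kept as a pure function so it can be tested without AWS access.

```python
def stoppable_instance_ids(reservations, env_values=("dev", "staging")):
    """Filter a describe_instances() response down to running instances
    tagged Environment=dev or Environment=staging."""
    ids = []
    for res in reservations:
        for inst in res.get("Instances", []):
            if inst.get("State", {}).get("Name") != "running":
                continue
            tags = {t["Key"]: t["Value"] for t in inst.get("Tags", [])}
            if tags.get("Environment") in env_values:
                ids.append(inst["InstanceId"])
    return ids

def handler(event, context):
    """EventBridge target: cron(0 19 ? * MON-FRI *) to stop at 7pm."""
    import boto3  # only needed at Lambda runtime
    ec2 = boto3.client("ec2")
    ids = stoppable_instance_ids(ec2.describe_instances()["Reservations"])
    if ids:
        ec2.stop_instances(InstanceIds=ids)
    return {"stopped": ids}
```

A mirror-image handler (or the same one with a mode flag) calling `start_instances` on a 7am schedule completes the pair; RDS needs the analogous `stop_db_instance`/`start_db_instance` calls.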
Strategy 18 — Delete unused Elastic IPs
aws ec2 describe-addresses --query 'Addresses[?AssociationId==`null`]'
Since February 2024, every public IPv4 address (attached or not) bills at $0.005/hr = $3.65/month; unattached EIPs are pure waste (small but easy win)
Strategy 19 — S3 Request Cost Optimization
Use S3 Transfer Acceleration only where needed (adds cost)
Enable S3 Request Metrics to find buckets with excessive LIST/HEAD calls
Requester-pays for shared data sets accessed by multiple teams
Strategy 20 — CloudWatch Logs retention
Default: never expire. Set retention on all log groups.
aws logs put-retention-policy --log-group-name /aws/lambda/func --retention-in-days 30
CloudWatch Logs: $0.50/GB ingestion + $0.03/GB storage — adds up fast
Strategy 21 — Use CloudFront for S3 static content
EC2/S3 direct egress: $0.09/GB
CloudFront: $0.0085/GB (first 10TB, much cheaper for high-volume)
Also: fewer requests reach S3 → lower S3 request costs
Strategy 22 — Aurora auto-pause for dev databases
Aurora Serverless v2 with min 0 ACUs + auto-pause after 5 min idle
Dev database with 8 hours/day use: pay for 8 hours, not 24
Strategy 23 — Consolidate CloudTrail trails
Multi-region trails: $2/100k events. Single trail + S3 + Athena is cheaper than CloudWatch Logs
Disable management event logging in non-production accounts
Strategy 24 — Review NAT Gateway vs NAT Instance
NAT Gateway: $0.045/hr + $0.045/GB processed
NAT Instance (t4g.small): $0.0168/hr + no per-GB charge
For 500GB/month traffic: ~$55 (gateway) vs ~$12 (instance), saving roughly $43/month per AZ
Trade-off: a NAT instance is self-managed (patching, failover), so reserve it for dev/test VPCs
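The comparison is easy to rerun at your own traffic volume using the rates above (function names and the 720-hour month are illustrative):

```python
def nat_gateway_monthly(gb, hours=720):
    """Hourly gateway charge plus per-GB processing."""
    return hours * 0.045 + gb * 0.045

def nat_instance_monthly(hours=720, hourly=0.0168):
    """t4g.small on-demand rate; no per-GB charge."""
    return hours * hourly

# 500GB/month:  ~$54.90 (gateway) vs ~$12.10 (instance)
# 5TB/month:    ~$257.40 (gateway) vs ~$12.10 (instance)
```

At high volumes the gap widens dramatically because the per-GB fee dominates, which is why data-heavy VPCs feel NAT Gateway costs first.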
Strategy 25 — Tag everything, use Cost Allocation Tags
Without tags: you cannot attribute costs to teams/products
With tags: charge-back drives ownership, teams self-optimize
Required tags: Project, Environment, Owner, CostCenter
Putting It Together: The 90-Day Cost Optimization Sprint
Week 1-2: Discovery
- Enable Cost Explorer and set up tagging
- Run Compute Optimizer on all EC2 and Lambda
- Audit EBS volumes (find unattached)
- Audit EIPs and NAT Gateways
- Check S3 lifecycle policies
Week 3-4: Quick wins (no architecture changes)
- Delete unattached EBS volumes and old snapshots
- gp2 → gp3 migration (all volumes)
- Add S3 lifecycle policies
- Add CloudWatch Logs retention policies
- Stop dev/staging instances at night
Expected: 10-15% savings
Month 2: Right-sizing and Graviton
- Right-size top 20 EC2 instances
- Migrate Lambda to arm64 (Graviton)
- Add VPC Endpoints for S3/DynamoDB/ECR
- Evaluate Spot for dev/test workloads
Expected: additional 15-20% savings
Month 3: Commitments and architecture
- Purchase Savings Plans for stable baseline (after right-sizing)
- Migrate eligible dev RDS to Aurora Serverless v2
- Deploy Karpenter for Kubernetes workloads
- Implement VPA recommendations
Expected: additional 15-25% savings
Total after 90 days: 40-60% reduction
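The phase percentages above read as shares of the original bill, so they add rather than compound. A one-line sanity check of that reading (the variable names are illustrative):

```python
# (low, high) expected savings per phase, as % of the ORIGINAL bill
quick_wins  = (10, 15)  # weeks 3-4
rightsizing = (15, 20)  # month 2
commitments = (15, 25)  # month 3

low_total  = quick_wins[0] + rightsizing[0] + commitments[0]  # 40
high_total = quick_wins[1] + rightsizing[1] + commitments[1]  # 60
```

If you instead track each phase as a percentage of the then-current (already reduced) bill, multiply the remainders rather than summing the cuts.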
Conclusion
Cloud cost optimization is a continuous practice, not a one-time project. The 25 strategies above are roughly ordered by ROI — the quick wins in weeks 1-2 alone can save 10-15% with a few hours of work. Graviton migration, right-sizing, and Savings Plans then compound those savings significantly.
The most important practice is visibility: tag everything, set up anomaly detection, and review Cost Explorer weekly. Teams that treat cloud spend as an engineering metric — not just a finance concern — consistently outperform those that don't by 2-3x on cost efficiency.
Alex Thompson
CEO & Cloud Architecture Expert at ZeonEdge with 15+ years building enterprise infrastructure.