The AWS Bill Problem
The average AWS customer wastes 32% of their cloud spend according to Flexera's 2025 State of the Cloud report. That is not a rounding error — it is one-third of your bill going to resources nobody is using, instances sized for peak load that run at 8% utilization, and data transfer fees that quietly compound every month.
The good news: cloud waste is highly recoverable. Unlike most cost problems, cloud overspend is largely technical, not political. The right tooling and the right architecture decisions can cut a $50,000/month AWS bill to $25,000 without any visible performance degradation.
Category 1: Compute Optimization
Strategy 1: Right-Size EC2 Instances
# Use AWS Compute Optimizer to identify right-sizing opportunities
aws compute-optimizer get-ec2-instance-recommendations --region us-east-1 --query 'instanceRecommendations[?finding==`OVER_PROVISIONED`].[instanceArn,currentInstanceType,recommendationOptions[0].instanceType,recommendationOptions[0].savingsOpportunity.estimatedMonthlySavings.value]' --output table
# Common pattern: r5.2xlarge (64GB RAM) running at 15% memory utilization
# Right-size to: r5.large (16GB RAM) — save $200-400/month per instance
# EC2 utilization benchmarks for right-sizing decisions:
# CPU < 40% average → downsize CPU
# Memory < 50% → downsize memory
# Network < 20% → check if network-optimized instance needed
# Disk IOPS < 30% → switch from io1 to gp3
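The thresholds above can be encoded as a small helper for scripting your own audits. This is an illustrative sketch, not an AWS API: the function name and action strings are made up, and the inputs are whatever average utilization percentages you pull from CloudWatch.

```python
def rightsizing_actions(cpu_pct, mem_pct, net_pct, iops_pct):
    """Map average utilization (0-100) to suggested right-sizing actions,
    using the benchmark thresholds above."""
    actions = []
    if cpu_pct < 40:
        actions.append("downsize CPU")
    if mem_pct < 50:
        actions.append("downsize memory")
    if net_pct < 20:
        actions.append("check if network-optimized instance is needed")
    if iops_pct < 30:
        actions.append("switch io1 to gp3")
    return actions
```

Feed it 14-day averages rather than point-in-time samples so a quiet weekend does not trigger a downsize of a weekday-heavy workload.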
Strategy 2: Graviton (ARM) Instances — Up to 40% Better Price-Performance
# Graviton3 (c7g, m7g, r7g) vs equivalent x86 Intel
# m7g.xlarge: $0.1632/hr (4 vCPU, 16GB)
# m6i.xlarge: $0.192/hr (4 vCPU, 16GB)
# Savings: 15% — but Graviton often outperforms on compute
# Graviton4 (c8g, m8g) released 2025 — 30% faster than Graviton3
# r8g.2xlarge: $0.4032/hr vs r6i.2xlarge: $0.504/hr
# Savings: 20% + better performance
# Test your workload on Graviton with minimal risk:
# 1. Spin up a Graviton instance
# 2. Deploy your Docker image (requires a multi-arch build — see the buildx command below)
# 3. Run load test and compare performance
# 4. If equal or better: migrate
# Multi-arch Docker build (required for Graviton):
docker buildx build --platform linux/amd64,linux/arm64 -t myapp:latest --push .
Strategy 3: Spot Instances for Fault-Tolerant Workloads
# Spot instances: up to 90% cheaper than on-demand
# Use for: batch jobs, CI/CD, ML training, dev environments, stateless services
# EKS Node Group with mixed On-Demand + Spot
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: production
  region: us-east-1
managedNodeGroups:
  - name: critical-ondemand
    instanceType: m7g.xlarge # Graviton
    minSize: 3
    maxSize: 3
    # On-demand for stateful/critical workloads
  - name: burst-spot
    instanceTypes: # Multiple types for better availability
      - m7g.xlarge
      - m7g.2xlarge
      - c7g.xlarge
      - c7g.2xlarge
      - m6g.xlarge
    spot: true
    minSize: 0
    maxSize: 50
    labels:
      workload-type: spot-tolerant
    taints:
      - key: spot
        value: "true"
        effect: NoSchedule
# Kubernetes deployment for spot-tolerant workload
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-workers
spec:
  replicas: 10
  selector:
    matchLabels:
      app: api-workers
  template:
    metadata:
      labels:
        app: api-workers
    spec:
      tolerations:
        - key: spot
          operator: Equal
          value: "true"
          effect: NoSchedule
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              preference:
                matchExpressions:
                  - key: workload-type
                    operator: In
                    values: [spot-tolerant]
      # containers omitted for brevity
Strategy 4: Savings Plans and Reserved Instances
Commitment discounts — the biggest single lever for steady-state workloads:
Compute Savings Plans (most flexible):
1-year, no upfront: 40% off on-demand
1-year, partial upfront: 43% off
1-year, all upfront: 45% off
3-year, all upfront: 66% off
EC2 Instance Savings Plans (less flexible, specific family):
Additional 5-10% vs Compute Savings Plans
Reserved Instances (specific AZ/region):
Standard RI (1-yr all upfront): 40-60% off depending on instance type
Best practice - "Right-sizing before reserving":
1. Run on-demand for 3 months
2. Analyze with Cost Explorer + Compute Optimizer
3. Right-size instances (Graviton if possible)
4. THEN purchase Savings Plans for the stable baseline
Never reserve before right-sizing! A 3-year RI on an oversized instance
is expensive insurance for a problem you should have fixed first.
Typical enterprise result:
On-demand EC2: $40,000/month
After right-sizing: $28,000/month
After 1-year Compute Savings Plan: $16,800/month
Total savings: 58%
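The stacking math is worth making explicit: the Savings Plan discount applies to the already right-sized bill, which is why the two steps compound to 58% rather than adding to 70%. A small sketch (the function name is illustrative):

```python
def stacked_savings(on_demand_monthly, rightsize_cut, sp_discount):
    """Right-size first, then apply the Savings Plan discount to the remainder."""
    after_rightsize = round(on_demand_monthly * (1 - rightsize_cut), 2)
    after_sp = round(after_rightsize * (1 - sp_discount), 2)
    total_pct = round(100 * (1 - after_sp / on_demand_monthly))
    return after_rightsize, after_sp, total_pct

# The enterprise example above: $40k on-demand, 30% cut from right-sizing,
# then a 1-year Compute Savings Plan at 40% off the remaining baseline
# stacked_savings(40000, 0.30, 0.40) -> (28000.0, 16800.0, 58)
```

This is also why reserving before right-sizing is so costly: the 40% discount would lock in the inflated $40k baseline instead of the $28k one.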
Category 2: Storage Optimization
Strategy 5: EBS Volume Audit — Stop Paying for Detached Volumes
# Find all unattached EBS volumes
aws ec2 describe-volumes --filters Name=status,Values=available --query 'Volumes[*].[VolumeId,Size,VolumeType,CreateTime]' --output table
# Find EBS snapshots taken before a cutoff date (e.g. ~90 days ago; adjust the date literal)
aws ec2 describe-snapshots --owner-ids self --query 'Snapshots[?StartTime<=`2025-12-01`].[SnapshotId,VolumeId,VolumeSize,StartTime]' --output table
# Delete unattached volumes (after confirming not needed!)
aws ec2 delete-volume --volume-id vol-xxxxxxxxx
# Automate cleanup with a Lambda. The sketch below only flags unattached
# volumes by creation age; review each candidate before actually deleting.
import boto3
from datetime import datetime, timezone

ec2 = boto3.client('ec2', region_name='us-east-1')
volumes = ec2.describe_volumes(Filters=[{'Name': 'status', 'Values': ['available']}])
for vol in volumes['Volumes']:
    age_days = (datetime.now(timezone.utc) - vol['CreateTime']).days
    if age_days > 30:
        print(f"Candidate for deletion: {vol['VolumeId']} ({vol['Size']}GB, {age_days} days old)")
Strategy 6: S3 Intelligent-Tiering for Unknown Access Patterns
# S3 storage class costs (per GB/month):
# Standard: $0.023
# Intelligent-Tiering: $0.023 (frequent) / $0.0125 (infrequent) / $0.004 (archive instant)
# Standard-IA: $0.0125 (+ $0.01/GB retrieval)
# Glacier Instant: $0.004 (+ $0.03/GB retrieval)
# Glacier Flexible: $0.0036 (3-5 hour retrieval)
# Glacier Deep Archive: $0.00099 (12 hour retrieval)
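A quick way to compare classes at your own access pattern, using the per-GB prices listed above. The function and tier tuples are illustrative names, not an AWS API, and monitoring fees plus Standard-IA's 30-day minimum storage duration are omitted for simplicity:

```python
def s3_monthly_cost(stored_gb, retrieved_gb, storage_price, retrieval_price):
    """Monthly cost for one storage class: storage plus retrieval."""
    return stored_gb * storage_price + retrieved_gb * retrieval_price

# (storage $/GB/month, retrieval $/GB) from the table above
STANDARD    = (0.023, 0.0)
STANDARD_IA = (0.0125, 0.01)

# 1TB stored, 100GB read back per month:
#   Standard:    1000*0.023              = $23.00
#   Standard-IA: 1000*0.0125 + 100*0.01  = $13.50
# Break-even: IA only loses if you re-read more than
# (0.023 - 0.0125) / 0.01 = 105% of the stored data each month.
```

That break-even is why Intelligent-Tiering is the safe default for unknown access patterns: it captures most of the IA discount without the retrieval-fee downside.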
# Enable Intelligent-Tiering on existing bucket
aws s3api put-bucket-intelligent-tiering-configuration --bucket my-data-bucket --id main-config --intelligent-tiering-configuration '{
"Id": "main-config",
"Status": "Enabled",
"Tierings": [
{"Days": 90, "AccessTier": "ARCHIVE_ACCESS"},
{"Days": 180, "AccessTier": "DEEP_ARCHIVE_ACCESS"}
]
}'
# S3 Lifecycle policy for log data (common savings: 70-80%)
aws s3api put-bucket-lifecycle-configuration --bucket application-logs --lifecycle-configuration '{
"Rules": [{
"ID": "log-tiering",
"Status": "Enabled",
"Filter": {"Prefix": "logs/"},
"Transitions": [
{"Days": 30, "StorageClass": "STANDARD_IA"},
{"Days": 90, "StorageClass": "GLACIER_IR"},
{"Days": 365, "StorageClass": "DEEP_ARCHIVE"}
],
"Expiration": {"Days": 730}
}]
}'
Strategy 7: EBS gp2 to gp3 Migration — Free Performance Upgrade
# gp3 is 20% cheaper than gp2 AND includes 3000 IOPS baseline (vs gp2's 100-3000 variable)
# gp2: $0.10/GB/month
# gp3: $0.08/GB/month + $0.005/IOPS above 3000 + $0.04/MB/s above 125MB/s
# Migrate all gp2 volumes to gp3
for vid in $(aws ec2 describe-volumes --filters Name=volume-type,Values=gp2 --query 'Volumes[*].VolumeId' --output text); do
  echo "Migrating $vid to gp3..."
  aws ec2 modify-volume --volume-id "$vid" --volume-type gp3
done
# No downtime required — modification happens live
# For 1TB gp2: $100/month → $80/month (save $20/month, $240/year per volume)
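The gp2 vs gp3 comparison above generalizes to any volume size and any provisioned performance. A sketch using the list prices quoted in this section (function name is illustrative):

```python
def ebs_monthly_cost(size_gb, volume_type, iops=3000, throughput_mbps=125):
    """Rough monthly EBS cost from the us-east-1 list prices above."""
    if volume_type == "gp2":
        return size_gb * 0.10
    if volume_type == "gp3":
        return (size_gb * 0.08
                + max(0, iops - 3000) * 0.005          # extra provisioned IOPS
                + max(0, throughput_mbps - 125) * 0.04)  # extra MB/s
    raise ValueError(f"unsupported type: {volume_type}")

# 1TB at baseline performance: gp2 = $100/month, gp3 = $80/month
# 1TB gp3 with 6000 IOPS: 80 + 3000*0.005 = $95/month (still under gp2)
```

Even a gp3 volume with double the IOPS of gp2's burst ceiling usually stays cheaper than the gp2 it replaces.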
Category 3: Database Cost Optimization
Strategy 8: RDS Multi-AZ Only When You Need It
RDS Multi-AZ costs exactly 2x the single-AZ price.
When Multi-AZ is worth it:
✓ Production databases serving customers
✓ RTO requirement under 2 minutes
✓ Regulatory requirement for HA
When Multi-AZ is NOT worth it (and where to save):
✗ Development databases → use single-AZ + snapshot backup
✗ Staging databases → single-AZ, restore from prod snapshot
✗ Analytics databases with offline ETL → snapshot-based recovery fine
✗ Read replicas → already provide read HA; primary can be single-AZ
Example savings:
db.r6g.2xlarge Multi-AZ: $1,540/month
db.r6g.2xlarge Single-AZ: $770/month
Save: $770/month for non-production workloads
Best practice for dev/staging:
- Stop instances on a schedule (AWS Instance Scheduler, or EventBridge + Lambda) at night (save ~60% of compute)
- Aurora Serverless v2 for dev: scales to 0 ACUs when idle, pay only for use
Strategy 9: Aurora Serverless v2 for Variable Workloads
resource "aws_rds_cluster" "api_db" {
  cluster_identifier = "api-db"
  engine             = "aurora-postgresql"
  engine_mode        = "provisioned"
  engine_version     = "16.4"

  serverlessv2_scaling_configuration {
    min_capacity = 0.5 # 0.5 ACUs = ~1GB RAM (minimum cost when idle)
    max_capacity = 32  # 32 ACUs = ~64GB RAM (max burst)
  }

  database_name               = "appdb"
  master_username             = "admin"
  manage_master_user_password = true
}

resource "aws_rds_cluster_instance" "api_db" {
  identifier         = "api-db-instance"
  cluster_identifier = aws_rds_cluster.api_db.id
  instance_class     = "db.serverless" # Required for Serverless v2
  engine             = aws_rds_cluster.api_db.engine
}
# Aurora Serverless v2 pricing:
# $0.12 per ACU-hour (us-east-1)
# At 0.5 ACU (idle): $0.06/hr = $1.44/day = $43/month minimum
# At 4 ACU (normal): $0.48/hr = $11.52/day = $346/month
# At 16 ACU (peak): $1.92/hr (only for short bursts)
#
# vs Provisioned db.r6g.2xlarge: $0.77/hr = $556/month constant
# Savings for variable workloads: 30-70%
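The per-hour figures above can be rolled into a monthly estimate for any usage profile. A sketch assuming a 720-hour month and the us-east-1 ACU price quoted above (the function name and profile format are illustrative):

```python
ACU_HOUR = 0.12  # Aurora Serverless v2, us-east-1

def aurora_sv2_monthly(profile):
    """profile: list of (acus, hours) pairs covering the month."""
    return sum(acus * hours * ACU_HOUR for acus, hours in profile)

# Idle all month at the 0.5 ACU floor (720h): ~$43
# Mixed month, 600h at 1 ACU + 120h of 8 ACU bursts: ~$187
provisioned_r6g_2xlarge = 0.77 * 720  # ~$554 constant, for comparison
```

If your computed monthly cost approaches the provisioned figure, the workload is not variable enough for Serverless v2 and a provisioned instance plus a Savings Plan wins.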
Category 4: Network and Data Transfer
Strategy 10: Data Transfer Cost Audit
# Data transfer is the sneakiest AWS cost — check your bill
# Common egress costs:
# EC2 to Internet: $0.09/GB (first 10TB/month)
# EC2 cross-AZ: $0.01/GB each direction (= $0.02/GB round trip)
# EC2 to S3 same region: FREE
# EC2 to CloudFront: FREE (origin fetch is free; CloudFront charges for edge delivery)
# NAT Gateway: $0.045/GB processed + $0.045/hr gateway charge
# Find your top data transfer costs
aws ce get-cost-and-usage --time-period Start=2026-02-01,End=2026-03-01 --granularity MONTHLY --filter '{"Dimensions": {"Key": "SERVICE", "Values": ["Amazon Elastic Compute Cloud - Compute"]}}' --group-by '[{"Type": "DIMENSION", "Key": "USAGE_TYPE"}]' --metrics BlendedCost --query "ResultsByTime[0].Groups[?contains(Keys[0], 'DataTransfer')].[Keys[0],Metrics.BlendedCost.Amount]" --output table
# NAT Gateway is frequently the #1 surprise cost
# Each AZ needs a NAT Gateway: $0.045/hr × 3 AZs = $97/month just for the gateways
# Plus $0.045/GB of traffic processed
# Solutions:
# 1. Use VPC Endpoints for S3 and DynamoDB (free, eliminates NAT for those services)
# 2. Use PrivateLink for other AWS services
# 3. Ensure instances download from S3 directly (not via NAT)
Strategy 11: VPC Endpoints to Eliminate NAT Gateway Costs
# S3 Gateway endpoint — FREE, eliminates $0.045/GB NAT processing for S3
resource "aws_vpc_endpoint" "s3" {
  vpc_id       = aws_vpc.main.id
  service_name = "com.amazonaws.us-east-1.s3"
  route_table_ids = [
    aws_route_table.private_a.id,
    aws_route_table.private_b.id,
    aws_route_table.private_c.id,
  ]
}

# DynamoDB Gateway endpoint — FREE
resource "aws_vpc_endpoint" "dynamodb" {
  vpc_id          = aws_vpc.main.id
  service_name    = "com.amazonaws.us-east-1.dynamodb"
  route_table_ids = [aws_route_table.private_a.id]
}

# ECR Interface endpoints — reduce NAT Gateway traffic for Docker pulls
resource "aws_vpc_endpoint" "ecr_api" {
  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.us-east-1.ecr.api"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = aws_subnet.private[*].id
  security_group_ids  = [aws_security_group.vpc_endpoints.id]
  private_dns_enabled = true
}

resource "aws_vpc_endpoint" "ecr_dkr" {
  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.us-east-1.ecr.dkr"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = aws_subnet.private[*].id
  security_group_ids  = [aws_security_group.vpc_endpoints.id]
  private_dns_enabled = true
}
# Typical result: teams pulling Docker images from ECR via NAT Gateway
# were paying $200-500/month in NAT Gateway processing fees alone.
# Interface endpoints cost $0.01/hr + $0.01/GB = much cheaper than NAT.
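The NAT-vs-endpoint trade can be estimated from the rates above. The sketch below compares only the NAT data-processing fee against interface endpoint charges (function names, AZ count, and the 720-hour month are assumptions; it ignores the NAT hourly charge you may be paying regardless):

```python
def nat_processing(gb_per_month):
    """NAT Gateway data-processing fee only, at $0.045/GB."""
    return gb_per_month * 0.045

def interface_endpoint(gb_per_month, azs=3, hours=720):
    """$0.01/hr per AZ for the endpoint ENIs + $0.01/GB processed."""
    return azs * hours * 0.01 + gb_per_month * 0.01

# 2TB/month of ECR pulls through 3 AZs:
#   via NAT:      2000 * 0.045            = $90.00
#   via endpoint: 3*720*0.01 + 2000*0.01  = $41.60
# Break-even: endpoints win above 21.6 / 0.035 ~ 617 GB/month
```

Below roughly 600GB/month the fixed ENI-hours dominate and a 3-AZ interface endpoint is not worth it; gateway endpoints (S3, DynamoDB) are free, so deploy those unconditionally.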
Category 5: Lambda and Serverless
Strategy 12: Lambda Memory Right-Sizing
# Lambda pricing: $0.0000166667 per GB-second
# More memory = more cost PER INVOCATION but faster execution
# Optimal memory = sweet spot of lowest GB-seconds consumed
# Use AWS Lambda Power Tuning (open source tool)
# It runs your function at multiple memory sizes and finds the optimum
# Manual approach: CloudWatch Insights query
# fields @timestamp, @billedDuration, @memorySize, @maxMemoryUsed
# | stats avg(@billedDuration) as avgDuration,
# avg(@maxMemoryUsed) as avgMemUsed,
# avg(@memorySize) as allocatedMem
# | filter @type = "REPORT"
# Cost formula: (memoryGB × durationSeconds × invocations) × $0.0000166667
# Example: Function at 512MB averaging 800ms, 1M invocations/month
# Cost: 0.5GB × 0.8s × 1,000,000 × $0.0000166667 = $6.67/month
# Same function at 1024MB averaging 350ms:
# Cost: 1.0GB × 0.35s × 1,000,000 × $0.0000166667 = $5.83/month
# Faster AND cheaper! More memory = faster cold starts + faster execution.
# Rule of thumb: double memory, check if duration halves
# If it does: same cost. If duration drops 60%+: cheaper AND faster.
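The cost formula above is simple enough to script, which makes the rule of thumb easy to check for any candidate memory setting. A sketch (the function name is illustrative; request charges of $0.20 per million invocations are excluded):

```python
LAMBDA_GB_SECOND = 0.0000166667  # us-east-1 price quoted above

def lambda_monthly_cost(memory_gb, avg_duration_s, invocations):
    """Compute-duration cost only; per-request charges excluded."""
    return memory_gb * avg_duration_s * invocations * LAMBDA_GB_SECOND

# The worked example above, 1M invocations/month:
#   512MB  @ 800ms: lambda_monthly_cost(0.5, 0.80, 1_000_000) ~ $6.67
#   1024MB @ 350ms: lambda_monthly_cost(1.0, 0.35, 1_000_000) ~ $5.83
```

Run it against the `avgDuration` and `allocatedMem` values from the CloudWatch Insights query above to compare your current setting with a doubled one.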
Strategy 13: Lambda Graviton2 for Free 20% Savings
resource "aws_lambda_function" "api" {
  filename      = "function.zip"
  function_name = "api-handler"
  role          = aws_iam_role.lambda.arn
  handler       = "main.handler"
  runtime       = "python3.12"

  # Graviton2 = 20% cheaper than x86 + better performance
  architectures = ["arm64"] # Change from default ["x86_64"]

  memory_size = 512
  timeout     = 30
}
# Python, Node.js, Java, Go, Ruby all support arm64 on Lambda
# Only requirement: your dependencies must be arm64 compatible
# For Python/Node: virtually all packages work without changes
Category 6: Kubernetes Cost Optimization
Strategy 14: Karpenter for Node Autoscaling
# Karpenter vs Cluster Autoscaler:
# - Karpenter provisions nodes in seconds (vs minutes for CA)
# - Karpenter consolidates nodes automatically (bin-packing)
# - Karpenter selects cheapest instance type for each workload
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: "karpenter.k8s.aws/instance-category"
          operator: In
          values: ["c", "m", "r"]
        - key: "karpenter.k8s.aws/instance-generation"
          operator: Gt
          values: ["5"]
        - key: "kubernetes.io/arch"
          operator: In
          values: ["arm64"] # Prefer Graviton
        - key: "karpenter.sh/capacity-type"
          operator: In
          values: ["spot", "on-demand"] # Try spot first
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      expireAfter: 720h # Recycle nodes every 30 days (lives under template.spec in the v1 API)
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized # v1 name; bin-pack and remove empty nodes
    consolidateAfter: 30s # Aggressive consolidation
Strategy 15: Vertical Pod Autoscaler (VPA) for Right-Sized Requests
# VPA analyzes actual resource usage and recommends/applies optimal requests
# Correct resource requests = correct node scheduling = no wasted node capacity
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: "Auto" # Automatically apply recommendations
    # "Off" = only recommend, "Initial" = apply only on new pods
  resourcePolicy:
    containerPolicies:
      - containerName: api
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: 4
          memory: 8Gi
        controlledResources: ["cpu", "memory"]
# Check VPA recommendations:
# kubectl describe vpa api-vpa
# Look for "Recommendation" section showing target requests/limits
Category 7: Monitoring and FinOps Tooling
Strategy 16: AWS Cost Anomaly Detection
resource "aws_ce_anomaly_monitor" "main" {
  name              = "aws-cost-monitor"
  monitor_type      = "DIMENSIONAL"
  monitor_dimension = "SERVICE"
}

resource "aws_ce_anomaly_subscription" "main" {
  name = "cost-anomaly-alerts"

  threshold_expression {
    dimension {
      key           = "ANOMALY_TOTAL_IMPACT_ABSOLUTE"
      values        = ["100"] # Alert if anomaly > $100
      match_options = ["GREATER_THAN_OR_EQUAL"]
    }
  }

  frequency        = "DAILY" # DAILY/WEEKLY summaries deliver to email
  monitor_arn_list = [aws_ce_anomaly_monitor.main.arn]

  subscriber {
    address = "devops@company.com"
    type    = "EMAIL"
  }

  # SNS delivery requires a separate subscription with frequency = "IMMEDIATE";
  # it cannot share a DAILY email subscription:
  # subscriber {
  #   address = aws_sns_topic.cost_alerts.arn
  #   type    = "SNS"
  # }
}
# AWS Budgets: hard limits with alerts
resource "aws_budgets_budget" "monthly" {
  name         = "monthly-cost-budget"
  budget_type  = "COST"
  limit_amount = "50000"
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = ["devops@company.com"]
  }

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 100
    threshold_type             = "PERCENTAGE"
    notification_type          = "FORECASTED"
    subscriber_email_addresses = ["cto@company.com"]
  }
}
Strategies 17-25: Quick Wins Reference
Strategy 17 — Stop idle resources at night
Lambda + EventBridge to stop dev/staging EC2 and RDS at 7pm, start at 7am
Save: 65% of dev/staging compute (16 hours off per day)
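The scheduler Lambda for Strategy 17 could look like the sketch below. Everything here is an assumption for illustration: the `Environment` tag key, the dev/staging values, and the function names; only standard EC2 describe/stop calls are used, and the filtering logic is kept as a pure function so it can be tested without AWS access.

```python
def stoppable_instance_ids(reservations, env_values=("dev", "staging")):
    """Filter a describe_instances() response down to running instances
    tagged Environment=dev or Environment=staging."""
    ids = []
    for res in reservations:
        for inst in res.get("Instances", []):
            if inst.get("State", {}).get("Name") != "running":
                continue
            tags = {t["Key"]: t["Value"] for t in inst.get("Tags", [])}
            if tags.get("Environment") in env_values:
                ids.append(inst["InstanceId"])
    return ids

def handler(event, context):
    """EventBridge target: cron(0 19 ? * MON-FRI *) to stop at 7pm."""
    import boto3  # only needed at Lambda runtime
    ec2 = boto3.client("ec2")
    ids = stoppable_instance_ids(ec2.describe_instances()["Reservations"])
    if ids:
        ec2.stop_instances(InstanceIds=ids)
    return {"stopped": ids}
```

A mirror-image handler (or the same one with a mode flag) calling `start_instances` on a 7am schedule completes the pair; RDS needs the analogous `stop_db_instance`/`start_db_instance` calls.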
Strategy 18 — Delete unused Elastic IPs
aws ec2 describe-addresses --query 'Addresses[?AssociationId==`null`]'
Since February 2024, every public IPv4 address (attached or not) bills at $0.005/hr = $3.65/month; unattached EIPs are pure waste (small but easy win)
Strategy 19 — S3 Request Cost Optimization
Use S3 Transfer Acceleration only where needed (adds cost)
Enable S3 Request Metrics to find buckets with excessive LIST/HEAD calls
Requester-pays for shared data sets accessed by multiple teams
Strategy 20 — CloudWatch Logs retention
Default: never expire. Set retention on all log groups.
aws logs put-retention-policy --log-group-name /aws/lambda/func --retention-in-days 30
CloudWatch Logs: $0.50/GB ingestion + $0.03/GB storage — adds up fast
Strategy 21 — Use CloudFront for S3 static content
EC2/S3 direct egress: $0.09/GB
CloudFront: $0.0085/GB (first 10TB, much cheaper for high-volume)
Also: fewer requests reach S3 → lower S3 request costs
Strategy 22 — Aurora auto-pause for dev databases
Aurora Serverless v2 with min 0 ACUs + auto-pause after 5 min idle
Dev database with 8 hours/day use: pay for 8 hours, not 24
Strategy 23 — Consolidate CloudTrail trails
Multi-region trails: $2/100k events. Single trail + S3 + Athena is cheaper than CloudWatch Logs
Disable management event logging in non-production accounts
Strategy 24 — Review NAT Gateway vs NAT Instance
NAT Gateway: $0.045/hr + $0.045/GB processed
NAT Instance (t4g.small): $0.0168/hr + no per-GB charge
For 500GB/month traffic: ~$55 (gateway) vs ~$12 (instance), saving roughly $43/month per AZ
Trade-off: a NAT instance is self-managed (patching, failover), so reserve it for dev/test VPCs
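The comparison is easy to rerun at your own traffic volume using the rates above (function names and the 720-hour month are illustrative):

```python
def nat_gateway_monthly(gb, hours=720):
    """Hourly gateway charge plus per-GB processing."""
    return hours * 0.045 + gb * 0.045

def nat_instance_monthly(hours=720, hourly=0.0168):
    """t4g.small on-demand rate; no per-GB charge."""
    return hours * hourly

# 500GB/month:  ~$54.90 (gateway) vs ~$12.10 (instance)
# 5TB/month:    ~$257.40 (gateway) vs ~$12.10 (instance)
```

At high volumes the gap widens dramatically because the per-GB fee dominates, which is why data-heavy VPCs feel NAT Gateway costs first.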
Strategy 25 — Tag everything, use Cost Allocation Tags
Without tags: you cannot attribute costs to teams/products
With tags: charge-back drives ownership, teams self-optimize
Required tags: Project, Environment, Owner, CostCenter
Putting It Together: The 90-Day Cost Optimization Sprint
Week 1-2: Discovery
- Enable Cost Explorer and set up tagging
- Run Compute Optimizer on all EC2 and Lambda
- Audit EBS volumes (find unattached)
- Audit EIPs and NAT Gateways
- Check S3 lifecycle policies
Week 3-4: Quick wins (no architecture changes)
- Delete unattached EBS volumes and old snapshots
- gp2 → gp3 migration (all volumes)
- Add S3 lifecycle policies
- Add CloudWatch Logs retention policies
- Stop dev/staging instances at night
Expected: 10-15% savings
Month 2: Right-sizing and Graviton
- Right-size top 20 EC2 instances
- Migrate Lambda to arm64 (Graviton)
- Add VPC Endpoints for S3/DynamoDB/ECR
- Evaluate Spot for dev/test workloads
Expected: additional 15-20% savings
Month 3: Commitments and architecture
- Purchase Savings Plans for stable baseline (after right-sizing)
- Migrate eligible dev RDS to Aurora Serverless v2
- Deploy Karpenter for Kubernetes workloads
- Implement VPA recommendations
Expected: additional 15-25% savings
Total after 90 days: 40-60% reduction
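The phase percentages above read as shares of the original bill, so they add rather than compound. A one-line sanity check of that reading (the variable names are illustrative):

```python
# (low, high) expected savings per phase, as % of the ORIGINAL bill
quick_wins  = (10, 15)  # weeks 3-4
rightsizing = (15, 20)  # month 2
commitments = (15, 25)  # month 3

low_total  = quick_wins[0] + rightsizing[0] + commitments[0]  # 40
high_total = quick_wins[1] + rightsizing[1] + commitments[1]  # 60
```

If you instead track each phase as a percentage of the then-current (already reduced) bill, multiply the remainders rather than summing the cuts.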
Conclusion
Cloud cost optimization is a continuous practice, not a one-time project. The 25 strategies above are roughly ordered by ROI — the quick wins in weeks 1-2 alone can save 10-15% with a few hours of work. Graviton migration, right-sizing, and Savings Plans then compound those savings significantly.
The most important practice is visibility: tag everything, set up anomaly detection, and review Cost Explorer weekly. Teams that treat cloud spend as an engineering metric — not just a finance concern — consistently outperform those that don't by 2-3x on cost efficiency.
Alex Thompson
CEO & Cloud Architecture Expert at ZeonEdge with 15+ years building enterprise infrastructure.