Every startup follows the same cloud cost trajectory: launch on a generous free tier, scale up to a few hundred dollars per month as you find product-market fit, then watch in horror as costs compound to $20K, $40K, $80K/month as traffic grows. By the time the CEO asks "why is our AWS bill so high?", the architecture has calcified around expensive defaults, and optimization feels like untangling a knot of yarn while wearing oven mitts.
This article tells the story of a real engagement (client details anonymized) where we helped a Series A SaaS startup reduce their AWS bill from $47,200/month to $12,100/month, a 74% reduction, while actually improving performance. We'll cover every technique we used, the order we applied them in, and the results of each change.
Step 1: The Cost Audit - Where Is the Money Going?
Before optimizing anything, you need to understand your current spending. AWS Cost Explorer is useful for high-level trends, but for actionable detail, you need to tag every resource and break down costs by service, environment, and team.
This client's $47,200/month broke down as:
EC2 instances: $18,400 (39%) - 14 instances ranging from t3.xlarge to m5.4xlarge, running 24/7
RDS (PostgreSQL): $8,200 (17%) - two db.r5.2xlarge Multi-AZ instances
ElastiCache (Redis): $3,800 (8%) - cache.r5.xlarge Multi-AZ
NAT Gateway: $4,100 (9%) - data transfer through NAT
S3 + CloudFront: $3,200 (7%) - media storage and CDN
ECS/Fargate: $3,900 (8%) - background workers
Data Transfer: $2,600 (6%) - cross-AZ and internet egress
Other: $3,000 (6%) - CloudWatch, Route53, WAF, etc.
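A breakdown like this is easy to get wrong by hand, so it's worth validating that the per-service figures actually sum to the bill and that the percentages match. A minimal sketch, using the (anonymized) numbers from this engagement:

```python
# Validate a per-service cost breakdown against the total monthly bill.
costs = {
    "EC2": 18400, "RDS": 8200, "ElastiCache": 3800, "NAT Gateway": 4100,
    "S3 + CloudFront": 3200, "ECS/Fargate": 3900, "Data Transfer": 2600,
    "Other": 3000,
}

def breakdown(costs: dict) -> dict:
    """Return each service's share of total spend, rounded to a whole percent."""
    total = sum(costs.values())
    return {svc: round(100 * c / total) for svc, c in costs.items()}

shares = breakdown(costs)
assert sum(costs.values()) == 47200  # matches the total bill above
assert shares["EC2"] == 39           # matches the 39% reported above
```

In practice these numbers come out of Cost Explorer grouped by your cost-allocation tags; the dict here just stands in for that export.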
Step 2: Rightsizing - The Biggest Bang for Zero Effort
Rightsizing means matching instance sizes to actual utilization. AWS Compute Optimizer and CloudWatch metrics revealed that most EC2 instances were running at 8-15% average CPU utilization. This is a classic pattern: developers provision for peak load, but peak load happens 2% of the time.
Changes made:
Downgraded 8 instances from m5.xlarge (4 vCPU, 16GB) to t3.medium (2 vCPU, 4GB). CPU utilization went from 12% to 45%, still plenty of headroom. Monthly savings: $3,840.
Downgraded the primary RDS instance from db.r5.2xlarge (8 vCPU, 64GB) to db.r6g.xlarge (4 vCPU, 32GB, Graviton). The r5.2xlarge had 6% average CPU and 18% memory utilization. The Graviton instance is both cheaper and faster. Monthly savings: $2,400.
Downgraded ElastiCache from cache.r5.xlarge to cache.r6g.large (Graviton). Monthly savings: $1,200.
Total rightsizing savings: $7,440/month (16% of original bill)
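The decision rule behind these downgrades can be sketched as a crude heuristic: assume the same load scales linearly onto fewer vCPUs and check whether the result still leaves headroom. This is not Compute Optimizer's actual model (and the 45% observed above also reflects t3 burstable-baseline accounting), just the back-of-envelope version:

```python
def post_downsize_cpu(avg_cpu_pct: float, vcpu_from: int, vcpu_to: int) -> float:
    """Rough estimate of average CPU after moving the same load to fewer vCPUs,
    assuming linear scaling (an approximation, not a guarantee)."""
    return avg_cpu_pct * vcpu_from / vcpu_to

def safe_to_downsize(avg_cpu_pct: float, vcpu_from: int, vcpu_to: int,
                     headroom_pct: float = 60.0) -> bool:
    """Flag a downsize as safe when estimated utilization stays under the threshold."""
    return post_downsize_cpu(avg_cpu_pct, vcpu_from, vcpu_to) <= headroom_pct

# The m5.xlarge fleet above: 12% average on 4 vCPUs -> ~24% estimated on 2 vCPUs.
assert safe_to_downsize(12, 4, 2)
```

Always validate against peak utilization and memory, not just average CPU, before committing a change like this.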
Step 3: Reserved Instances and Savings Plans
After rightsizing, the remaining instances are correctly sized and will run for at least the next 12 months. This makes them perfect candidates for Reserved Instances (RIs) or Compute Savings Plans (CSPs).
We purchased 1-year No Upfront Compute Savings Plans for the steady-state compute. The commitment covers EC2, Fargate, and Lambda with automatic application to the cheapest matching usage. Compared to on-demand pricing, this provided a 36% discount.
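The economics of a Savings Plan are a simple blend: usage up to your hourly commitment gets the discount, and anything above it bills at on-demand rates. A sketch of that math, using the 36% figure from this engagement (actual discounts vary by term, payment option, and instance family):

```python
def effective_monthly(on_demand_monthly: float, committed: float,
                      discount: float = 0.36) -> float:
    """Blended monthly cost: committed usage gets the Savings Plan discount,
    overflow above the commitment stays at on-demand rates."""
    covered = min(on_demand_monthly, committed)
    overflow = on_demand_monthly - covered
    return covered * (1 - discount) + overflow
```

This is why you commit only to the steady-state floor after rightsizing: an unused commitment is billed whether you run the compute or not.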
For RDS, we purchased 1-year Reserved Instances for the production database (always running). The staging database stays on-demand since it's only used during business hours.
Total RI/Savings Plan savings: $8,600/month (additional 18% of original bill)
Step 4: NAT Gateway - The Silent Budget Killer
NAT Gateway is one of the most expensive AWS services per GB, and most teams don't realize how much they're paying. At $0.045/GB processed plus $0.045/hour per gateway, this client's NAT Gateway was processing 80+ TB/month of traffic, most of it S3 and DynamoDB API calls from private subnets.
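To see why the data-processing charge dominates, the monthly cost works out like this (us-east-1 list pricing as quoted above; check your region):

```python
def nat_gateway_monthly_cost(tb_processed: float, gateways: int = 1,
                             per_gb: float = 0.045, per_hour: float = 0.045,
                             hours: int = 730) -> float:
    """NAT Gateway monthly cost: per-GB data processing plus per-hour charge.
    730 approximates the hours in a month."""
    return tb_processed * 1024 * per_gb + gateways * per_hour * hours

# 80 TB/month through a single gateway: ~$3,686 in processing + ~$33 hourly.
cost = nat_gateway_monthly_cost(80)
```

At this volume the hourly charge is a rounding error; the per-GB processing fee is the entire problem, which is what the endpoint fix below attacks.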
The fix was simple: add VPC Gateway Endpoints for S3 and DynamoDB. Gateway endpoints are free: they route traffic directly from your VPC to the AWS service without going through the NAT Gateway. We also added Interface Endpoints for ECR (to avoid pulling container images through NAT) and CloudWatch Logs.
NAT Gateway savings: $3,200/month (7% of original bill)
Step 5: Spot Instances for Background Workers
The client ran background workers (email processing, report generation, data pipeline jobs) on Fargate with on-demand pricing. These workloads are fault-tolerant: if a worker is interrupted, the job retries. This makes them ideal for Spot Instances, which offer 60-90% discounts in exchange for the possibility of 2-minute interruption notices.
We migrated background workers from Fargate to EC2 Spot instances using a diversified fleet strategy (multiple instance types and AZs to reduce interruption probability). We also implemented SQS-based job queues with automatic retry to handle any interruptions gracefully.
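The interruption-handling pattern is simple: a worker that receives an interruption notice gives its in-flight job back to the queue instead of losing it. A minimal pure-Python simulation of that idea (with SQS the same effect comes for free from the visibility timeout: an unacknowledged message simply reappears for another worker):

```python
import queue

def run_worker(jobs: "queue.Queue", process, interrupted=lambda: False):
    """Drain a job queue; on a (simulated) Spot interruption, re-queue the
    current job so another worker can pick it up, then stop."""
    done = []
    while not jobs.empty():
        job = jobs.get()
        if interrupted():
            jobs.put(job)  # hand the job back instead of dropping it
            break
        done.append(process(job))
    return done

q = queue.Queue()
for i in range(3):
    q.put(i)
flags = iter([False, False, True])  # interruption arrives before the third job
results = run_worker(q, lambda j: j * 2, interrupted=lambda: next(flags))
```

The real workers poll SQS and the EC2 instance metadata interruption endpoint; this sketch only shows the requeue-on-interrupt logic.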
Spot instance savings: $2,800/month (6% of original bill)
Step 6: Architecture Changes - S3 Storage Classes and Caching
The client stored 12TB of user-uploaded media in S3 Standard. Analysis showed that 80% of files hadn't been accessed in 90+ days. We implemented S3 Intelligent-Tiering, which automatically moves infrequently accessed objects to cheaper storage tiers. For files older than 180 days, we added a lifecycle rule to move them to S3 Glacier Instant Retrieval.
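The 180-day Glacier Instant Retrieval rule can be expressed as a lifecycle configuration in the shape boto3's `put_bucket_lifecycle_configuration` expects. The `media/` prefix below is a hypothetical stand-in for the client's upload path:

```python
# Lifecycle rule sketch: move objects older than 180 days to
# Glacier Instant Retrieval. Prefix is an assumption for illustration.
lifecycle = {
    "Rules": [
        {
            "ID": "media-to-glacier-ir",
            "Status": "Enabled",
            "Filter": {"Prefix": "media/"},
            "Transitions": [
                {"Days": 180, "StorageClass": "GLACIER_IR"},
            ],
        }
    ]
}
```

Intelligent-Tiering handled the 90-day access pattern automatically; the lifecycle rule only covers the long tail where Glacier IR's lower storage price wins.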
We also added CloudFront caching in front of the API for read-heavy endpoints (product catalog, user profiles), reducing origin requests by 70% and saving on both compute and data transfer.
Storage and caching savings: $2,100/month (4% of original bill)
Step 7: Environment Scheduling - Stop Paying for Development at 3 AM
The staging and development environments ran 24/7. Nobody deploys to staging at 3 AM on a Saturday. We implemented AWS Instance Scheduler to automatically stop non-production environments outside business hours (8 AM - 8 PM, Monday-Friday). This reduced non-production compute hours by 65%.
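The scheduling decision itself is a one-liner. AWS Instance Scheduler drives this from cron-style periods in DynamoDB, but the logic it evaluates is essentially this sketch:

```python
from datetime import datetime

def should_run(now: datetime, start_hour: int = 8, end_hour: int = 20) -> bool:
    """True during business hours (Mon-Fri, 8 AM-8 PM local time);
    non-production instances are stopped outside this window."""
    return now.weekday() < 5 and start_hour <= now.hour < end_hour

assert should_run(datetime(2024, 3, 6, 10))      # Wednesday 10 AM: running
assert not should_run(datetime(2024, 3, 9, 10))  # Saturday: stopped
```

Twelve hours a day, five days a week is 60 of 168 weekly hours, which is where the roughly 65% reduction in non-production compute hours comes from.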
Environment scheduling savings: $2,800/month (6% of original bill)
Step 8: Observability Cost Reduction
CloudWatch costs were $1,800/month, mostly from verbose application logging that nobody read. We reduced log verbosity in production (INFO instead of DEBUG), set log retention to 30 days instead of "never delete," and moved metrics to a self-hosted Prometheus + Grafana stack on a single t3.medium instance ($30/month). This eliminated most CloudWatch costs while actually improving the monitoring experience.
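The reason verbosity mattered more than retention: CloudWatch Logs charges far more per GB ingested than per GB stored. A sketch with us-east-1 list prices at the time of writing (hedge: verify current pricing for your region):

```python
def cloudwatch_logs_monthly(gb_ingested: float, gb_stored: float,
                            ingest_per_gb: float = 0.50,
                            storage_per_gb: float = 0.03) -> float:
    """CloudWatch Logs monthly cost: ingestion dominates, so cutting DEBUG
    output saves far more than trimming retention does."""
    return gb_ingested * ingest_per_gb + gb_stored * storage_per_gb

# 3 TB/month of DEBUG-level logs costs ~$1,500 in ingestion alone.
assert cloudwatch_logs_monthly(3000, 0) == 1500.0
```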
Observability savings: $1,600/month (3% of original bill)
Results Summary
Here's the complete breakdown of savings:
Original monthly bill: $47,200
After rightsizing: -$7,440 (16%)
After RIs/Savings Plans: -$8,600 (18%)
After NAT Gateway optimization: -$3,200 (7%)
After Spot instances: -$2,800 (6%)
After storage/caching: -$2,100 (4%)
After environment scheduling: -$2,800 (6%)
After observability reduction: -$1,600 (3%)
Remaining minor optimizations: -$6,560 (14%)
-----------------------------------------
New monthly bill: $12,100 (74% reduction)
The total annual savings: $421,200. The engagement cost (3 weeks of our time): $18,000. ROI: 23x in the first year.
Making Cost Optimization Stick
One-time optimization is worthless if costs creep back up. We implemented three cultural practices to keep costs under control permanently:
Cost alerts: AWS Budgets alerts when spending exceeds 80% of the monthly target. The CTO and engineering lead get notified immediately.
Cost-per-team dashboards: Using resource tagging, each team can see their infrastructure costs. This creates awareness and healthy competition.
Architecture review cost check: Every architecture decision that adds a new AWS service requires a cost estimate as part of the design review. "What will this cost at 10x our current traffic?" prevents surprises.
ZeonEdge provides cloud cost optimization audits and ongoing FinOps consulting. We typically find 40-70% savings for companies spending $10K+/month on cloud. Get a free cost assessment.
Alex Thompson
CEO & Cloud Architecture Expert at ZeonEdge with 15+ years building enterprise infrastructure.