The Reactive vs Proactive Cost Problem
Most organizations review cloud costs once a month when the bill arrives. By then, a rogue process that ran for three weeks, an accidentally-left-on GPU cluster, or an infinite retry loop has already cost thousands of dollars. The damage is done and unrecoverable.
AWS Budgets and Cost Anomaly Detection make cost control proactive: set a budget threshold and get alerted before you exceed it, detect statistically anomalous spend patterns within about 24 hours, and automatically apply service control policies when a budget is exceeded. The goal is to catch problems in hours, not weeks.
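To make the "hours, not weeks" idea concrete, here is a minimal sketch of the arithmetic behind a forecast-style alert. It assumes a plain linear run rate, which is not how AWS computes its own forecasts, and the function names `project_month_end` and `days_until_breach` are hypothetical helpers invented for illustration.

```python
import calendar
from datetime import date
from typing import Optional


def project_month_end(spend_to_date: float, today: date) -> float:
    """Linearly extrapolate month-to-date spend to the end of the month."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    daily_rate = spend_to_date / today.day
    return daily_rate * days_in_month


def days_until_breach(spend_to_date: float, budget: float, today: date) -> Optional[float]:
    """Days until the budget is reached at the current run rate.

    Returns 0.0 if the budget is already exceeded and None if there is
    no spend yet (so no run rate to extrapolate from).
    """
    if spend_to_date >= budget:
        return 0.0
    daily_rate = spend_to_date / today.day
    if daily_rate <= 0:
        return None
    return (budget - spend_to_date) / daily_rate


if __name__ == "__main__":
    today = date(2024, 6, 10)
    # $2,400 spent in the first 10 days of a 30-day month, $5,000 budget
    print(project_month_end(2400.0, today))   # 7200.0
    print(days_until_breach(2400.0, 5000.0, today))
```

In this example the run rate of $240 per day points to a $7,200 month against a $5,000 budget, which is the kind of signal a monthly bill review would only surface after the money is spent.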
AWS Budgets Setup
# Terraform: Create budget types for comprehensive coverage
# 1. Monthly total cost budget
resource "aws_budgets_budget" "total_monthly" {
name = "total-monthly-cost"
budget_type = "COST"
limit_amount = "5000"
limit_unit = "USD"
time_unit = "MONTHLY"
# No cost_filter block: a budget without filters covers every service
notification {
comparison_operator = "GREATER_THAN"
threshold = 80 # Alert at 80% of budget
threshold_type = "PERCENTAGE"
notification_type = "ACTUAL"
subscriber_email_addresses = ["finops-team@company.com"]
subscriber_sns_topic_arns = [aws_sns_topic.budget_alerts.arn]
}
notification {
comparison_operator = "GREATER_THAN"
threshold = 100 # Alert when budget exceeded
threshold_type = "PERCENTAGE"
notification_type = "ACTUAL"
subscriber_email_addresses = ["finops-team@company.com", "engineering-manager@company.com"]
subscriber_sns_topic_arns = [aws_sns_topic.budget_alerts.arn]
}
notification {
comparison_operator = "GREATER_THAN"
threshold = 110 # Alert when forecasted spend exceeds 110% of budget
threshold_type = "PERCENTAGE"
notification_type = "FORECASTED"
subscriber_email_addresses = ["finops-team@company.com"]
subscriber_sns_topic_arns = [aws_sns_topic.budget_alerts.arn]
}
}
# 2. Per-service budget (EC2 is often the biggest line item)
resource "aws_budgets_budget" "ec2_monthly" {
name = "ec2-monthly-cost"
budget_type = "COST"
limit_amount = "2000"
limit_unit = "USD"
time_unit = "MONTHLY"
cost_filter {
name = "Service"
values = ["Amazon Elastic Compute Cloud - Compute"]
}
notification {
comparison_operator = "GREATER_THAN"
threshold = 90
threshold_type = "PERCENTAGE"
notification_type = "ACTUAL"
subscriber_sns_topic_arns = [aws_sns_topic.budget_alerts.arn]
}
}
# 3. Per-team budget (using cost allocation tags)
resource "aws_budgets_budget" "team_ml" {
name = "team-ml-monthly"
budget_type = "COST"
limit_amount = "3000"
limit_unit = "USD"
time_unit = "MONTHLY"
cost_filter {
name = "TagKeyValue"
values = ["user:Team$ml"] # Format is key$value; user-defined cost allocation tags take the "user:" prefix
}
notification {
comparison_operator = "GREATER_THAN"
threshold = 85
threshold_type = "PERCENTAGE"
notification_type = "ACTUAL"
subscriber_email_addresses = ["ml-team@company.com"]
subscriber_sns_topic_arns = [aws_sns_topic.budget_alerts.arn]
}
}
resource "aws_sns_topic" "budget_alerts" {
name = "budget-alerts"
}
resource "aws_sns_topic_policy" "budget_alerts" {
arn = aws_sns_topic.budget_alerts.arn
policy = jsonencode({
Version = "2012-10-17"
Statement = [{
Effect = "Allow"
Principal = { Service = "budgets.amazonaws.com" }
Action = "SNS:Publish"
Resource = aws_sns_topic.budget_alerts.arn
}]
})
}
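The three notifications on the total monthly budget can be read as simple comparisons against the limit. The sketch below mirrors them in Python as a way to reason about which alerts fire for a given actual and forecasted spend. The `Notification` class and `triggered` function are hypothetical names, and this is a simplified model rather than the Budgets service's actual evaluation logic.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Notification:
    threshold_pct: float
    notification_type: str  # "ACTUAL" or "FORECASTED"


# Mirrors the three notification blocks on the total-monthly-cost budget
TOTAL_MONTHLY_RULES = [
    Notification(80, "ACTUAL"),
    Notification(100, "ACTUAL"),
    Notification(110, "FORECASTED"),
]


def triggered(limit: float, actual: float, forecast: float,
              rules: List[Notification]) -> List[Notification]:
    """Return the rules whose GREATER_THAN percentage threshold is crossed."""
    fired = []
    for rule in rules:
        amount = actual if rule.notification_type == "ACTUAL" else forecast
        if amount > limit * rule.threshold_pct / 100:
            fired.append(rule)
    return fired


if __name__ == "__main__":
    # $4,200 spent so far, $5,600 forecast, $5,000 limit
    for rule in triggered(5000, 4200, 5600, TOTAL_MONTHLY_RULES):
        print(f"{rule.notification_type} > {rule.threshold_pct}%")
```

With $4,200 actual and $5,600 forecast, the 80% actual rule and the 110% forecast rule fire while the 100% actual rule does not, so the team hears about the likely overrun before it happens.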
Budget Actions: Automatic Enforcement
# Budget Actions automatically apply SCP or IAM policies when budget is exceeded
# Action 1: Deny EC2 instance launches when budget exceeded
resource "aws_budgets_budget_action" "deny_ec2_on_exceed" {
budget_name = aws_budgets_budget.total_monthly.name
action_type = "APPLY_SCP_POLICY"
approval_model = "AUTOMATIC"
notification_type = "ACTUAL"
execution_role_arn = aws_iam_role.budget_actions.arn
action_threshold {
action_threshold_type = "PERCENTAGE"
action_threshold_value = 100 # Trigger at 100% of budget
}
definition {
scp_action_definition {
policy_id = aws_organizations_policy.deny_ec2_launches.id
# SCPs never restrict the management account, so target member accounts or OUs.
# var.budget_target_ids is a hypothetical variable holding those IDs.
target_ids = var.budget_target_ids
}
}
subscriber {
address = "finops-team@company.com"
subscription_type = "EMAIL"
}
}
# The SCP that gets applied when budget is exceeded
resource "aws_organizations_policy" "deny_ec2_launches" {
name = "DenyEC2OnBudgetExceed"
description = "Applied by Budget Action when monthly budget exceeded"
type = "SERVICE_CONTROL_POLICY"
content = jsonencode({
Version = "2012-10-17"
Statement = [
{
Sid = "DenyNewEC2Instances"
Effect = "Deny"
Action = ["ec2:RunInstances"]
Resource = "arn:aws:ec2:*:*:instance/*"
Condition = {
StringNotLike = {
"aws:PrincipalArn" = "arn:aws:iam::*:role/BudgetActionRole"
}
}
}
]
})
}
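The Deny statement with a `StringNotLike` condition can be confusing to read, so here is a small sketch of its intent: every principal is blocked from `ec2:RunInstances` except one whose ARN matches the exempt pattern. This is a simplified model that uses shell-style wildcard matching and ignores everything else IAM evaluates; `is_launch_denied` is a hypothetical helper, not an AWS API.

```python
from fnmatch import fnmatchcase

# Pattern taken from the SCP condition above
EXEMPT_PRINCIPAL_PATTERN = "arn:aws:iam::*:role/BudgetActionRole"


def is_launch_denied(action: str, principal_arn: str) -> bool:
    """Simplified model of the DenyNewEC2Instances statement.

    The statement only covers ec2:RunInstances. StringNotLike makes the
    Deny apply when the principal ARN does NOT match the exempt pattern.
    """
    if action != "ec2:RunInstances":
        return False
    return not fnmatchcase(principal_arn, EXEMPT_PRINCIPAL_PATTERN)


if __name__ == "__main__":
    dev = "arn:aws:iam::111122223333:role/Developer"
    exempt = "arn:aws:iam::111122223333:role/BudgetActionRole"
    print(is_launch_denied("ec2:RunInstances", dev))        # True
    print(is_launch_denied("ec2:RunInstances", exempt))     # False
    print(is_launch_denied("ec2:DescribeInstances", dev))   # False
```

Note that the policy blocks new launches only. Instances that are already running keep running and keep costing money, so the guardrail limits further growth rather than reducing current spend.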
Cost Anomaly Detection Setup
# AWS Cost Anomaly Detection uses ML to find unusual spend patterns
# No threshold to set — it learns your baseline and alerts on deviations
resource "aws_ce_anomaly_monitor" "services_monitor" {
name = "services-anomaly-monitor"
monitor_type = "DIMENSIONAL"
monitor_dimension = "SERVICE" # Monitor each AWS service independently
}
resource "aws_ce_anomaly_monitor" "team_monitor" {
name = "team-cost-monitor"
monitor_type = "CUSTOM"
monitor_specification = jsonencode({
Tags = {
Key = "Team"
Values = ["ml", "backend", "platform", "data"]
}
})
}
# Subscription: get alerted when anomaly is detected
resource "aws_ce_anomaly_subscription" "realtime_alerts" {
name = "realtime-anomaly-alerts"
frequency = "IMMEDIATE" # Notify as soon as an anomaly is detected (alternatives: DAILY, WEEKLY); IMMEDIATE requires an SNS subscriber
monitor_arn_list = [
aws_ce_anomaly_monitor.services_monitor.arn,
aws_ce_anomaly_monitor.team_monitor.arn,
]
subscriber {
address = aws_sns_topic.anomaly_alerts.arn
type = "SNS"
}
threshold_expression {
# Both conditions must hold; each "and" block carries one dimension
and {
# Alert on anomalies with >= $50 total impact
dimension {
key = "ANOMALY_TOTAL_IMPACT_ABSOLUTE"
values = ["50"]
match_options = ["GREATER_THAN_OR_EQUAL"]
}
}
and {
# AND >= 20% over expected spend
dimension {
key = "ANOMALY_TOTAL_IMPACT_PERCENTAGE"
values = ["20"]
match_options = ["GREATER_THAN_OR_EQUAL"]
}
}
}
}
resource "aws_sns_topic" "anomaly_alerts" {
name = "cost-anomaly-alerts"
}
resource "aws_sns_topic_policy" "anomaly_alerts" {
arn = aws_sns_topic.anomaly_alerts.arn
policy = jsonencode({
Version = "2012-10-17"
Statement = [{
Effect = "Allow"
Principal = { Service = "costalerts.amazonaws.com" }
Action = "SNS:Publish"
Resource = aws_sns_topic.anomaly_alerts.arn
}]
})
}
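The subscription's threshold expression combines an absolute and a percentage condition, and it helps to see why both are needed. The sketch below assumes the percentage impact is computed as (actual - expected) / expected, which is an assumption about the service rather than something stated here; `should_alert` is a hypothetical helper.

```python
def should_alert(actual: float, expected: float,
                 min_abs: float = 50.0, min_pct: float = 20.0) -> bool:
    """Return True when an anomaly meets BOTH the dollar and percentage thresholds."""
    impact = actual - expected
    if impact <= 0:
        return False
    if expected <= 0:
        # No baseline spend: treat any positive impact as an unbounded percentage
        return impact >= min_abs
    pct = impact / expected * 100
    return impact >= min_abs and pct >= min_pct


if __name__ == "__main__":
    print(should_alert(180.0, 100.0))    # True:  $80 impact, 80% over baseline
    print(should_alert(1040.0, 1000.0))  # False: only $40 impact
    print(should_alert(1100.0, 1000.0))  # False: $100 impact but only 10% over
```

The absolute floor filters out large percentage swings on tiny services, and the percentage floor filters out normal fluctuation on large ones.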
Slack Integration via Lambda
import json
import os
import urllib.request

SLACK_WEBHOOK_URL = os.environ['SLACK_WEBHOOK_URL']


def format_budget_alert(message: dict) -> str:
    """Format AWS Budget SNS notification for Slack."""
    account = message.get('AccountId', 'unknown')
    budget_name = message.get('BudgetName', 'unknown')
    actual = float(message.get('ActualSpend', {}).get('Amount', 0))
    limit = float(message.get('BudgetLimit', {}).get('Amount', 0))
    pct = (actual / limit * 100) if limit else 0
    emoji = "🔴" if pct >= 100 else "🟡" if pct >= 80 else "🟢"
    return f"""{emoji} *Budget Alert: {budget_name}*
Account: {account}
Actual: ${actual:,.2f} / Limit: ${limit:,.2f} ({pct:.1f}%)
Action required: Review https://console.aws.amazon.com/cost-management/"""


def format_anomaly_alert(message: dict) -> str:
    """Format Cost Anomaly Detection SNS notification for Slack."""
    anomaly_id = message.get('anomalyId', 'unknown')
    # Read the service name defensively; confirm the exact key against a real notification
    service = message.get('dimensionalValue') or message.get('dimensionValue', 'unknown')
    impact = float(message.get('impact', {}).get('totalActualSpend', 0))
    expected = float(message.get('impact', {}).get('totalExpectedSpend', 0))
    pct_change = ((impact - expected) / expected * 100) if expected else 0
    return f"""🚨 *Cost Anomaly Detected*
Service: {service}
Anomaly ID: {anomaly_id}
Actual spend: ${impact:,.2f}
Expected spend: ${expected:,.2f}
Deviation: {pct_change:+.0f}% relative to baseline
View details: https://console.aws.amazon.com/cost-management/home#/anomaly-detection"""


def handler(event, context):
    for record in event['Records']:
        raw_message = record['Sns']['Message']
        # Subject can be null in the SNS event, so normalize it to a string
        subject = record['Sns'].get('Subject') or ''
        try:
            sns_message = json.loads(raw_message)
        except json.JSONDecodeError:
            # Some notifications (Budgets alerts in particular) may arrive as
            # plain text rather than JSON; forward those unchanged
            sns_message = None

        if not isinstance(sns_message, dict):
            text = f"AWS Cost Alert:\n{raw_message}"
        elif 'Budget' in subject:
            text = format_budget_alert(sns_message)
        elif 'anomalyId' in sns_message:
            text = format_anomaly_alert(sns_message)
        else:
            text = f"AWS Cost Alert:\n{json.dumps(sns_message, indent=2)}"

        payload = json.dumps({
            "text": text,
            "mrkdwn": True
        }).encode('utf-8')
        req = urllib.request.Request(
            SLACK_WEBHOOK_URL,
            data=payload,
            headers={'Content-Type': 'application/json'},
            method='POST'
        )
        # A timeout keeps a slow webhook from holding the Lambda open
        with urllib.request.urlopen(req, timeout=10) as response:
            response.read()
    return {'statusCode': 200}
Monthly FinOps Review Checklist
Monthly FinOps Review (30 minutes):
Week 1 (first Monday):
1. Review Cost Explorer: total vs budget, by service, by team
2. Check Cost Anomaly Detection: any unresolved anomalies?
3. Review Trusted Advisor: low-utilization instances, idle LBs
4. Check Reserved Instance/Savings Plan utilization rate
Goal: >85% utilization, <10% coverage gap
Week 2:
5. Rightsizing recommendations: act on instances averaging <30% CPU
6. Delete stale resources: stopped EC2 >30 days, old snapshots
7. Review S3 storage class distribution (has Intelligent Tiering moved data?)
Week 3:
8. Forecast vs actual: are we on track for monthly target?
9. Team cost allocation reports — send to each team manager
10. Review any budget alerts triggered in past 2 weeks
Month End:
11. Close-out: document actual vs budget variance
12. Update budget for next month based on expected changes
13. Purchase new RIs/SPs if coverage is low (anniversary reminder)
14. Update FinOps KPI dashboard
Key Metrics to Track:
- Total spend vs budget (target: <100%)
- RI/SP utilization (target: >85%)
- Rightsizing savings captured (target: >80% of recommendations acted on)
- Untagged resources (target: <5%)
- Cost per unit metric (e.g., cost per active user, cost per API call)
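The key metrics above are all simple ratios, so they are easy to compute from figures you already export each month. The following sketch shows one way to do that; the function name `finops_kpis`, its parameters, and the sample numbers are hypothetical, and the targets are the ones listed in the checklist.

```python
def pct(part: float, whole: float) -> float:
    """Percentage of part over whole, or 0.0 when whole is zero."""
    return 100 * part / whole if whole else 0.0


def finops_kpis(spend: float, budget: float,
                commitment_used: float, commitment_purchased: float,
                recs_acted: int, recs_total: int,
                untagged: int, total_resources: int,
                units: float) -> dict:
    """Compute the checklist's key metrics and whether each target is met."""
    values = {
        "spend_vs_budget_pct": pct(spend, budget),
        "ri_sp_utilization_pct": pct(commitment_used, commitment_purchased),
        "rightsizing_acted_pct": pct(recs_acted, recs_total),
        "untagged_pct": pct(untagged, total_resources),
        "cost_per_unit": spend / units if units else 0.0,
    }
    targets_met = {
        "spend_vs_budget_pct": values["spend_vs_budget_pct"] < 100,
        "ri_sp_utilization_pct": values["ri_sp_utilization_pct"] > 85,
        "rightsizing_acted_pct": values["rightsizing_acted_pct"] > 80,
        "untagged_pct": values["untagged_pct"] < 5,
    }
    return {"values": values, "targets_met": targets_met}


if __name__ == "__main__":
    report = finops_kpis(
        spend=4600, budget=5000,
        commitment_used=870, commitment_purchased=1000,
        recs_acted=8, recs_total=10,
        untagged=12, total_resources=400,
        units=23000,  # e.g. active users
    )
    for name, value in report["values"].items():
        print(f"{name}: {value:.2f}")
```

In the sample figures, acting on 8 of 10 rightsizing recommendations gives exactly 80%, which misses the ">80%" target, a useful reminder that the targets are strict inequalities.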
Conclusion
AWS Budgets and Cost Anomaly Detection are the two most important FinOps controls to deploy first. They cost almost nothing to run (Budgets: monitoring is free, and the first two action-enabled budgets are free, then $0.10/day each; Anomaly Detection: free) and provide immediate value. Budget Actions add automated enforcement that turns policy into reality: when the budget is exceeded, the guardrail activates without human intervention.
Combined with the strategies throughout this cloud cost series — right-sizing, Savings Plans, Spot instances, VPC Endpoints, storage tiering, and tagging — automated budget alerts and anomaly detection complete the FinOps loop: optimize proactively, enforce automatically, and respond immediately to deviations. This is the full cloud cost optimization playbook.
Sarah Chen
Senior Cybersecurity Engineer with 12+ years of experience in penetration testing and security architecture.