The Observability Cost Problem
Observability costs are invisible until they hit the bill. CloudWatch charges $0.50/GB for log ingestion, $0.03/GB/month for storage beyond the 5GB free tier, $0.30/month per custom metric, and $3/dashboard/month. A modest production system with 50 services each generating 1GB of logs per day and 500 custom metrics can easily spend $900-1,000/month on CloudWatch alone.
Datadog's per-host pricing ($23/host/month for Infrastructure) looks manageable at 10 hosts but reaches $2,300/month at 100 hosts — before APM ($31/host), Logs ($0.10/GB ingested + $1.70/million analyzed), and other add-ons. At scale, observability can become the second-largest cloud spend line item.
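The scaling can be sketched as a quick cost model. This uses the list prices quoted above; the log volume (750GB/month) is held constant so only the per-host items scale, and the indexing figure assumes roughly 1M indexed events per GB (about 1KB per event), which is an assumption:

```python
# Rough model of Datadog monthly spend as host count grows.
# List prices as quoted in the text; log volume held at 750GB/month.
def datadog_monthly_cost(hosts: int, log_gb: float = 750) -> float:
    infrastructure = hosts * 23      # $23/host/month (Infrastructure)
    apm = hosts * 31                 # $31/host/month (APM)
    log_ingestion = log_gb * 0.10    # $0.10/GB ingested
    log_indexing = log_gb * 1.70     # assumes ~1M events/GB at $1.70/million
    return infrastructure + apm + log_ingestion + log_indexing

for hosts in (10, 50, 100):
    print(f"{hosts} hosts: ${datadog_monthly_cost(hosts):,.0f}/month")
```

The per-host items dominate as the fleet grows, which is why the same stack that looks affordable at 10 hosts becomes a budget line at 100.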
CloudWatch Cost Breakdown and Optimization
CloudWatch Pricing (us-east-1):
Logs:
Ingestion: $0.50/GB
Storage: $0.03/GB/month (first 5GB free)
Queries: $0.005/GB scanned (Insights)
Metrics:
Custom metrics: $0.30/metric/month (first 10 free)
API calls: $0.01 per 1,000 GetMetricData calls
Contributor Insights: $0.02 per rule per hour
Dashboards:
$3/dashboard/month (first 3 free)
Alarms:
$0.10/alarm/month (first 10 free)
Example: 50-service production system
Logs: 50 services × 500MB/day × 30 days = 750GB
Ingestion: 750GB × $0.50 = $375/month
Storage (30-day retention): 750GB × $0.03 = $22.50/month
Custom metrics: 200 × $0.30 = $60/month
Dashboards: 10 × $3 = $30/month
Total: ~$487.50/month
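The arithmetic above can be wrapped in a small estimator. This is a sketch that mirrors the worked example; free tiers for metrics and dashboards are ignored for simplicity:

```python
# Prices: $0.50/GB ingest, $0.03/GB/month storage,
# $0.30/metric/month, $3/dashboard/month (us-east-1, as listed above).
def cloudwatch_monthly_cost(gb_ingested: float, gb_stored: float,
                            custom_metrics: int, dashboards: int) -> float:
    return (gb_ingested * 0.50
            + gb_stored * 0.03
            + custom_metrics * 0.30
            + dashboards * 3.0)

print(cloudwatch_monthly_cost(750, 750, 200, 10))  # 487.5
```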
Reducing CloudWatch Log Costs with Subscription Filters
import boto3
import json
logs = boto3.client('logs', region_name='us-east-1')
# Strategy 1: Reduce log volume with a Lambda subscription filter.
# Note: subscription filters run AFTER ingestion, so the $0.50/GB
# ingest charge still applies; they cut storage, forwarding, and
# downstream analysis costs. To avoid ingestion charges entirely,
# drop DEBUG/INFO at the agent or application level.

# Lambda function that filters log events (deployed via CloudFormation)
FILTER_FUNCTION_CODE = """
import base64
import gzip
import json

def handler(event, context):
    # Decode and decompress the CloudWatch Logs payload
    compressed = base64.b64decode(event['awslogs']['data'])
    uncompressed = gzip.decompress(compressed)
    log_data = json.loads(uncompressed)

    filtered_events = []
    for log_event in log_data['logEvents']:
        message = log_event['message']
        # Only keep WARN, ERROR, CRITICAL; drop DEBUG/INFO
        if any(level in message for level in ['WARN', 'ERROR', 'CRITICAL', 'Exception', 'Traceback']):
            filtered_events.append(log_event)

    # Forward filtered_events to S3 or another log group here;
    # dropped events are never stored or re-ingested downstream.
    return {'filtered': len(log_data['logEvents']) - len(filtered_events)}
"""
# Strategy 2: Set appropriate log retention (stop paying for old logs)
def set_retention_policies():
    """Set 30-day retention for most log groups, 7-day for debug."""
    paginator = logs.get_paginator('describe_log_groups')
    for page in paginator.paginate():
        for group in page['logGroups']:
            name = group['logGroupName']
            current_retention = group.get('retentionInDays', 'Never')
            # Never-expiring logs are a cost time bomb
            if current_retention == 'Never':
                if '/debug/' in name or '/dev/' in name:
                    retention = 7
                elif '/staging/' in name:
                    retention = 14
                else:
                    retention = 30
                logs.put_retention_policy(
                    logGroupName=name,
                    retentionInDays=retention
                )
                print(f"Set {name}: {retention} days (was: Never)")

set_retention_policies()
# Find log groups with no retention policy (potential cost bomb)
aws logs describe-log-groups --query 'logGroups[?!retentionInDays].{Name:logGroupName,Size:storedBytes}' --output table

# Find the ten most expensive log groups by stored bytes
aws logs describe-log-groups --query 'sort_by(logGroups, &storedBytes)[-10:].{Name:logGroupName,Bytes:storedBytes}' --output table

# Convert bytes to GB for readability
aws logs describe-log-groups --query 'logGroups[*].[logGroupName, storedBytes]' --output text | awk '{printf "%-60s %.2f GB\n", $1, $2/1073741824}' | sort -k2 -rn | head -20
Datadog True Cost vs CloudWatch
Cost comparison for 50-node production cluster:
DATADOG (Infrastructure + APM + Logs):
Infrastructure: 50 hosts × $23/month = $1,150
APM: 50 hosts × $31/month = $1,550
Logs ingestion: 750GB × $0.10 = $75
Log indexing: ~750M events (at ~1KB/event) × $1.70/million = $1,275
Monthly total: $4,050
CLOUDWATCH:
Logs: $487/month (as calculated above)
Container Insights: 50 nodes × $0.35/node/month = $17.50
X-Ray (tracing): $5 per 1M traces recorded + $0.50 per 1M retrieved = ~$100/month at 20M traces
Monthly total: ~$604/month
SAVING vs Datadog: $3,446/month (85%)
SELF-HOSTED (EKS, Prometheus + Grafana + Loki):
EC2 for stack: 3 × t3.large = 3 × $60.48 = $181.44/month
EBS storage: 1TB (1,024GB) × $0.08/GB = $81.92/month
S3 long-term storage (Thanos/Cortex): ~$20/month
Operational engineering: 4hr/month × $100/hr = $400/month
Monthly total: ~$683/month
Note: Datadog scales linearly with host count while the self-hosted stack is mostly fixed cost, so the gap widens with scale (500+ nodes)
At 50 nodes: Datadog ($4,050) vs self-hosted ($683) = $3,367/month saving
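A rough break-even check for the per-host vs fixed-cost tradeoff. This sketch compares only Datadog's infra+APM per-host pricing against the self-hosted total from above, ignoring log costs on both sides:

```python
def breakeven_hosts(selfhosted_monthly: float = 683.0,
                    datadog_per_host: float = 23 + 31) -> int:
    """Smallest host count at which Datadog (Infrastructure + APM only)
    overtakes the roughly fixed self-hosted cost."""
    hosts = 1
    while hosts * datadog_per_host <= selfhosted_monthly:
        hosts += 1
    return hosts

print(breakeven_hosts())  # 13
```

Beyond about 13 hosts, the fixed self-hosted cost already undercuts Datadog's per-host pricing, and the advantage compounds as the fleet grows.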
Self-Hosted Prometheus + Grafana + Loki Stack
# Deploy the observability stack with Helm
# This replaces Datadog for most use cases
# Add Helm repos
# helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
# helm repo add grafana https://grafana.github.io/helm-charts
# helm repo update
# values-prometheus.yaml
global:
  scrape_interval: 30s  # default is 15s; scraping half as often roughly halves sample storage

prometheus:
  prometheusSpec:
    retention: 15d      # keep 15 days local (use Thanos for longer)
    retentionSize: 50GB

    # Resource requests right-sized for 50 nodes
    resources:
      requests:
        cpu: 500m
        memory: 2Gi
      limits:
        cpu: 2
        memory: 8Gi

    # Remote write to S3 via Thanos for long-term storage ($0.023/GB vs $0.08/GB gp3 EBS)
    remoteWrite:
      - url: http://thanos-receive:19291/api/v1/receive

    # Don't scrape every pod: use relabeling to keep only annotated pods
    additionalScrapeConfigs:
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          # Only scrape pods with annotation prometheus.io/scrape=true
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
---
# values-loki.yaml
loki:
  storage:
    type: s3
    s3:
      endpoint: s3.amazonaws.com
      region: us-east-1
      bucketnames: my-loki-logs
  limits_config:
    ingestion_rate_mb: 32
    max_streams_per_user: 10000
    retention_period: 30d  # auto-delete after 30 days
  # Compression reduces storage cost 5-10x
  chunk_encoding: snappy
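The compression note converts directly into S3 spend. A rough sketch; the 7x ratio is an assumption inside the stated 5-10x range, and $0.023/GB is S3 Standard list price:

```python
# Estimated monthly S3 cost for Loki chunks after compression.
S3_PRICE_PER_GB = 0.023  # S3 Standard, us-east-1

def loki_s3_cost(raw_gb_per_month: float, compression_ratio: float = 7) -> float:
    """Monthly S3 cost for one month of compressed log chunks."""
    return round(raw_gb_per_month / compression_ratio * S3_PRICE_PER_GB, 2)

print(loki_s3_cost(750))  # 2.46
```

For the 750GB/month example above, chunk storage lands in the low single digits of dollars, which is why the S3 line barely registers in the self-hosted total.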
Sampling Strategies for Distributed Tracing
# Don't trace every request: intelligent sampling cuts most tracing cost
# AWS X-Ray: sampling rules
import boto3

xray = boto3.client('xray', region_name='us-east-1')

xray.create_sampling_rule(
    SamplingRule={
        'RuleName': 'production-api',
        'Priority': 10,       # lower numbers evaluated first (valid range 1-9999)
        'FixedRate': 0.05,    # sample 5% of requests
        'ReservoirSize': 5,   # always sample up to 5 requests/sec regardless of rate
        'ServiceName': 'api',
        'ServiceType': '*',
        'Host': '*',
        'HTTPMethod': '*',
        'URLPath': '*',
        'ResourceARN': '*',
        'Version': 1
    }
)

# Critical-path sampling: X-Ray decides at the start of a request
# (head-based sampling), so it cannot keep "only errors" after the fact.
# Instead, trace 100% of business-critical paths such as checkout.
xray.create_sampling_rule(
    SamplingRule={
        'RuleName': 'always-sample-checkout',
        'Priority': 1,        # evaluated before the 5% rule
        'FixedRate': 1.0,     # 100% of matching requests
        'ReservoirSize': 100,
        'ServiceName': 'api',
        'ServiceType': '*',
        'Host': '*',
        'HTTPMethod': '*',
        'URLPath': '/api/checkout',  # always trace checkout
        'ResourceARN': '*',
        'Version': 1
    }
)

# At 5% sampling: 20M requests/day → 1M traces/day sampled
# X-Ray cost: 1M/day × $5/1M = $5/day ≈ $150/month
# vs unsampled: 20M/day × $5/1M = $100/day = $3,000/month
# Saving: ~$2,850/month (95%)
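The sampling arithmetic can be parameterized as a small helper. A sketch assuming X-Ray's $5 per million traces recorded and a 30-day month:

```python
def xray_monthly_cost(requests_per_day: int, sample_rate: float,
                      price_per_million: float = 5.0) -> float:
    """Monthly X-Ray trace-recording cost at a fixed sampling rate."""
    traces_per_month = requests_per_day * sample_rate * 30
    return traces_per_month / 1_000_000 * price_per_million

print(xray_monthly_cost(20_000_000, 0.05))  # 150.0
```

This makes it easy to test candidate sampling rates against a tracing budget before changing production rules.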
Conclusion
Observability costs are controllable with the right strategies. For CloudWatch: set retention policies on every log group, drop DEBUG/INFO at the agent or application level so it is never ingested, and use subscription filters to trim what gets stored and forwarded. For expensive Datadog deployments: evaluate whether a self-hosted Prometheus+Grafana+Loki stack meets your needs at roughly 85% lower cost. Use a low fixed sampling rate (around 5%) for traces, keeping 100% sampling only for business-critical paths.
The goal is not to reduce observability — it is to stop paying for data you never look at. Most teams find that 80% of their logs are DEBUG-level noise that adds zero value in production.
Daniel Park
AI/ML Engineer focused on practical applications of machine learning in DevOps and cloud operations.