Cloud & Infrastructure

Taming Cloud Observability Costs: CloudWatch, Datadog, and Open Source Alternatives

Observability is essential, but CloudWatch log ingestion at $0.50/GB and Datadog at $23/host/month add up fast. Learn how to filter logs before ingestion, use metric namespaces efficiently, sample traces, and compare the true cost of CloudWatch vs Datadog vs self-hosted Prometheus+Grafana+Loki.


Daniel Park

AI/ML Engineer focused on practical applications of machine learning in DevOps and cloud operations.

April 12, 2026
21 min read

The Observability Cost Problem

Observability costs are invisible until they hit the bill. CloudWatch charges $0.50/GB for log ingestion, $0.03/GB/month for storage beyond the 5GB free tier, $0.30 per custom metric per month, and $3/dashboard/month beyond the first three. A modest production system with 50 services each generating 1GB of logs per day and 500 custom metrics can easily spend $700-900/month on CloudWatch alone.

Datadog's per-host pricing ($23/host/month for Infrastructure) looks manageable at 10 hosts but reaches $2,300/month at 100 hosts — before APM ($31/host), Logs ($0.10/GB ingested + $1.70/million analyzed), and other add-ons. At scale, observability can become the second-largest cloud spend line item.

CloudWatch Cost Breakdown and Optimization

CloudWatch Pricing (us-east-1):

Logs:
  Ingestion:  $0.50/GB
  Storage:    $0.03/GB/month (first 5GB free)
  Queries:    $0.005/GB scanned (Insights)
  
Metrics:
  Custom metrics:   $0.30/metric/month (first 10 free)
  API calls:        $0.01 per 1,000 GetMetricData calls
  Contributor Insights: $0.02 per rule per hour
  
Dashboards:
  $3/dashboard/month (first 3 free)

Alarms:
  $0.10/alarm/month (first 10 free)

Example: 50-service production system
  Logs: 50 services × 500MB/day × 30 days = 750GB
    Ingestion: 750 × $0.50 = $375/month
    Storage (30-day retention): 750GB × $0.03 = $22.50/month
  Custom metrics: 200 × $0.30 = $60/month
  Dashboards: 10 × $3 = $30/month
  Total: ~$487.50/month
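The breakdown above can be wrapped in a small cost model so you can plug in your own numbers. This is a sketch using the us-east-1 prices listed earlier; the helper name and the decision to ignore free tiers are my own simplifications:

```python
# Rough CloudWatch monthly cost model (us-east-1 list prices from above).
# Free tiers (5GB storage, 10 metrics, 3 dashboards) are ignored for
# simplicity; they shave only a few dollars off a real bill.

INGESTION_PER_GB = 0.50
STORAGE_PER_GB_MONTH = 0.03
CUSTOM_METRIC_PER_MONTH = 0.30
DASHBOARD_PER_MONTH = 3.00

def cloudwatch_monthly_cost(services: int, mb_per_service_per_day: float,
                            custom_metrics: int, dashboards: int) -> dict:
    """Estimate the monthly CloudWatch bill for a fleet of services."""
    log_gb = services * (mb_per_service_per_day / 1000) * 30
    costs = {
        "log_ingestion": log_gb * INGESTION_PER_GB,
        "log_storage": log_gb * STORAGE_PER_GB_MONTH,  # 30-day retention
        "custom_metrics": custom_metrics * CUSTOM_METRIC_PER_MONTH,
        "dashboards": dashboards * DASHBOARD_PER_MONTH,
    }
    costs["total"] = sum(costs.values())
    return costs

# The 50-service example from above: 750GB logs, 200 metrics, 10 dashboards
print(cloudwatch_monthly_cost(50, 500, custom_metrics=200, dashboards=10))
```

Running it with the article's inputs reproduces the ~$487.50 total, and it makes the dominant term obvious: ingestion dwarfs storage, so ingestion is where to optimize first.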

Reducing CloudWatch Log Costs with Subscription Filters

import boto3
import json

logs = boto3.client('logs', region_name='us-east-1')

# Strategy 1: Drop noisy log levels with a subscription filter.
# Caveat: subscription filters run AFTER ingestion, so the $0.50/GB ingestion
# charge is already paid; the savings come from downstream storage,
# forwarding, and indexing. To avoid ingestion charges entirely, filter at
# the agent (CloudWatch agent or Fluent Bit) before logs leave the host.

# Lambda function to filter logs (deployed via CloudFormation)
FILTER_FUNCTION_CODE = """
import base64
import gzip
import json

def handler(event, context):
    # Decode CloudWatch Logs data
    compressed = base64.b64decode(event['awslogs']['data'])
    uncompressed = gzip.decompress(compressed)
    log_data = json.loads(uncompressed)
    
    filtered_events = []
    for log_event in log_data['logEvents']:
        message = log_event['message']
        
        # Only keep WARN, ERROR, CRITICAL — drop DEBUG/INFO
        if any(level in message for level in ['WARN', 'ERROR', 'CRITICAL', 'Exception', 'Traceback']):
            filtered_events.append(log_event)
    
    # Forward filtered_events to S3 or another destination here.
    # Note: ingestion into the source log group has already been billed;
    # dropping events cuts what you store and index downstream.
    return {'filtered': len(log_data['logEvents']) - len(filtered_events)}
"""

# Strategy 2: Set appropriate log retention (stop paying for old logs)
def set_retention_policies():
    """Set 30-day retention for most log groups, 7-day for debug."""
    paginator = logs.get_paginator('describe_log_groups')
    
    for page in paginator.paginate():
        for group in page['logGroups']:
            name = group['logGroupName']
            current_retention = group.get('retentionInDays', 'Never')
            
            # Never-expiring logs are a cost time bomb
            if current_retention == 'Never':
                if '/debug/' in name or '/dev/' in name:
                    retention = 7
                elif '/staging/' in name:
                    retention = 14
                else:
                    retention = 30
                    
                logs.put_retention_policy(
                    logGroupName=name,
                    retentionInDays=retention
                )
                print(f"Set {name}: {retention} days (was: Never)")

set_retention_policies()

# Companion CLI checks (run in a shell):

# Find log groups with no retention policy (potential cost bomb)
aws logs describe-log-groups \
  --query 'logGroups[?!retentionInDays].{Name:logGroupName,Bytes:storedBytes}' \
  --output table

# Find the ten largest log groups by stored bytes
aws logs describe-log-groups \
  --query 'sort_by(logGroups, &storedBytes)[-10:].{Name:logGroupName,Bytes:storedBytes}' \
  --output table

# Convert bytes to GB for readability
aws logs describe-log-groups \
  --query 'logGroups[*].[logGroupName, storedBytes]' \
  --output text | awk '{printf "%-60s %.2f GB\n", $1, $2/1073741824}' | sort -k2 -rn | head -20

Datadog True Cost vs CloudWatch

Cost comparison for 50-node production cluster:

DATADOG (Infrastructure + APM + Logs):
  Infrastructure: 50 hosts × $23/month = $1,150
  APM: 50 hosts × $31/month = $1,550
  Logs ingestion: 750GB × $0.10 = $75
  Log indexing: 750GB × $1.70 = $1,275
  Monthly total: $4,050

CLOUDWATCH:
  Logs: $487/month (as calculated above)
  Container Insights: 50 nodes × $0.35/node/month = $17.50
  X-Ray (tracing): $5/1M traces + $0.50/1M = ~$100/month at 20M traces
  Monthly total: ~$604/month
  
  SAVING vs Datadog: $3,446/month (85%)

SELF-HOSTED (EKS, Prometheus + Grafana + Loki):
  EC2 for stack: 3× t3.large = 3 × $60.48 = $181.44/month
  EBS storage: 1TB × $0.08/GB = $81.92/month
  S3 long-term storage (Thanos/Cortex): ~$20/month
  Operational engineering: 4hr/month × $100 = $400/month
  Monthly total: ~$683/month
  
  Note: At scale (500+ nodes), self-hosted becomes far cheaper
  At 50 nodes: Datadog ($4,050) vs self-hosted ($683) = $3,367 saving
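These breakdowns can be parameterized to compare the options for your own fleet size. A sketch mirroring the list prices above (datadog_monthly and self_hosted_monthly are illustrative helpers; real Datadog bills vary with committed-use discounts and which add-ons you enable):

```python
# Monthly cost models from the comparison above, using list prices.

def datadog_monthly(hosts: int, log_gb: float, indexed_gb: float) -> float:
    infra = hosts * 23.0        # Infrastructure
    apm = hosts * 31.0          # APM
    ingest = log_gb * 0.10      # Log ingestion
    index = indexed_gb * 1.70   # Log indexing (as in the breakdown above)
    return infra + apm + ingest + index

def self_hosted_monthly(engineer_hours: float = 4,
                        hourly_rate: float = 100) -> float:
    ec2 = 3 * 60.48             # 3x t3.large for the stack
    ebs = 1024 * 0.08           # 1TB gp3
    s3 = 20.0                   # Thanos/Cortex long-term storage
    return ec2 + ebs + s3 + engineer_hours * hourly_rate

# 50-node cluster from the comparison:
print(f"Datadog:     ${datadog_monthly(50, 750, 750):,.2f}")
print(f"Self-hosted: ${self_hosted_monthly():,.2f}")
```

Note how the models differ in shape: Datadog scales linearly per host, while the self-hosted stack is mostly fixed cost plus operational hours, which is why the gap widens as the fleet grows.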

Self-Hosted Prometheus + Grafana + Loki Stack

# Deploy the observability stack with Helm
# This replaces Datadog for most use cases

# Add Helm repos
# helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
# helm repo add grafana https://grafana.github.io/helm-charts
# helm repo update

# values-prometheus.yaml (kube-prometheus-stack chart; key names are a
# sketch, check your chart version's values reference)
prometheus:
  prometheusSpec:
    scrapeInterval: 30s  # Default 15s; halving scrape frequency roughly halves sample storage
    retention: 15d       # Keep 15 days local (use Thanos for longer)
    retentionSize: 50GB

    # Resource requests, right-sized for 50 nodes
    resources:
      requests:
        cpu: 500m
        memory: 2Gi
      limits:
        cpu: 2
        memory: 8Gi

    # Remote write to S3 via Thanos for long-term storage ($0.023/GB vs $0.08/GB EBS)
    remoteWrite:
      - url: http://thanos-receive:19291/api/v1/receive

    # Don't scrape every pod; use relabeling to keep only opted-in pods
    additionalScrapeConfigs:
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          # Only scrape pods with annotation prometheus.io/scrape=true
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true

---
# values-loki.yaml
loki:
  storage:
    type: s3
    s3:
      endpoint: s3.amazonaws.com
      region: us-east-1
      bucketnames: my-loki-logs
      
  limits_config:
    ingestion_rate_mb: 32
    max_streams_per_user: 10000
    retention_period: 30d  # Auto-delete after 30 days
    
  # Compression reduces storage cost 5-10x
  chunk_encoding: snappy

Sampling Strategies for Distributed Tracing

# Don't trace every request — intelligent sampling saves 90% of tracing costs

# AWS X-Ray: sampling rules
import boto3
xray = boto3.client('xray', region_name='us-east-1')

xray.create_sampling_rule(
    SamplingRule={
        'RuleName': 'production-api',
        'Priority': 1,
        'FixedRate': 0.05,      # Sample 5% of requests
        'ReservoirSize': 5,     # Always sample 5 requests/sec regardless of rate
        'ServiceName': 'api',
        'ServiceType': '*',
        'Host': '*',
        'HTTPMethod': '*',
        'URLPath': '*',
        'ResourceARN': '*',
        'Version': 1
    }
)

# Error sampling: always sample errors and slow requests
xray.create_sampling_rule(
    SamplingRule={
        'RuleName': 'always-sample-errors',
        'Priority': 0,          # Higher priority
        'FixedRate': 1.0,       # 100% of errors
        'ReservoirSize': 100,
        'ServiceName': 'api',
        'ServiceType': '*',
        'Host': '*',
        'HTTPMethod': '*',
        'URLPath': '/api/checkout',  # Always trace checkout
        'ResourceARN': '*',
        'Version': 1
    }
)

# At 5% sampling: 20M requests/day → 1M traces/day sampled
# X-Ray cost: 1M × $5/1M = $5/day ≈ $150/month
# vs unsampled: 600M traces/month × $5/1M = $3,000/month
# Saving: ~$2,850/month (95%)
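The same arithmetic generalizes to any sampling rate. A sketch that ignores the reservoir floor and the comparatively small $0.50/1M retrieval charge:

```python
# Estimate monthly X-Ray recording cost under head-based sampling.
# Ignores the per-second reservoir and retrieval charges, which are
# small next to the recording cost at this volume.

PRICE_PER_MILLION_TRACES = 5.0

def xray_monthly_cost(requests_per_day: float, sample_rate: float) -> float:
    traces_per_month = requests_per_day * 30 * sample_rate
    return traces_per_month / 1_000_000 * PRICE_PER_MILLION_TRACES

sampled = xray_monthly_cost(20_000_000, 0.05)  # 5% sampling
full = xray_monthly_cost(20_000_000, 1.0)      # trace everything
print(f"5% sampling: ${sampled:,.0f}/mo, unsampled: ${full:,.0f}/mo, "
      f"saving {1 - sampled / full:.0%}")
```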

Conclusion

Observability costs are controllable with the right strategies. For CloudWatch: set retention policies, filter noisy logs at the agent before ingestion, and use subscription filters to cut what you store and forward downstream. For expensive Datadog deployments: evaluate whether a self-hosted Prometheus+Grafana+Loki stack meets your needs at over 80% lower cost. Use 5% sampling for traces, keeping 100% sampling only for errors and critical paths.

The goal is not to reduce observability — it is to stop paying for data you never look at. Most teams find that 80% of their logs are DEBUG-level noise that adds zero value in production.

