Cloud & Infrastructure

Taming Cloud Observability Costs: CloudWatch, Datadog, and Open Source Alternatives

Observability is essential, but CloudWatch log ingestion at $0.50/GB and Datadog at $23/host/month add up fast. Learn how to filter logs before ingestion, use metric namespaces efficiently, sample traces, and compare the true cost of CloudWatch vs Datadog vs self-hosted Prometheus+Grafana+Loki.


Daniel Park

AI/ML Engineer focused on practical applications of machine learning in DevOps and cloud operations.

April 12, 2026
21 min read

The Observability Cost Problem

Observability costs are invisible until they hit the bill. CloudWatch charges $0.50/GB for log ingestion, $0.03/GB/month for storage beyond the 5GB free tier, $0.30 per custom metric per month, and $3/dashboard/month. A modest production system with 50 services, each emitting 1GB of logs per day, plus 500 custom metrics can easily top $900/month on CloudWatch alone.

Datadog's per-host pricing ($23/host/month for Infrastructure) looks manageable at 10 hosts but reaches $2,300/month at 100 hosts — before APM ($31/host), Logs ($0.10/GB ingested + $1.70 per million indexed events), and other add-ons. At scale, observability can become the second-largest cloud spend line item.

CloudWatch Cost Breakdown and Optimization

CloudWatch Pricing (us-east-1):

Logs:
  Ingestion:  $0.50/GB
  Storage:    $0.03/GB/month (first 5GB free)
  Queries:    $0.005/GB scanned (Insights)
  
Metrics:
  Custom metrics:   $0.30/metric/month (first 10 free)
  API calls:        $0.01 per 1,000 metrics requested (GetMetricData)
  Contributor Insights: $0.50 per rule per month (plus matched log events)
  
Dashboards:
  $3/dashboard/month (first 3 free)

Alarms:
  $0.10/alarm/month (first 10 free)

Example: 50-service production system
  Logs: 50 services × 500MB/day × 30 days = 750GB
    Ingestion: 750GB × $0.50 = $375/month
    Storage (30-day retention): 750GB × $0.03 = $22.50/month
  Custom metrics: 200 × $0.30 = $60/month
  Dashboards: 10 × $3 = $30/month
  Total: ~$487.50/month
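
That back-of-envelope math generalizes. Here is a minimal sketch of the same arithmetic as a reusable estimator — list prices from the table above; the function and parameter names are my own:

```python
# Sketch: the $487.50 example as a reusable estimator (us-east-1 list
# prices; function and parameter names are assumptions, not an AWS API).

def cloudwatch_monthly_cost(services, mb_per_service_per_day,
                            custom_metrics, dashboards, days=30):
    log_gb = services * mb_per_service_per_day * days / 1000  # decimal GB
    return (log_gb * 0.50        # ingestion, $0.50/GB
            + log_gb * 0.03      # storage at 30-day retention, $0.03/GB-month
            + custom_metrics * 0.30   # $0.30/metric/month
            + dashboards * 3.0)       # $3/dashboard/month

print(round(cloudwatch_monthly_cost(50, 500, 200, 10), 2))  # ~487.5, as above
```

Plugging in your own service count and per-service volume makes it obvious which knob (almost always log ingestion) dominates the bill.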

Reducing CloudWatch Log Costs with Subscription Filters

import boto3
import json

logs = boto3.client('logs', region_name='us-east-1')

# Strategy 1: Filter logs down to actionable events
# A Lambda subscription filter drops DEBUG/INFO before anything downstream
# sees them. Note: CloudWatch still bills ingestion for events that reach a
# log group; to avoid that charge entirely, filter at the agent before shipping.

# Lambda function to filter logs (deployed via CloudFormation)
FILTER_FUNCTION_CODE = """
import base64
import gzip
import json

def handler(event, context):
    # Decode CloudWatch Logs data
    compressed = base64.b64decode(event['awslogs']['data'])
    uncompressed = gzip.decompress(compressed)
    log_data = json.loads(uncompressed)
    
    filtered_events = []
    for log_event in log_data['logEvents']:
        message = log_event['message']
        
        # Only keep WARN, ERROR, CRITICAL — drop DEBUG/INFO
        if any(level in message for level in ['WARN', 'ERROR', 'CRITICAL', 'Exception', 'Traceback']):
            filtered_events.append(log_event)
    
    # Forward filtered events to S3 or another log group
    # Dropping events here cuts storage, query, and downstream costs;
    # the ingestion fee for the source log group has already been paid
    return {'filtered': len(log_data['logEvents']) - len(filtered_events)}
"""

# Strategy 2: Set appropriate log retention (stop paying for old logs)
def set_retention_policies():
    """Set 30-day retention for most log groups, 7-day for debug."""
    paginator = logs.get_paginator('describe_log_groups')
    
    for page in paginator.paginate():
        for group in page['logGroups']:
            name = group['logGroupName']
            current_retention = group.get('retentionInDays', 'Never')
            
            # Never-expiring logs are a cost time bomb
            if current_retention == 'Never':
                if '/debug/' in name or '/dev/' in name:
                    retention = 7
                elif '/staging/' in name:
                    retention = 14
                else:
                    retention = 30
                    
                logs.put_retention_policy(
                    logGroupName=name,
                    retentionInDays=retention
                )
                print(f"Set {name}: {retention} days (was: Never)")

set_retention_policies()
# Find log groups with no retention policy (potential cost bomb)
aws logs describe-log-groups \
  --query 'logGroups[?!retentionInDays].{Name:logGroupName,Size:storedBytes}' \
  --output table

# Find the 10 largest log groups by stored bytes
aws logs describe-log-groups \
  --query 'sort_by(logGroups, &storedBytes)[-10:].{Name:logGroupName,Bytes:storedBytes}' \
  --output table

# Convert bytes to GB for readability
aws logs describe-log-groups \
  --query 'logGroups[*].[logGroupName, storedBytes]' \
  --output text \
  | awk '{printf "%-60s %.2f GB\n", $1, $2/1073741824}' \
  | sort -k2 -rn | head -20

Datadog True Cost vs CloudWatch

Cost comparison for 50-node production cluster:

DATADOG (Infrastructure + APM + Logs):
  Infrastructure: 50 hosts × $23/month = $1,150
  APM: 50 hosts × $31/month = $1,550
  Logs ingestion: 750GB × $0.10 = $75
  Log indexing: ~750M events (at ~1KB/event) × $1.70/million = $1,275
  Monthly total: $4,050

CLOUDWATCH:
  Logs: $487/month (as calculated above)
  Container Insights: 50 nodes × $0.35/node/month = $17.50
  X-Ray (tracing): $5/1M traces recorded + $0.50/1M retrieved = ~$100/month at 20M traces
  Monthly total: ~$604/month
  
  SAVING vs Datadog: $3,446/month (85%)

SELF-HOSTED (EKS, Prometheus + Grafana + Loki):
  EC2 for stack: 3 × t3.large = 3 × $60.48 = $181.44/month
  EBS storage: 1TB × $0.08/GB = $81.92/month
  S3 long-term storage (Thanos/Cortex): ~$20/month
  Operational engineering: 4hr/month × $100 = $400/month
  Monthly total: ~$683/month
  
  Note: at scale (500+ nodes), self-hosted pulls even further ahead
  At 50 nodes: Datadog ($4,050) vs self-hosted ($683) = $3,367 saving
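
The comparison is easier to rerun for your own fleet as a quick model. A sketch using the list prices quoted above — the function names and parameter split are mine, and negotiated discounts will shift the numbers:

```python
# Sketch: monthly cost models for the Datadog and CloudWatch scenarios
# above, using list prices. Self-hosted is roughly fixed (~$683/month here).

def datadog_monthly(hosts, log_gb, indexed_millions):
    return (hosts * 23                   # Infrastructure, per host
            + hosts * 31                 # APM, per host
            + log_gb * 0.10              # log ingestion, per GB
            + indexed_millions * 1.70)   # indexed events, per million

def cloudwatch_monthly(log_gb, nodes, traces_millions, metrics, dashboards):
    return (log_gb * 0.50 + log_gb * 0.03  # ingestion + 30-day storage
            + nodes * 0.35                 # Container Insights, per node
            + traces_millions * 5.0        # X-Ray traces recorded
            + metrics * 0.30 + dashboards * 3.0)

print(round(datadog_monthly(50, 750, 750), 2))             # 4050.0
print(round(cloudwatch_monthly(750, 50, 20, 200, 10), 2))  # 605.0 (~$604 above, rounding)
```

Datadog scales linearly with host count while the self-hosted stack is mostly flat, which is why the gap keeps widening past 50 nodes.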

Self-Hosted Prometheus + Grafana + Loki Stack

# Deploy the observability stack with Helm
# This replaces Datadog for most use cases

# Add Helm repos
# helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
# helm repo add grafana https://grafana.github.io/helm-charts
# helm repo update

# values-prometheus.yaml (kube-prometheus-stack)
prometheus:
  prometheusSpec:
    scrapeInterval: 30s  # 15s is common; 30s halves sample volume and storage
    retention: 15d       # Keep 15 days local (use Thanos for longer)
    retentionSize: 50GB
    
    # Resource requests — right-sized for 50 nodes
    resources:
      requests:
        cpu: 500m
        memory: 2Gi
      limits:
        cpu: 2
        memory: 8Gi

    # Remote write to S3 via Thanos for long-term storage ($0.023/GB vs $0.30/GB EBS)
    remoteWrite:
      - url: http://thanos-receive:19291/api/v1/receive

    # Don't scrape every pod — only targets that opt in via annotation
    additionalScrapeConfigs:
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          # Only scrape pods with annotation prometheus.io/scrape=true
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true

---
# values-loki.yaml
loki:
  storage:
    type: s3
    s3:
      endpoint: s3.amazonaws.com
      region: us-east-1
      bucketnames: my-loki-logs
      
  limits_config:
    ingestion_rate_mb: 32
    max_streams_per_user: 10000
    retention_period: 30d  # Auto-delete after 30 days
    
  # Compression reduces storage cost 5-10x
  chunk_encoding: snappy
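
The 5-10x compression claim is easy to sanity-check: structured log lines are highly repetitive. A quick illustration with stdlib gzip — snappy, which Loki uses here, trades some ratio for speed, but the shape of the result is the same:

```python
# Quick sanity check: repetitive structured log lines compress dramatically.
# gzip stands in for snappy purely for illustration; the sample line is made up.
import gzip

line = '2026-04-12T10:00:00Z INFO api request_id=abc123 path=/api/users status=200 latency_ms=42\n'
raw = (line * 10_000).encode()

compressed = gzip.compress(raw)
ratio = len(raw) / len(compressed)
print(f'{len(raw):,} B -> {len(compressed):,} B ({ratio:.0f}x smaller)')
```

Real log streams vary more than this repeated line, so expect ratios closer to the 5-10x range than to this best case.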

Sampling Strategies for Distributed Tracing

# Don't trace every request — intelligent sampling saves 90% of tracing costs

# AWS X-Ray: sampling rules
import boto3
xray = boto3.client('xray', region_name='us-east-1')

xray.create_sampling_rule(
    SamplingRule={
        'RuleName': 'production-api',
        'Priority': 100,        # Evaluated after the critical-path rule below
        'FixedRate': 0.05,      # Sample 5% of requests
        'ReservoirSize': 5,     # Always sample 5 requests/sec regardless of rate
        'ServiceName': 'api',
        'ServiceType': '*',
        'Host': '*',
        'HTTPMethod': '*',
        'URLPath': '*',
        'ResourceARN': '*',
        'Version': 1
    }
)

# Critical-path sampling: X-Ray rules match request attributes, not response
# outcomes, so "always sample errors" must be enforced in the SDK; this rule
# instead guarantees 100% tracing of the checkout path
xray.create_sampling_rule(
    SamplingRule={
        'RuleName': 'always-sample-checkout',
        'Priority': 1,          # Lowest number = evaluated first (valid range 1-9999)
        'FixedRate': 1.0,       # 100% of matching requests
        'ReservoirSize': 100,
        'ServiceName': 'api',
        'ServiceType': '*',
        'Host': '*',
        'HTTPMethod': '*',
        'URLPath': '/api/checkout',  # Always trace checkout
        'ResourceARN': '*',
        'Version': 1
    }
)

# At 5% sampling: 20M requests/day → 1M traces/day = 30M traces/month
# X-Ray cost: 30M × $5/1M = $150/month
# vs unsampled: 600M traces/month × $5/1M = $3,000/month
# Saving: $2,850/month (95%)
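
The arithmetic generalizes to any request volume and rate. A one-function sketch — the $5-per-million recording price is the list price quoted above, and the function name is my own:

```python
# Sketch: X-Ray trace-recording cost at a given sampling rate.
TRACE_PRICE_PER_MILLION = 5.0  # $ per 1M traces recorded (list price)

def monthly_trace_cost(requests_per_day, sample_rate, days=30):
    traces = requests_per_day * sample_rate * days
    return traces / 1_000_000 * TRACE_PRICE_PER_MILLION

print(round(monthly_trace_cost(20_000_000, 0.05), 2))  # 150.0
print(round(monthly_trace_cost(20_000_000, 1.0), 2))   # 3000.0
```

This ignores the smaller retrieval/scan charge, which only applies to traces you actually query.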

Conclusion

Observability costs are controllable with the right strategies. For CloudWatch: set retention policies on every log group, drop DEBUG/INFO at the agent before ingestion, and use subscription filters to keep only actionable events flowing downstream. For expensive Datadog deployments: evaluate whether a self-hosted Prometheus+Grafana+Loki stack meets your needs at roughly 85% lower cost. Sample traces at 5%, keeping 100% sampling only for errors and critical paths.

The goal is not to reduce observability — it is to stop paying for data you never look at. Most teams find that 80% of their logs are DEBUG-level noise that adds zero value in production.

