The Alert Noise Problem
The average engineering team receives 85 alerts per day (PagerDuty State of Digital Operations 2025). Of those, 52% are false positives or so noisy that engineers ignore them. Alert fatigue is the predictable result: engineers stop responding to pages, real incidents get missed, and the on-call rotation becomes a form of punishment.
The root cause isn't too much monitoring — it's monitoring the wrong things. Good alert design follows one principle: an alert should fire if and only if a human needs to take action right now. Not "this metric looks weird," not "something changed," not "CPU is above a static threshold." A page wakes someone at 3am. If that page doesn't require a human response within minutes, it shouldn't be a page.
Alert Design Principles
The Four Golden Signals (Start Here)
```yaml
# The Four Golden Signals:
#   Latency:    how long requests take
#   Traffic:    how many requests per second
#   Errors:     error rate or count
#   Saturation: how "full" the service is (CPU, memory, queue depth)
#
# Alert on SYMPTOMS (user-facing), not CAUSES (internal metrics).
# WRONG: alert when CPU > 80%
# RIGHT: alert when error rate > 1% or latency P99 > 2s

# Examples of symptom-based alerts:
- alert: HighErrorRate
  expr: |
    sum by (service) (rate(http_requests_total{code=~"5.."}[5m]))
      / sum by (service) (rate(http_requests_total[5m])) > 0.01
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Error rate above 1% for {{ $labels.service }}"
    description: "Current error rate: {{ $value | humanizePercentage }}"

- alert: HighLatency
  expr: |
    histogram_quantile(0.99,
      sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
    ) > 2.0
  for: 5m
  labels:
    severity: warning
```
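For intuition about what the `HighLatency` rule computes: `histogram_quantile` finds the bucket containing the target rank, then interpolates linearly within it. A plain-Python sketch of that calculation (made-up bucket data, not the Prometheus implementation):

```python
def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: sorted list of (upper_bound, cumulative_count), mirroring
    Prometheus's *_bucket series. Linear interpolation within the bucket
    that contains the target rank, as histogram_quantile() does.
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            # Target rank falls in this bucket: interpolate linearly
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# Hypothetical cumulative counts: 800 requests <= 0.5s, 950 <= 1s, 995 <= 2s, 1000 <= 5s
buckets = [(0.5, 800), (1.0, 950), (2.0, 995), (5.0, 1000)]
p99 = histogram_quantile(0.99, buckets)  # rank 990 falls in the (1s, 2s] bucket
```

With these buckets the P99 lands partway through the (1s, 2s] bucket, just under the 2s alert threshold — which also shows why the accuracy of `histogram_quantile` depends entirely on how finely the buckets are spaced around your threshold.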
Avoiding Alert Anti-Patterns
```yaml
# ANTI-PATTERN 1: Static thresholds on variable metrics
# BAD: CPU alert fires during every planned batch job
- alert: HighCPU
  expr: node_cpu_usage > 0.8   # ❌ Always fires on batch jobs

# GOOD: Alert on sustained saturation AND an elevated 5xx rate together
# ("on()" lets the per-node metric match the label-less aggregated rate)
- alert: CPUSaturationCausingErrors
  expr: |
    node_cpu_usage > 0.9
    and on()
    sum(rate(http_requests_total{code=~"5.."}[5m])) > 0.05
  for: 10m   # Must be sustained

# ANTI-PATTERN 2: Too short "for:" duration (flapping)
# BAD: Fires and resolves every few minutes
- alert: ServiceDown
  expr: up == 0
  for: 0s   # ❌ Fires instantly, flaps on restarts

# GOOD: Give the service time to restart
- alert: ServiceDown
  expr: up == 0
  for: 5m   # Only alert if down for 5+ minutes

# ANTI-PATTERN 3: Alerting on what you CAN'T fix
# BAD: Alert when an upstream API is slow (you can't fix their API)
- alert: UpstreamSlowResponse
  expr: external_api_latency > 5   # ❌ Not actionable for you

# GOOD: Alert on your service's handling of the upstream issue
- alert: CircuitBreakerOpen
  expr: circuit_breaker_state{service="payment-api"} == 1
  for: 2m
  annotations:
    runbook: "https://runbooks.company.com/circuit-breaker"
```
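The effect of the `for:` duration on flapping is easy to simulate: the condition must hold for N consecutive evaluations before the alert fires, so a brief restart never pages. A minimal sketch in plain Python (not Prometheus internals, just the pending-state logic):

```python
def fires(samples, for_intervals):
    """Return the evaluation indices where an alert transitions to firing.

    samples: list of booleans — whether the alert condition (e.g. up == 0)
             was true at each scrape interval.
    for_intervals: number of consecutive true evaluations the "for:"
             clause requires before the alert fires.
    """
    firing_at = []
    streak = 0
    for i, cond in enumerate(samples):
        streak = streak + 1 if cond else 0
        if streak == for_intervals:  # pending -> firing transition
            firing_at.append(i)
    return firing_at

# up == 0 for 2 intervals during a restart, then a real sustained outage
restart = [False] * 5 + [True] * 2 + [False] * 5
outage  = [False] * 3 + [True] * 10
```

With `for_intervals=5` (a 5m `for:` at a 1m evaluation interval), the restart never fires while the real outage still pages; with `for_intervals=1` (the `for: 0s` anti-pattern), the restart pages someone for nothing.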
Prometheus Recording Rules: Pre-compute Expensive Queries
```yaml
# rules/recording-rules.yaml
groups:
  - name: http_metrics
    interval: 1m   # Pre-compute every minute
    rules:
      # Pre-computed request rate by service
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job, method, status_code)

      # Pre-computed error ratio (used in multiple alerts)
      - record: job:http_error_ratio:rate5m
        expr: |
          sum(rate(http_requests_total{status_code=~"5.."}[5m])) by (job)
          /
          sum(rate(http_requests_total[5m])) by (job)

      # P99 latency by service (expensive without a recording rule)
      - record: job:http_request_duration_p99:rate5m
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le)
          )

      # Availability (1 - error rate)
      - record: job:http_availability:rate1h
        expr: |
          1 - (
            sum(rate(http_requests_total{status_code=~"5.."}[1h])) by (job)
            /
            sum(rate(http_requests_total[1h])) by (job)
          )

  - name: resource_metrics
    rules:
      # Node memory available percentage
      - record: node:memory_available_ratio
        expr: |
          node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes

      # Pod CPU throttling ratio
      - record: pod:cpu_throttling_ratio:rate5m
        expr: |
          sum(rate(container_cpu_cfs_throttled_seconds_total[5m])) by (pod, namespace)
          /
          sum(rate(container_cpu_cfs_periods_total[5m])) by (pod, namespace)
```
SLO-Based Alerting with Multi-Window Burn Rate
SLO-based alerting is the gold standard from Google's SRE book. Instead of alerting on individual metrics, you alert when you're burning through your error budget too fast to meet your SLO.
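The multiplier arithmetic is worth sanity-checking before you commit to thresholds. For a 30-day SLO window, a burn rate of B exhausts the error budget in 30/B days, and the standard 14.4x "fast burn" threshold corresponds to spending 2% of the budget per hour. A quick check in plain Python:

```python
SLO = 0.999
ERROR_BUDGET = 1 - SLO        # 0.1% of requests may fail
WINDOW_HOURS = 30 * 24        # 30-day SLO window

def hours_to_exhaustion(burn_rate):
    """At a constant burn rate B, the budget lasts window / B hours."""
    return WINDOW_HOURS / burn_rate

def budget_spent(burn_rate, hours):
    """Fraction of the total error budget consumed after `hours` at rate B."""
    return burn_rate * hours / WINDOW_HOURS

# The error-ratio threshold the critical alert compares against:
# burn rate multiplier times the budget fraction.
fast_threshold = 14.4 * ERROR_BUDGET   # 0.0144, i.e. a 1.44% error rate
```

So `budget_spent(14.4, 1)` is 2% of the monthly budget gone in one hour (budget fully exhausted in about two days), while a 3x slow burn gives you roughly ten days of runway — enough to handle it during business hours, which is exactly why it routes to a warning rather than a page.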
```yaml
# SLO: 99.9% availability (43.2 minutes of downtime/month allowed)
# Error budget: 0.1% of requests can fail
# Multi-window burn rate (catches both fast burns and slow burns)
# Assumes job:http_error_ratio recording rules exist for each window used below.
groups:
  - name: slo-alerts
    rules:
      # Fast burn: a 14.4x burn rate spends 2% of the 30-day budget per hour
      # and would exhaust it in ~2 days. Each threshold is checked over a
      # long window (1h / 6h) AND a short window (5m / 30m), so the alert
      # fires quickly and also resolves quickly once the burn stops.
      - alert: SLOBurnRateCritical
        expr: |
          (
            job:http_error_ratio:rate1h{job="api"} > (14.4 * 0.001)
            and
            job:http_error_ratio:rate5m{job="api"} > (14.4 * 0.001)
          )
          or
          (
            job:http_error_ratio:rate6h{job="api"} > (6 * 0.001)
            and
            job:http_error_ratio:rate30m{job="api"} > (6 * 0.001)
          )
        for: 2m
        labels:
          severity: critical
          team: backend
          slo: availability
        annotations:
          summary: "SLO burn rate critical for {{ $labels.job }}"
          description: "Error rate {{ $value | humanizePercentage }} — burning error budget rapidly"
          runbook: "https://runbooks.company.com/slo-burn-rate"
          dashboard: "https://grafana.company.com/d/slo-dash"

      # Slow burn: a sustained 3x burn rate would exhaust the budget in ~10 days
      - alert: SLOBurnRateWarning
        expr: |
          (
            job:http_error_ratio:rate1d{job="api"} > (3 * 0.001)
            and
            job:http_error_ratio:rate2h{job="api"} > (3 * 0.001)
          )
        for: 60m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: "SLO burn rate elevated for {{ $labels.job }}"
          description: "At the current rate, the error budget is exhausted in ~10 days"
```
Alertmanager Configuration
Complete Routing Configuration
```yaml
# alertmanager.yml
global:
  slack_api_url: 'https://hooks.slack.com/services/...'
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.company.com:587'
  smtp_from: 'alerts@company.com'
  smtp_auth_username: 'alertmanager'
  smtp_auth_password_file: /etc/alertmanager/smtp_password

templates:
  - '/etc/alertmanager/templates/*.tmpl'

route:
  # Default: route everything to the #ops-alerts Slack channel
  receiver: 'slack-ops'
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s        # Wait before sending the first notification
  group_interval: 5m     # Wait between notifications for the same group
  repeat_interval: 3h    # Re-notify after this time if still firing
  routes:
    # Critical alerts: PagerDuty + Slack
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      continue: true     # Also evaluate subsequent routes
    - match:
        severity: critical
      receiver: 'slack-critical'
    # Database alerts: DBA team
    - match_re:
        service: '^(postgres|mysql|redis).*'
      receiver: 'slack-dba'
      group_by: ['alertname', 'service']
    # Silenced during maintenance windows
    - match:
        silenced: 'true'
      receiver: 'null'
    # Warnings: Slack only, no paging
    - match:
        severity: warning
      receiver: 'slack-ops'
      repeat_interval: 12h
    # Watchdog: used to confirm the alerting pipeline is working
    - match:
        alertname: Watchdog
      receiver: 'null'

receivers:
  - name: 'null'
  - name: 'slack-ops'
    slack_configs:
      - channel: '#ops-alerts'
        title: '{{ template "slack.company.title" . }}'
        text: '{{ template "slack.company.text" . }}'
        color: '{{ template "slack.company.color" . }}'
        actions:
          - type: button
            text: 'Runbook'
            url: '{{ (index .Alerts 0).Annotations.runbook }}'
          - type: button
            text: 'Dashboard'
            url: '{{ (index .Alerts 0).Annotations.dashboard }}'
          - type: button
            text: 'Silence 4h'
            url: '{{ template "silence_url" . }}'
  - name: 'slack-critical'
    slack_configs:
      - channel: '#incidents'
        title: '🚨 CRITICAL: {{ template "slack.company.title" . }}'
        color: 'danger'
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - routing_key: 'YOUR_PAGERDUTY_INTEGRATION_KEY'
        description: '{{ template "pagerduty.company.description" . }}'
        links:
          - href: '{{ (index .Alerts 0).Annotations.runbook }}'
            text: 'Runbook'
          - href: '{{ (index .Alerts 0).Annotations.dashboard }}'
            text: 'Dashboard'
        severity: '{{ if eq (index .Alerts 0).Labels.severity "critical" }}critical{{ else }}warning{{ end }}'
  - name: 'slack-dba'
    slack_configs:
      - channel: '#database-alerts'
        title: 'Database Alert: {{ template "slack.company.title" . }}'

inhibit_rules:
  # Inhibit warnings when a critical alert is firing for the same service
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster', 'service']
  # Inhibit all other alerts when an entire cluster is down
  - source_match:
      alertname: 'ClusterDown'
    target_match_re:
      alertname: '.*'
    equal: ['cluster']
```
Alertmanager Templates
```
{{ /* /etc/alertmanager/templates/slack.tmpl */ }}

{{ define "slack.company.title" }}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}]
{{ .GroupLabels.SortedPairs.Values | join " " }}
{{ if gt (len .CommonLabels) (len .GroupLabels) }}
({{ with .CommonLabels.Remove .GroupLabels.Names }}{{ .Values | join " " }}{{ end }})
{{ end }}
{{ end }}

{{ define "slack.company.text" }}
{{ range .Alerts }}
*Alert:* {{ .Annotations.summary }}{{ if .Labels.severity }} - {{ .Labels.severity }}{{ end }}
*Description:* {{ .Annotations.description }}
*Service:* {{ .Labels.service }}
*Started:* {{ .StartsAt | since }}
{{ if .Annotations.runbook }}*Runbook:* {{ .Annotations.runbook }}{{ end }}
{{ end }}
{{ end }}

{{ define "slack.company.color" }}
{{ if eq .Status "firing" }}
{{ if eq (index .Alerts 0).Labels.severity "critical" }}danger
{{ else if eq (index .Alerts 0).Labels.severity "warning" }}warning
{{ else }}#439FE0{{ end }}
{{ else }}good{{ end }}
{{ end }}
```
Grafana Dashboard Best Practices
Runbook-Linked Dashboards
```json
{
  "panels": [{
    "title": "API Error Rate",
    "type": "timeseries",
    "links": [
      {
        "title": "Runbook: High Error Rate",
        "url": "https://runbooks.company.com/high-error-rate",
        "targetBlank": true
      },
      {
        "title": "Related Logs (Loki)",
        "url": "/explore?orgId=1&left=%5B%22now-1h%22%2C%22now%22%2C%22Loki%22%2C%7B%22expr%22%3A%22%7Bapp%3D%5C%22api%5C%22%7D+%7C%3D+%5C%22error%5C%22%22%7D%5D",
        "targetBlank": true
      }
    ],
    "thresholds": {
      "steps": [
        { "color": "green", "value": null },
        { "color": "yellow", "value": 0.005 },
        { "color": "red", "value": 0.01 }
      ]
    }
  }]
}
```
Grafana Alerting (Unified Alerting)
```hcl
# Grafana 10+ unified alerting (preferred over Prometheus alerts for Grafana-managed infra)
# Alert rule (via Grafana API or Terraform)
resource "grafana_rule_group" "api_alerts" {
  name             = "API Alerts"
  folder_uid       = grafana_folder.alerts.uid
  interval_seconds = 60

  rule {
    name      = "High Error Rate"
    condition = "B"   # The expression query below decides whether the rule fires

    data {
      ref_id         = "A"
      datasource_uid = "prometheus"
      model = jsonencode({
        expr          = "job:http_error_ratio:rate5m{job='api'}"
        intervalMs    = 1000
        maxDataPoints = 43200
      })
    }

    data {
      ref_id         = "B"
      datasource_uid = "__expr__"   # Expression datasource (legacy configs used "-100")
      model = jsonencode({
        conditions = [{
          evaluator = { params = [0.01], type = "gt" }
          operator  = { type = "and" }
          query     = { params = ["A"] }
          reducer   = { params = [], type = "last" }
          type      = "query"
        }]
        type = "classic_conditions"
      })
    }

    annotations = {
      runbook = "https://runbooks.company.com/high-error-rate"
      summary = "API error rate above 1%"
    }

    labels = {
      severity = "critical"
      team     = "backend"
    }

    notification_settings {
      contact_point  = "pagerduty-critical"
      group_wait     = "30s"
      group_interval = "5m"
    }
  }
}
```
Alert Testing with amtool
```bash
# Install amtool
go install github.com/prometheus/alertmanager/cmd/amtool@latest

# Test routing — see which receiver an alert would go to
amtool --alertmanager.url http://localhost:9093 config routes test \
  --verify.receivers=pagerduty-critical severity=critical service=api

# Check current alerts
amtool alert query

# Silence alerts during maintenance
amtool silence add alertname=~".*" environment=production \
  --duration=2h --comment="Scheduled maintenance window - DB upgrade"

# List active silences
amtool silence query

# Expire a silence early
amtool silence expire SILENCE_ID

# Validate the Alertmanager config
amtool check-config /etc/alertmanager/alertmanager.yml
```
Alert Runbook Template
# Alert: HighErrorRate
## Summary
API error rate exceeds 1% for more than 5 minutes.
## Impact
Users may be experiencing failures. Check error rate and affected endpoints.
## Severity
Critical — Page on-call immediately
## Diagnosis Steps
### 1. Check which endpoints are failing
Open dashboard: [API Error Dashboard](https://grafana.company.com/d/api-errors)
Or run:
```
topk(10, sum by (path, method) (rate(http_requests_total{code=~"5.."}[5m])))
```
### 2. Check application logs
```bash
kubectl logs -n production -l app=api --tail=100 | grep ERROR
# Or in Grafana Explore with Loki (assumes logfmt-formatted logs):
# {app="api"} | logfmt | level="error"
```
### 3. Check dependencies
- Database: [DB Dashboard](https://grafana.company.com/d/postgres)
- Redis: [Redis Dashboard](https://grafana.company.com/d/redis)
- External APIs: Check circuit breaker status
### 4. Common causes and fixes
| Cause | Symptoms | Fix |
|-------|----------|-----|
| DB connection pool exhausted | High query wait time | Restart app pods, check pool config |
| Memory leak | Rising memory + OOM kills | Rolling restart |
| Upstream API down | Circuit breaker open | Wait for upstream, manual fallback |
| Bad deploy | Errors started after deploy | Rollback: `kubectl rollout undo deployment/api` |
## Escalation
1. Acknowledge within 10 minutes
2. Update #incidents Slack channel
3. If unable to resolve in 30 minutes, escalate to team lead
Conclusion
Great alerting is about ruthless prioritization. Start by auditing every existing alert: delete anything that's been silenced for 30+ days. Convert static threshold alerts to SLO-based burn rate alerts. Add runbook links to every alert. Route critical alerts to PagerDuty, warnings to Slack.
The goal is zero 3am pages for issues that can wait until morning, and instant notification for the small set of issues that genuinely require immediate action. When your on-call engineers trust that every page is actionable, they respond faster, resolve faster, and don't burn out.
Marcus Rodriguez
Lead DevOps Engineer specializing in CI/CD pipelines, container orchestration, and infrastructure automation.