The Alert Noise Problem
The average engineering team receives 85 alerts per day (PagerDuty State of Digital Operations 2025). Of those, 52% are false positives or so noisy that engineers ignore them. Alert fatigue is the predictable result: engineers stop responding to pages, real incidents get missed, and the on-call rotation becomes a form of punishment.
The root cause isn't too much monitoring — it's monitoring the wrong things. Good alert design follows one principle: an alert should fire if and only if a human needs to take action right now. Not "this metric looks weird," not "something changed," not "CPU is above a static threshold." A page wakes someone at 3am. If that page doesn't require a human response within minutes, it shouldn't be a page.
Alert Design Principles
The Four Golden Signals (Start Here)
```yaml
# The Four Golden Signals:
#   Latency:    how long requests take
#   Traffic:    how many requests per second
#   Errors:     error rate or count
#   Saturation: how "full" the service is (CPU, memory, queue depth)
#
# Alert on SYMPTOMS (user-facing), not CAUSES (internal metrics).
# WRONG: alert when CPU > 80%
# RIGHT: alert when error rate > 1% or latency P99 > 2s

# Examples of symptom-based alerts:
- alert: HighErrorRate
  expr: |
    sum by (service) (rate(http_requests_total{code=~"5.."}[5m]))
      / sum by (service) (rate(http_requests_total[5m])) > 0.01
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Error rate above 1% for {{ $labels.service }}"
    description: "Current error rate: {{ $value | humanizePercentage }}"

- alert: HighLatency
  expr: |
    histogram_quantile(0.99,
      sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
    ) > 2.0
  for: 5m
  labels:
    severity: warning
```
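For intuition about what the `HighLatency` rule computes: `histogram_quantile` finds the bucket containing the target rank, then interpolates linearly within it. A plain-Python sketch of that calculation (made-up bucket data, not the Prometheus implementation):

```python
def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: sorted list of (upper_bound, cumulative_count), mirroring
    Prometheus's *_bucket series. Linear interpolation within the bucket
    that contains the target rank, as histogram_quantile() does.
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            # Target rank falls in this bucket: interpolate linearly
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# Hypothetical cumulative counts: 800 requests <= 0.5s, 950 <= 1s, 995 <= 2s, 1000 <= 5s
buckets = [(0.5, 800), (1.0, 950), (2.0, 995), (5.0, 1000)]
p99 = histogram_quantile(0.99, buckets)  # rank 990 falls in the (1s, 2s] bucket
```

With these buckets the P99 lands partway through the (1s, 2s] bucket, just under the 2s alert threshold — which also shows why the accuracy of `histogram_quantile` depends entirely on how finely the buckets are spaced around your threshold.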
Avoiding Alert Anti-Patterns
```yaml
# ANTI-PATTERN 1: Static thresholds on variable metrics
# BAD: CPU alert fires during every planned batch job
- alert: HighCPU
  expr: node_cpu_usage > 0.8   # ❌ Always fires on batch jobs

# GOOD: Alert on sustained saturation AND an elevated 5xx rate together
# ("on()" lets the per-node metric match the label-less aggregated rate)
- alert: CPUSaturationCausingErrors
  expr: |
    node_cpu_usage > 0.9
    and on()
    sum(rate(http_requests_total{code=~"5.."}[5m])) > 0.05
  for: 10m   # Must be sustained

# ANTI-PATTERN 2: Too short "for:" duration (flapping)
# BAD: Fires and resolves every few minutes
- alert: ServiceDown
  expr: up == 0
  for: 0s   # ❌ Fires instantly, flaps on restarts

# GOOD: Give the service time to restart
- alert: ServiceDown
  expr: up == 0
  for: 5m   # Only alert if down for 5+ minutes

# ANTI-PATTERN 3: Alerting on what you CAN'T fix
# BAD: Alert when an upstream API is slow (you can't fix their API)
- alert: UpstreamSlowResponse
  expr: external_api_latency > 5   # ❌ Not actionable for you

# GOOD: Alert on your service's handling of the upstream issue
- alert: CircuitBreakerOpen
  expr: circuit_breaker_state{service="payment-api"} == 1
  for: 2m
  annotations:
    runbook: "https://runbooks.company.com/circuit-breaker"
```
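The effect of the `for:` duration on flapping is easy to simulate: the condition must hold for N consecutive evaluations before the alert fires, so a brief restart never pages. A minimal sketch in plain Python (not Prometheus internals, just the pending-state logic):

```python
def fires(samples, for_intervals):
    """Return the evaluation indices where an alert transitions to firing.

    samples: list of booleans — whether the alert condition (e.g. up == 0)
             was true at each scrape interval.
    for_intervals: number of consecutive true evaluations the "for:"
             clause requires before the alert fires.
    """
    firing_at = []
    streak = 0
    for i, cond in enumerate(samples):
        streak = streak + 1 if cond else 0
        if streak == for_intervals:  # pending -> firing transition
            firing_at.append(i)
    return firing_at

# up == 0 for 2 intervals during a restart, then a real sustained outage
restart = [False] * 5 + [True] * 2 + [False] * 5
outage  = [False] * 3 + [True] * 10
```

With `for_intervals=5` (a 5m `for:` at a 1m evaluation interval), the restart never fires while the real outage still pages; with `for_intervals=1` (the `for: 0s` anti-pattern), the restart pages someone for nothing.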
Prometheus Recording Rules: Pre-compute Expensive Queries
```yaml
# rules/recording-rules.yaml
groups:
  - name: http_metrics
    interval: 1m   # Pre-compute every minute
    rules:
      # Pre-computed request rate by service
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job, method, status_code)

      # Pre-computed error ratio (used in multiple alerts)
      - record: job:http_error_ratio:rate5m
        expr: |
          sum(rate(http_requests_total{status_code=~"5.."}[5m])) by (job)
          /
          sum(rate(http_requests_total[5m])) by (job)

      # P99 latency by service (expensive without a recording rule)
      - record: job:http_request_duration_p99:rate5m
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le)
          )

      # Availability (1 - error rate)
      - record: job:http_availability:rate1h
        expr: |
          1 - (
            sum(rate(http_requests_total{status_code=~"5.."}[1h])) by (job)
            /
            sum(rate(http_requests_total[1h])) by (job)
          )

  - name: resource_metrics
    rules:
      # Node memory available percentage
      - record: node:memory_available_ratio
        expr: |
          node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes

      # Pod CPU throttling ratio
      - record: pod:cpu_throttling_ratio:rate5m
        expr: |
          sum(rate(container_cpu_cfs_throttled_seconds_total[5m])) by (pod, namespace)
          /
          sum(rate(container_cpu_cfs_periods_total[5m])) by (pod, namespace)
```
SLO-Based Alerting with Multi-Window Burn Rate
SLO-based alerting is the gold standard from Google's SRE book. Instead of alerting on individual metrics, you alert when you're burning through your error budget too fast to meet your SLO.
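The multiplier arithmetic is worth sanity-checking before you commit to thresholds. For a 30-day SLO window, a burn rate of B exhausts the error budget in 30/B days, and the standard 14.4x "fast burn" threshold corresponds to spending 2% of the budget per hour. A quick check in plain Python:

```python
SLO = 0.999
ERROR_BUDGET = 1 - SLO        # 0.1% of requests may fail
WINDOW_HOURS = 30 * 24        # 30-day SLO window

def hours_to_exhaustion(burn_rate):
    """At a constant burn rate B, the budget lasts window / B hours."""
    return WINDOW_HOURS / burn_rate

def budget_spent(burn_rate, hours):
    """Fraction of the total error budget consumed after `hours` at rate B."""
    return burn_rate * hours / WINDOW_HOURS

# The error-ratio threshold the critical alert compares against:
# burn rate multiplier times the budget fraction.
fast_threshold = 14.4 * ERROR_BUDGET   # 0.0144, i.e. a 1.44% error rate
```

So `budget_spent(14.4, 1)` is 2% of the monthly budget gone in one hour (budget fully exhausted in about two days), while a 3x slow burn gives you roughly ten days of runway — enough to handle it during business hours, which is exactly why it routes to a warning rather than a page.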
```yaml
# SLO: 99.9% availability (43.2 minutes of downtime/month allowed)
# Error budget: 0.1% of requests can fail
# Multi-window burn rate (catches both fast burns and slow burns)
# Assumes job:http_error_ratio recording rules exist for each window used below.
groups:
  - name: slo-alerts
    rules:
      # Fast burn: a 14.4x burn rate spends 2% of the 30-day budget per hour
      # and would exhaust it in ~2 days. Each threshold is checked over a
      # long window (1h / 6h) AND a short window (5m / 30m), so the alert
      # fires quickly and also resolves quickly once the burn stops.
      - alert: SLOBurnRateCritical
        expr: |
          (
            job:http_error_ratio:rate1h{job="api"} > (14.4 * 0.001)
            and
            job:http_error_ratio:rate5m{job="api"} > (14.4 * 0.001)
          )
          or
          (
            job:http_error_ratio:rate6h{job="api"} > (6 * 0.001)
            and
            job:http_error_ratio:rate30m{job="api"} > (6 * 0.001)
          )
        for: 2m
        labels:
          severity: critical
          team: backend
          slo: availability
        annotations:
          summary: "SLO burn rate critical for {{ $labels.job }}"
          description: "Error rate {{ $value | humanizePercentage }} — burning error budget rapidly"
          runbook: "https://runbooks.company.com/slo-burn-rate"
          dashboard: "https://grafana.company.com/d/slo-dash"

      # Slow burn: a sustained 3x burn rate would exhaust the budget in ~10 days
      - alert: SLOBurnRateWarning
        expr: |
          (
            job:http_error_ratio:rate1d{job="api"} > (3 * 0.001)
            and
            job:http_error_ratio:rate2h{job="api"} > (3 * 0.001)
          )
        for: 60m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: "SLO burn rate elevated for {{ $labels.job }}"
          description: "At the current rate, the error budget is exhausted in ~10 days"
```
Alertmanager Configuration
Complete Routing Configuration
```yaml
# alertmanager.yml
global:
  slack_api_url: 'https://hooks.slack.com/services/...'
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.company.com:587'
  smtp_from: 'alerts@company.com'
  smtp_auth_username: 'alertmanager'
  smtp_auth_password_file: /etc/alertmanager/smtp_password

templates:
  - '/etc/alertmanager/templates/*.tmpl'

route:
  # Default: route everything to the #ops-alerts Slack channel
  receiver: 'slack-ops'
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s        # Wait before sending the first notification
  group_interval: 5m     # Wait between notifications for the same group
  repeat_interval: 3h    # Re-notify after this time if still firing
  routes:
    # Critical alerts: PagerDuty + Slack
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      continue: true     # Also evaluate subsequent routes
    - match:
        severity: critical
      receiver: 'slack-critical'
    # Database alerts: DBA team
    - match_re:
        service: '^(postgres|mysql|redis).*'
      receiver: 'slack-dba'
      group_by: ['alertname', 'service']
    # Silenced during maintenance windows
    - match:
        silenced: 'true'
      receiver: 'null'
    # Warnings: Slack only, no paging
    - match:
        severity: warning
      receiver: 'slack-ops'
      repeat_interval: 12h
    # Watchdog: used to confirm the alerting pipeline is working
    - match:
        alertname: Watchdog
      receiver: 'null'

receivers:
  - name: 'null'
  - name: 'slack-ops'
    slack_configs:
      - channel: '#ops-alerts'
        title: '{{ template "slack.company.title" . }}'
        text: '{{ template "slack.company.text" . }}'
        color: '{{ template "slack.company.color" . }}'
        actions:
          - type: button
            text: 'Runbook'
            url: '{{ (index .Alerts 0).Annotations.runbook }}'
          - type: button
            text: 'Dashboard'
            url: '{{ (index .Alerts 0).Annotations.dashboard }}'
          - type: button
            text: 'Silence 4h'
            url: '{{ template "silence_url" . }}'
  - name: 'slack-critical'
    slack_configs:
      - channel: '#incidents'
        title: '🚨 CRITICAL: {{ template "slack.company.title" . }}'
        color: 'danger'
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - routing_key: 'YOUR_PAGERDUTY_INTEGRATION_KEY'
        description: '{{ template "pagerduty.company.description" . }}'
        links:
          - href: '{{ (index .Alerts 0).Annotations.runbook }}'
            text: 'Runbook'
          - href: '{{ (index .Alerts 0).Annotations.dashboard }}'
            text: 'Dashboard'
        severity: '{{ if eq (index .Alerts 0).Labels.severity "critical" }}critical{{ else }}warning{{ end }}'
  - name: 'slack-dba'
    slack_configs:
      - channel: '#database-alerts'
        title: 'Database Alert: {{ template "slack.company.title" . }}'

inhibit_rules:
  # Inhibit warnings when a critical alert is firing for the same service
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster', 'service']
  # Inhibit all other alerts when an entire cluster is down
  - source_match:
      alertname: 'ClusterDown'
    target_match_re:
      alertname: '.*'
    equal: ['cluster']
```
Alertmanager Templates
```
{{ /* /etc/alertmanager/templates/slack.tmpl */ }}

{{ define "slack.company.title" }}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}]
{{ .GroupLabels.SortedPairs.Values | join " " }}
{{ if gt (len .CommonLabels) (len .GroupLabels) }}
({{ with .CommonLabels.Remove .GroupLabels.Names }}{{ .Values | join " " }}{{ end }})
{{ end }}
{{ end }}

{{ define "slack.company.text" }}
{{ range .Alerts }}
*Alert:* {{ .Annotations.summary }}{{ if .Labels.severity }} - {{ .Labels.severity }}{{ end }}
*Description:* {{ .Annotations.description }}
*Service:* {{ .Labels.service }}
*Started:* {{ .StartsAt | since }}
{{ if .Annotations.runbook }}*Runbook:* {{ .Annotations.runbook }}{{ end }}
{{ end }}
{{ end }}

{{ define "slack.company.color" }}
{{ if eq .Status "firing" }}
{{ if eq (index .Alerts 0).Labels.severity "critical" }}danger
{{ else if eq (index .Alerts 0).Labels.severity "warning" }}warning
{{ else }}#439FE0{{ end }}
{{ else }}good{{ end }}
{{ end }}
```
Grafana Dashboard Best Practices
Runbook-Linked Dashboards
```json
{
  "panels": [{
    "title": "API Error Rate",
    "type": "timeseries",
    "links": [
      {
        "title": "Runbook: High Error Rate",
        "url": "https://runbooks.company.com/high-error-rate",
        "targetBlank": true
      },
      {
        "title": "Related Logs (Loki)",
        "url": "/explore?orgId=1&left=%5B%22now-1h%22%2C%22now%22%2C%22Loki%22%2C%7B%22expr%22%3A%22%7Bapp%3D%5C%22api%5C%22%7D+%7C%3D+%5C%22error%5C%22%22%7D%5D",
        "targetBlank": true
      }
    ],
    "thresholds": {
      "steps": [
        { "color": "green", "value": null },
        { "color": "yellow", "value": 0.005 },
        { "color": "red", "value": 0.01 }
      ]
    }
  }]
}
```
Grafana Alerting (Unified Alerting)
```hcl
# Grafana 10+ unified alerting (preferred over Prometheus alerts for Grafana-managed infra)
# Alert rule (via Grafana API or Terraform)
resource "grafana_rule_group" "api_alerts" {
  name             = "API Alerts"
  folder_uid       = grafana_folder.alerts.uid
  interval_seconds = 60

  rule {
    name      = "High Error Rate"
    condition = "B"   # The expression query below decides whether the rule fires

    data {
      ref_id         = "A"
      datasource_uid = "prometheus"
      model = jsonencode({
        expr          = "job:http_error_ratio:rate5m{job='api'}"
        intervalMs    = 1000
        maxDataPoints = 43200
      })
    }

    data {
      ref_id         = "B"
      datasource_uid = "__expr__"   # Expression datasource (legacy configs used "-100")
      model = jsonencode({
        conditions = [{
          evaluator = { params = [0.01], type = "gt" }
          operator  = { type = "and" }
          query     = { params = ["A"] }
          reducer   = { params = [], type = "last" }
          type      = "query"
        }]
        type = "classic_conditions"
      })
    }

    annotations = {
      runbook = "https://runbooks.company.com/high-error-rate"
      summary = "API error rate above 1%"
    }

    labels = {
      severity = "critical"
      team     = "backend"
    }

    notification_settings {
      contact_point  = "pagerduty-critical"
      group_wait     = "30s"
      group_interval = "5m"
    }
  }
}
```
Alert Testing with amtool
```bash
# Install amtool
go install github.com/prometheus/alertmanager/cmd/amtool@latest

# Test routing — see which receiver an alert would go to
amtool --alertmanager.url http://localhost:9093 config routes test \
  --verify.receivers=pagerduty-critical severity=critical service=api

# Check current alerts
amtool alert query

# Silence alerts during maintenance
amtool silence add alertname=~".*" environment=production \
  --duration=2h --comment="Scheduled maintenance window - DB upgrade"

# List active silences
amtool silence query

# Expire a silence early
amtool silence expire SILENCE_ID

# Validate the Alertmanager config
amtool check-config /etc/alertmanager/alertmanager.yml
```
Alert Runbook Template
# Alert: HighErrorRate
## Summary
API error rate exceeds 1% for more than 5 minutes.
## Impact
Users may be experiencing failures. Check error rate and affected endpoints.
## Severity
Critical — Page on-call immediately
## Diagnosis Steps
### 1. Check which endpoints are failing
Open dashboard: [API Error Dashboard](https://grafana.company.com/d/api-errors)
Or run:
```
topk(10, sum by (path, method) (rate(http_requests_total{code=~"5.."}[5m])))
```
### 2. Check application logs
```bash
kubectl logs -n production -l app=api --tail=100 | grep ERROR
# Or in Grafana Explore with Loki (assumes logfmt-formatted logs):
# {app="api"} | logfmt | level="error"
```
### 3. Check dependencies
- Database: [DB Dashboard](https://grafana.company.com/d/postgres)
- Redis: [Redis Dashboard](https://grafana.company.com/d/redis)
- External APIs: Check circuit breaker status
### 4. Common causes and fixes
| Cause | Symptoms | Fix |
|-------|----------|-----|
| DB connection pool exhausted | High query wait time | Restart app pods, check pool config |
| Memory leak | Rising memory + OOM kills | Rolling restart |
| Upstream API down | Circuit breaker open | Wait for upstream, manual fallback |
| Bad deploy | Errors started after deploy | Rollback: `kubectl rollout undo deployment/api` |
## Escalation
1. Acknowledge within 10 minutes
2. Update #incidents Slack channel
3. If unable to resolve in 30 minutes, escalate to team lead
Conclusion
Great alerting is about ruthless prioritization. Start by auditing every existing alert: delete anything that's been silenced for 30+ days. Convert static threshold alerts to SLO-based burn rate alerts. Add runbook links to every alert. Route critical alerts to PagerDuty, warnings to Slack.
The goal is zero 3am pages for issues that can wait until morning, and instant notification for the small set of issues that genuinely require immediate action. When your on-call engineers trust that every page is actionable, they respond faster, resolve faster, and don't burn out.
Marcus Rodriguez
Lead DevOps Engineer specializing in CI/CD pipelines, container orchestration, and infrastructure automation.