blumeops/docs/how-to/runbooks/first-alert-and-runbook.md

68 lines
2.6 KiB
Markdown
Raw Permalink Normal View History

---
title: First Alert and Runbook
modified: 2026-03-22
tags:
- how-to
- alerting
---
# First Alert and Runbook
Create one end-to-end alert as proof of concept — an alert rule that fires, delivers a notification to ntfy with a runbook link, and has a corresponding runbook doc.
## What to Do
### 1. Choose the First Alert
The best candidate is a **blackbox probe failure** because:
- Alloy's blackbox exporter already probes 5 services (miniflux, kiwix, transmission, devpi, argocd) at 30s intervals
- The metric `probe_success` is already in Prometheus
- It maps directly to what services-check does (HTTP health checks)
- A single alert rule with a `service` label can cover all probed services
### 2. Create the Alert Rule
Provision via YAML in the alerting provisioning ConfigMap. The rule should:
- Query `probe_success == 0` from Prometheus
- Fire after the condition persists for 2 minutes (avoid flapping)
- Include labels: `severity: warning`, `service: {{ $labels.instance }}`
- Include annotations: `summary`, `runbook_url` pointing to the runbook doc
### 3. Create the Runbook
Write `docs/how-to/runbooks/runbook-service-probe-failure.md` as a how-to doc explaining:
- What the alert means
- How to check which service is down
- Common causes and resolution steps
- How to silence the alert if the downtime is planned
### 4. Verify End-to-End
- Stop one of the probed services (e.g., scale miniflux to 0)
- Wait for the alert to fire (~2 minutes)
- Confirm ntfy notification arrives with correct summary and runbook link
- Click the runbook link and verify it reaches docs.eblu.me
- Scale the service back up
- Confirm "resolved" notification arrives
- Confirm no repeat notification during the 24h window
## Key Details
- Grafana alert rules can be provisioned as YAML files alongside contact points and notification policies
- The blackbox probe metrics from Alloy use the job name `blackbox` and include an `instance` label with the service name
- The runbook URL format: `https://docs.eblu.me/how-to/runbooks/runbook-service-probe-failure`
## Verification
- [ ] Alert rule appears in Grafana UI under Alerting → Alert Rules
- [ ] Simulated failure triggers ntfy notification within ~3 minutes
- [ ] Notification includes service name, summary, and clickable runbook link
- [ ] Resolution triggers a "resolved" notification
- [ ] No repeat notification within 24h window
## Related
- [[configure-grafana-alerting-pipeline]] — Prerequisite: pipeline must be working
- [[deploy-infra-alerting]] — Parent goal
- [[port-services-check-alerts]] — Next: port remaining checks
- [[runbook-service-probe-failure]] — The runbook created for this alert