blumeops/docs/how-to/runbooks/first-alert-and-runbook.md
Erich Blume 6d65e6928c C2: Deploy infrastructure alerting pipeline (#303)
## Summary

Mikado chain to replace `mise run services-check` with Grafana Unified Alerting backed by ntfy push notifications.

**Design:**
- Grafana Unified Alerting evaluates rules against Prometheus/Loki
- ntfy webhook contact point delivers iOS notifications
- Anti-noise policy: page once per 24h per alert group
- Every alert links to a runbook in `docs/how-to/alerts/`
- services-check eventually queries the alerting API instead of doing its own probes

**Chain (bottom-up):**
1. `configure-grafana-alerting-pipeline` — enable alerting, ntfy contact point, notification policy
2. `first-alert-and-runbook` — end-to-end proof of concept with blackbox probe failure
3. `port-services-check-alerts` — migrate all services-check probes to alert rules + runbooks
4. `refactor-services-check-to-query-alerts` — rewrite services-check to query Grafana API
5. `deploy-infra-alerting` — goal card

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #303
2026-03-22 14:52:56 -07:00

2.6 KiB

title modified tags
First Alert and Runbook 2026-03-22
how-to
alerting

First Alert and Runbook

Create one end-to-end alert as proof of concept — an alert rule that fires, delivers a notification to ntfy with a runbook link, and has a corresponding runbook doc.

What to Do

1. Choose the First Alert

The best candidate is a blackbox probe failure because:

  • Alloy's blackbox exporter already probes 5 services (miniflux, kiwix, transmission, devpi, argocd) at 30s intervals
  • The metric probe_success is already in Prometheus
  • It maps directly to what services-check does (HTTP health checks)
  • A single alert rule with a service label can cover all probed services

2. Create the Alert Rule

Provision via YAML in the alerting provisioning ConfigMap. The rule should:

  • Query probe_success == 0 from Prometheus
  • Fire after the condition persists for 2 minutes (avoid flapping)
  • Include labels: severity: warning, service: {{ $labels.instance }}
  • Include annotations: summary, runbook_url pointing to the runbook doc

3. Create the Runbook

Write docs/how-to/runbooks/runbook-service-probe-failure.md as a how-to doc explaining:

  • What the alert means
  • How to check which service is down
  • Common causes and resolution steps
  • How to silence the alert if the downtime is planned

4. Verify End-to-End

  • Stop one of the probed services (e.g., scale miniflux to 0)
  • Wait for the alert to fire (~2 minutes)
  • Confirm ntfy notification arrives with correct summary and runbook link
  • Click the runbook link and verify it reaches docs.eblu.me
  • Scale the service back up
  • Confirm "resolved" notification arrives
  • Confirm no repeat notification during the 24h window

Key Details

  • Grafana alert rules can be provisioned as YAML files alongside contact points and notification policies
  • The blackbox probe metrics from Alloy use the job name blackbox and include an instance label with the service name
  • The runbook URL format: https://docs.eblu.me/how-to/runbooks/runbook-service-probe-failure

Verification

  • Alert rule appears in Grafana UI under Alerting → Alert Rules
  • Simulated failure triggers ntfy notification within ~3 minutes
  • Notification includes service name, summary, and clickable runbook link
  • Resolution triggers a "resolved" notification
  • No repeat notification within 24h window