Erich Blume 6d65e6928c C2: Deploy infrastructure alerting pipeline (#303 )

## Summary

Mikado chain to replace `mise run services-check` with Grafana Unified Alerting backed by ntfy push notifications.

**Design:**
- Grafana Unified Alerting evaluates rules against Prometheus/Loki
- ntfy webhook contact point delivers iOS notifications
- Anti-noise policy: page once per 24h per alert group
- Every alert links to a runbook in `docs/how-to/alerts/`
- services-check eventually queries the alerting API instead of doing its own probes

**Chain (bottom-up):**
1. `configure-grafana-alerting-pipeline` — enable alerting, ntfy contact point, notification policy
2. `first-alert-and-runbook` — end-to-end proof of concept with blackbox probe failure
3. `port-services-check-alerts` — migrate all services-check probes to alert rules + runbooks
4. `refactor-services-check-to-query-alerts` — rewrite services-check to query Grafana API
5. `deploy-infra-alerting` — goal card

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #303

2026-03-22 14:52:56 -07:00

2.6 KiB

Raw Blame History

title

modified

First Alert and Runbook

Create one end-to-end alert as proof of concept — an alert rule that fires, delivers a notification to ntfy with a runbook link, and has a corresponding runbook doc.

What to Do

1. Choose the First Alert

The best candidate is a blackbox probe failure because:

Alloy's blackbox exporter already probes 5 services (miniflux, kiwix, transmission, devpi, argocd) at 30s intervals
The metric probe_success is already in Prometheus
It maps directly to what services-check does (HTTP health checks)
A single alert rule with a service label can cover all probed services

2. Create the Alert Rule

Provision via YAML in the alerting provisioning ConfigMap. The rule should:

Query probe_success == 0 from Prometheus
Fire after the condition persists for 2 minutes (avoid flapping)
Include labels: severity: warning, service: {{ $labels.instance }}
Include annotations: summary, runbook_url pointing to the runbook doc

3. Create the Runbook

Write docs/how-to/runbooks/runbook-service-probe-failure.md as a how-to doc explaining:

What the alert means
How to check which service is down
Common causes and resolution steps
How to silence the alert if the downtime is planned

4. Verify End-to-End

Stop one of the probed services (e.g., scale miniflux to 0)
Wait for the alert to fire (~2 minutes)
Confirm ntfy notification arrives with correct summary and runbook link
Click the runbook link and verify it reaches docs.eblu.me
Scale the service back up
Confirm "resolved" notification arrives
Confirm no repeat notification during the 24h window

Key Details

Grafana alert rules can be provisioned as YAML files alongside contact points and notification policies
The blackbox probe metrics from Alloy use the job name blackbox and include an instance label with the service name
The runbook URL format: https://docs.eblu.me/how-to/runbooks/runbook-service-probe-failure

Verification

Alert rule appears in Grafana UI under Alerting → Alert Rules
Simulated failure triggers ntfy notification within ~3 minutes
Notification includes service name, summary, and clickable runbook link
Resolution triggers a "resolved" notification
No repeat notification within 24h window

configure-grafana-alerting-pipeline — Prerequisite: pipeline must be working
deploy-infra-alerting — Parent goal
port-services-check-alerts — Next: port remaining checks
runbook-service-probe-failure — The runbook created for this alert

2.6 KiB Raw Blame History