Erich Blume 6d65e6928c C2: Deploy infrastructure alerting pipeline (#303)

## Summary

Mikado chain to replace `mise run services-check` with Grafana Unified Alerting backed by ntfy push notifications.

**Design:**
- Grafana Unified Alerting evaluates rules against Prometheus/Loki
- ntfy webhook contact point delivers iOS notifications
- Anti-noise policy: page once per 24h per alert group
- Every alert links to a runbook in `docs/how-to/alerts/`
- services-check eventually queries the alerting API instead of doing its own probes

**Chain (bottom-up):**
1. `configure-grafana-alerting-pipeline` — enable alerting, ntfy contact point, notification policy
2. `first-alert-and-runbook` — end-to-end proof of concept with blackbox probe failure
3. `port-services-check-alerts` — migrate all services-check probes to alert rules + runbooks
4. `refactor-services-check-to-query-alerts` — rewrite services-check to query Grafana API
5. `deploy-infra-alerting` — goal card

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #303

2026-03-22 14:52:56 -07:00

3.3 KiB

Port services-check Alerts to Grafana

Systematically migrate the health checks from mise run services-check to Grafana alert rules, each with a corresponding runbook. After this card, the alerting system covers everything services-check does today.

What to Do

1. Inventory and Prioritize

Map each services-check probe to a data source and alert rule. Some checks already have metrics in Prometheus; others need new instrumentation.

Already have metrics (easy):

HTTP endpoint probes → Alloy blackbox exporter (probe_success)
PostgreSQL health → CNPG metrics (cnpg_pg_replication_streaming, cnpg_collector_up)
K8s pod health → kube-state-metrics (kube_pod_status_phase)
ArgoCD sync status → ArgoCD metrics (argocd_app_info with sync/health labels)

Need new probes or metrics:

Local indri services (forgejo, alloy, borgmatic, zot via brew/launchctl) → Alloy host textfile or new probes
Metrics textfile freshness → node_textfile_mtime_seconds (already collected by Alloy on indri)
Ringtail SSH/tailscale health → Alloy blackbox on ringtail or cross-cluster probe
Public services (docs, cv, forge via Fly.io) → Alloy on Fly.io or Grafana synthetic monitoring
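The inventory above can be kept as data so the "easy" and "needs work" buckets fall out mechanically. This is a minimal sketch, not part of the existing tooling: the `Check` type and both helper functions are hypothetical names, and only the probe categories, sources, and metric names come from the card.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    name: str       # services-check probe category, as named in the card
    source: str     # where the signal comes from (or is planned to come from)
    metrics: tuple  # metric names listed in the card; empty if none exist yet


# Entries transcribed from the inventory lists above.
INVENTORY = [
    Check("HTTP endpoint probes", "Alloy blackbox exporter", ("probe_success",)),
    Check("PostgreSQL health", "CNPG metrics",
          ("cnpg_pg_replication_streaming", "cnpg_collector_up")),
    Check("K8s pod health", "kube-state-metrics", ("kube_pod_status_phase",)),
    Check("ArgoCD sync status", "ArgoCD metrics", ("argocd_app_info",)),
    Check("Local indri services", "Alloy host textfile or new probes", ()),
    Check("Metrics textfile freshness", "Alloy on indri",
          ("node_textfile_mtime_seconds",)),
    Check("Ringtail SSH/tailscale health",
          "Alloy blackbox on ringtail or cross-cluster probe", ()),
    Check("Public services", "Alloy on Fly.io or Grafana synthetic monitoring", ()),
]


def ready_for_rules(inventory):
    """Checks that already have at least one metric to alert on."""
    return [c.name for c in inventory if c.metrics]


def needs_instrumentation(inventory):
    """Checks with no metric yet; these need new probes first."""
    return [c.name for c in inventory if not c.metrics]
```

Note that textfile freshness lands in the "ready" bucket here because its metric is already collected, even though the card lists it in the second group.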

2. Add Missing Probes

Extend Alloy configurations where needed:

Alloy on indri: Add blackbox targets for forgejo, zot (local HTTP endpoints)
Alloy on ringtail: Add blackbox targets for ringtail-local services
Open question: decide whether public endpoint probing belongs in the Fly.io Alloy instance or in a separate prober
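After adding blackbox targets, it helps to confirm that each one actually reports `probe_success` before writing rules against it. The sketch below uses the Prometheus HTTP API instant-query endpoint (`/api/v1/query`). The base URL, the target names, and the assumption that each target is identified by the `instance` label are all hypothetical and depend on how Alloy relabels the probes.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def missing_and_failing(response, expected, label="instance"):
    """Split expected targets into those with no sample and those reporting 0.

    `response` is the decoded JSON body of an instant query for probe_success.
    `label` is an assumption: the label that carries the target name.
    """
    seen = {}
    for sample in response["data"]["result"]:
        # Each vector sample is {"metric": {...labels}, "value": [ts, "<number>"]}.
        seen[sample["metric"].get(label)] = float(sample["value"][1])
    missing = sorted(t for t in expected if t not in seen)
    failing = sorted(t for t in expected if seen.get(t) == 0.0)
    return missing, failing


def check_targets(base_url, expected):
    """Query Prometheus and report which expected targets are absent or down."""
    with urlopen(query_url(base_url, "probe_success"), timeout=10) as resp:
        return missing_and_failing(json.load(resp), expected)
```

A target in the `missing` list means the probe is not being scraped at all, which is a configuration problem rather than a service outage.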

3. Create Alert Rules

For each check category, create provisioned Grafana alert rules. Group related checks into alert rule groups (e.g., "indri-services", "k8s-health", "public-endpoints").
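As a rough illustration of what a provisioned rule group could look like, the sketch below builds one as a Python dict that can be dumped to JSON. The field names follow Grafana's alerting file-provisioning format as best I can reconstruct it, so treat them as assumptions and check them against the Grafana documentation before use. The datasource UID, folder name, severity label, and the 5-minute `for` duration are hypothetical placeholders, not values from this repository.

```python
import json


def blackbox_rule(uid, title, runbook_url, datasource_uid="prometheus"):
    """One alert rule that fires when a blackbox probe stops succeeding.

    datasource_uid is a placeholder; use the UID of the real Prometheus
    datasource in Grafana.
    """
    return {
        "uid": uid,
        "title": title,
        "condition": "C",  # refId of the node that decides firing
        "data": [
            {
                # Query node: current value of probe_success per target.
                "refId": "A",
                "relativeTimeRange": {"from": 300, "to": 0},
                "datasourceUid": datasource_uid,
                "model": {"refId": "A", "expr": "probe_success", "instant": True},
            },
            {
                # Expression node: fire when A is below 1 (probe failed).
                "refId": "C",
                "datasourceUid": "__expr__",
                "model": {
                    "refId": "C",
                    "type": "threshold",
                    "expression": "A",
                    "conditions": [{"evaluator": {"type": "lt", "params": [1]}}],
                },
            },
        ],
        "for": "5m",  # assumed pending period
        "noDataState": "Alerting",
        "execErrState": "Error",
        "annotations": {"runbook_url": runbook_url},
        "labels": {"severity": "page"},  # hypothetical routing label
    }


def rule_group(name, folder, rules, interval="1m"):
    """Wrap rules in a provisioning document containing a single group."""
    return {
        "apiVersion": 1,
        "groups": [
            {"orgId": 1, "name": name, "folder": folder,
             "interval": interval, "rules": rules}
        ],
    }
```

Calling `json.dumps(rule_group("public-endpoints", "Infrastructure", [rule]), indent=2)` would produce a document suitable for review. Putting the runbook link in the `runbook_url` annotation keeps each alert tied to its runbook.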

4. Create Runbooks

One runbook per alert type in docs/how-to/runbooks/runbook-<name>.md. Each runbook should cover:

What the alert means
Diagnostic steps
Common fixes
How to silence for planned maintenance
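To keep runbooks uniform, a small generator can stamp out the four sections listed above at the path this card specifies. This is only a sketch: the function names and the `TODO` placeholder text are hypothetical, and the `## ` heading level is an assumption about the docs' conventions.

```python
from pathlib import PurePosixPath

# Section titles taken from the list above.
SECTIONS = [
    "What the alert means",
    "Diagnostic steps",
    "Common fixes",
    "How to silence for planned maintenance",
]


def runbook_path(name):
    """Path of the runbook for alert <name>, following the card's naming scheme."""
    return PurePosixPath("docs/how-to/runbooks") / f"runbook-{name}.md"


def runbook_skeleton(title):
    """Markdown skeleton with one empty section per required topic."""
    lines = [f"# {title}", ""]
    for section in SECTIONS:
        lines += [f"## {section}", "", "TODO", ""]
    return "\n".join(lines)
```

For example, `runbook_path("endpoint-down")` yields `docs/how-to/runbooks/runbook-endpoint-down.md`, and `runbook_skeleton("Endpoint down")` returns the text to write there.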

5. Remove from services-check

As each check is ported, remove it from the services-check script (or mark it as "now handled by alerting"). The goal is that services-check shrinks as alerting grows.
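One way to shrink services-check gradually is to keep a set of ported check names and skip those probes at run time. The sketch below shows the idea in Python. The real script's language and structure are not shown in this card, so the function, the check names, and the result strings are hypothetical.

```python
def run_checks(checks, ported):
    """Run every probe that has not been ported; label the rest.

    checks: mapping of check name -> zero-argument callable returning True on success.
    ported: set of check names now covered by Grafana alert rules.
    """
    results = {}
    for name, probe in checks.items():
        if name in ported:
            # Do not probe; alerting owns this check now.
            results[name] = "now handled by alerting"
            continue
        results[name] = "ok" if probe() else "FAIL"
    return results
```

Moving a name into `ported` is then a one-line change per migrated check, and the set itself documents migration progress.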

Key Details

Don't try to port everything in one session. This card may span multiple work cycles within the C2 chain.
Prioritize checks that have caught real problems in the past
Some checks (like the ArgoCD sync status table) may remain in services-check as a human-readable summary even after alerting covers the failure cases
The Alloy blackbox exporter on k8s already covers 5 services; extending it to more is straightforward

Verification

All HTTP endpoint checks from services-check have corresponding alert rules
Pod health checks have corresponding alert rules
PostgreSQL health has a corresponding alert rule
Each alert rule has a runbook doc in docs/how-to/runbooks/
At least two failure scenarios (ideally three) have been tested end-to-end
services-check script has been updated to reflect ported checks
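The runbook item in this checklist can be verified mechanically. The sketch below assumes that each alert rule name matches the `<name>` part of its `runbook-<name>.md` file, which this card does not state explicitly. The function names are hypothetical.

```python
from pathlib import Path


def expected_runbook(rule_name):
    """File name the card's naming scheme implies for a given rule."""
    return f"runbook-{rule_name}.md"


def missing_runbooks(rule_names, existing_files):
    """Rule names that have no matching runbook file."""
    existing = set(existing_files)
    return sorted(n for n in rule_names if expected_runbook(n) not in existing)


def list_runbooks(repo_root="."):
    """File names of all runbooks currently in the repository checkout."""
    runbook_dir = Path(repo_root) / "docs" / "how-to" / "runbooks"
    return [p.name for p in runbook_dir.glob("runbook-*.md")]
```

Running `missing_runbooks(rule_names, list_runbooks())` from the repository root should return an empty list once every rule has its runbook.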

first-alert-and-runbook — Prerequisite: established the pattern
deploy-infra-alerting — Parent goal
refactor-services-check-to-query-alerts — Next: make services-check query alerts
