Erich Blume 6d65e6928c C2: Deploy infrastructure alerting pipeline (#303)

## Summary

Mikado chain to replace `mise run services-check` with Grafana Unified Alerting backed by ntfy push notifications.

**Design:**
- Grafana Unified Alerting evaluates rules against Prometheus/Loki
- ntfy webhook contact point delivers iOS notifications
- Anti-noise policy: page once per 24h per alert group
- Every alert links to a runbook in `docs/how-to/alerts/`
- services-check eventually queries the alerting API instead of doing its own probes
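The anti-noise policy above can be illustrated with a small sketch. This is not how Grafana implements it; in Grafana the behavior would presumably come from the notification policy's grouping and repeat interval settings. The `PagingThrottle` class and its method names are hypothetical, introduced only to show the intended "page once per 24h per alert group" semantics.

```python
from datetime import datetime, timedelta

# "Page once per 24h per alert group", taken from the design notes.
REPEAT_INTERVAL = timedelta(hours=24)


class PagingThrottle:
    """Hypothetical helper: remembers when each alert group last paged."""

    def __init__(self, repeat_interval: timedelta = REPEAT_INTERVAL) -> None:
        self.repeat_interval = repeat_interval
        self._last_paged: dict[str, datetime] = {}

    def should_page(self, group_key: str, now: datetime) -> bool:
        """Return True if this group may page at `now`, and record the page."""
        last = self._last_paged.get(group_key)
        if last is not None and now - last < self.repeat_interval:
            # Still inside the quiet window for this group: stay silent.
            return False
        self._last_paged[group_key] = now
        return True
```

Each group is throttled independently, so a second failing group can still page while the first one is inside its quiet window.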

**Chain (bottom-up):**
1. `configure-grafana-alerting-pipeline` — enable alerting, ntfy contact point, notification policy
2. `first-alert-and-runbook` — end-to-end proof of concept with blackbox probe failure
3. `port-services-check-alerts` — migrate all services-check probes to alert rules + runbooks
4. `refactor-services-check-to-query-alerts` — rewrite services-check to query Grafana API
5. `deploy-infra-alerting` — goal card
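As a rough sketch of what step 4 might look like, the function below turns a list of alerts into a pass/fail report instead of running its own probes. It assumes the alerts have already been fetched and parsed from the Grafana alerting API and that they follow the Alertmanager-style shape (`labels`, `annotations`, `status.state`); the `runbook_url` annotation name and the `summarize_alerts` function are assumptions for illustration, not names taken from this change.

```python
from typing import Any


def summarize_alerts(alerts: list[dict[str, Any]]) -> tuple[int, list[str]]:
    """Return (exit_code, report_lines) for Alertmanager-style alert dicts.

    Assumed shape per alert (not confirmed by the source):
      {"labels": {"alertname": ...},
       "annotations": {"runbook_url": ...},
       "status": {"state": "active" | "suppressed" | "unprocessed"}}
    """
    lines = []
    for alert in alerts:
        # Only alerts that are currently firing count as failures.
        if alert.get("status", {}).get("state") != "active":
            continue
        name = alert.get("labels", {}).get("alertname", "<unnamed>")
        runbook = alert.get("annotations", {}).get("runbook_url", "<no runbook>")
        lines.append(f"FIRING {name}: {runbook}")
    # Non-zero exit code when anything is firing, like a failed check.
    return (1 if lines else 0), sorted(lines)
```

Keeping the fetch separate from the summary means the reporting logic can be tested against canned responses without a running Grafana instance.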

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #303

2026-03-22 14:52:56 -07:00

871 B


Observability

Metrics, logs, traces, and dashboards for BlumeOps infrastructure.

Components

prometheus - Metrics storage and querying
loki - Log aggregation
tempo - Distributed tracing
alloy - Metric, log, and trace collection
grafana - Dashboards and visualization

Alerting

deploy-infra-alerting - Alerting pipeline (Grafana Unified Alerting → ntfy)
runbook-service-probe-failure - Service health check failure runbook
runbook-postgres-unhealthy - PostgreSQL cluster health runbook
runbook-pod-not-ready - Pod not ready runbook
runbook-textfile-stale - Metrics textfile freshness runbook
runbook-frigate-camera-down - Frigate camera health runbook
runbook-argocd-out-of-sync - ArgoCD sync status runbook
