Erich Blume 6d65e6928c C2: Deploy infrastructure alerting pipeline (#303)

## Summary

Mikado chain to replace `mise run services-check` with Grafana Unified Alerting backed by ntfy push notifications.

**Design:**
- Grafana Unified Alerting evaluates rules against Prometheus/Loki
- ntfy webhook contact point delivers iOS notifications
- Anti-noise policy: page once per 24h per alert group
- Every alert links to a runbook in `docs/how-to/alerts/`
- services-check eventually queries the alerting API instead of doing its own probes

**Chain (bottom-up):**
1. `configure-grafana-alerting-pipeline` — enable alerting, ntfy contact point, notification policy
2. `first-alert-and-runbook` — end-to-end proof of concept with blackbox probe failure
3. `port-services-check-alerts` — migrate all services-check probes to alert rules + runbooks
4. `refactor-services-check-to-query-alerts` — rewrite services-check to query Grafana API
5. `deploy-infra-alerting` — goal card

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #303

2026-03-22 14:52:56 -07:00

Runbook: Pod Not Ready

Alert name: PodNotReady

A Kubernetes pod has been in a not-ready state for 5+ minutes.

Diagnostic Steps

Identify the pod from the alert labels (pod, namespace):

kubectl describe pod <pod> -n <namespace> --context=minikube-indri

Check events — look for scheduling failures, image pull errors, or probe failures:

kubectl get events -n <namespace> --context=minikube-indri --sort-by='.lastTimestamp' | tail -20

Check logs (add --previous to read output from the last crashed container):

kubectl logs <pod> -n <namespace> --context=minikube-indri --tail=50

Check node resources:

kubectl top nodes --context=minikube-indri
kubectl top pods -n <namespace> --context=minikube-indri
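
The four steps above can be chained into one helper. This is a minimal sketch, not part of the repository: `diagnose_pod`, `run`, `KUBE_CONTEXT`, and `DRY_RUN` are illustrative names, and the context defaults to the `minikube-indri` value used in this runbook.

```shell
#!/usr/bin/env bash
# Sketch: run the diagnostic steps above for one pod, in order.
# KUBE_CONTEXT and DRY_RUN are illustrative variables, not part of the runbook.

KUBE_CONTEXT="${KUBE_CONTEXT:-minikube-indri}"

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

diagnose_pod() {
  local pod="$1" ns="$2"
  if [ -z "$pod" ] || [ -z "$ns" ]; then
    echo "usage: diagnose_pod <pod> <namespace>" >&2
    return 2
  fi
  run kubectl describe pod "$pod" -n "$ns" --context="$KUBE_CONTEXT"
  # Only the 20 most recent events, as in the manual step.
  run kubectl get events -n "$ns" --context="$KUBE_CONTEXT" --sort-by=.lastTimestamp | tail -20
  run kubectl logs "$pod" -n "$ns" --context="$KUBE_CONTEXT" --tail=50
  run kubectl top nodes --context="$KUBE_CONTEXT"
  run kubectl top pods -n "$ns" --context="$KUBE_CONTEXT"
}
```

After sourcing the file, `DRY_RUN=1 diagnose_pod <pod> <namespace>` prints the commands without touching the cluster; dropping `DRY_RUN` runs them. Take the pod and namespace from the alert labels.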

Common Causes

CrashLoopBackOff — app is crashing on startup, check logs
ImagePullBackOff — container image not found or registry unreachable
Pending — insufficient resources (CPU/memory), or PVC not bound
Readiness probe failing — service is running but not healthy
NFS mount issue — services depending on sifaka (kiwix, transmission, navidrome, jellyfin) will fail if NFS is down
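
The causes above can be looked up from the reason Kubernetes reports for the pod. The sketch below is a hypothetical helper (`explain_reason` is not an existing command) that maps a reason string to the matching entry in this list. `ErrImagePull` is included on the assumption that it should be treated the same as `ImagePullBackOff`, since it is the state a pod reports before backing off.

```shell
# Sketch: map a pod phase or container waiting reason to a cause from this runbook.
# explain_reason is a hypothetical helper name.
explain_reason() {
  case "$1" in
    CrashLoopBackOff)
      echo "app is crashing on startup: check logs" ;;
    ImagePullBackOff|ErrImagePull)
      echo "container image not found or registry unreachable" ;;
    Pending)
      echo "insufficient resources (CPU/memory), or PVC not bound" ;;
    *)
      # Readiness probe and NFS problems do not surface as a single reason string.
      echo "no known match: check readiness probe and NFS mounts" ;;
  esac
}
```

The waiting reason can be read with `kubectl get pod <pod> -n <namespace> --context=minikube-indri -o jsonpath='{.status.containerStatuses[*].state.waiting.reason}'`, and the phase (for `Pending`) with `-o jsonpath='{.status.phase}'`. Pass the result to `explain_reason`.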

Silencing

Grafana → Alerting → Silences → Create Silence
Match alertname = PodNotReady
Optionally match namespace = <namespace> to silence a specific service
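
If a silence has to be created without the UI, the same matchers can be expressed as an Alertmanager-style JSON payload. This is a sketch under assumptions: `silence_payload` is a hypothetical helper, the `createdBy` and `comment` values are placeholders, and the field names follow the Alertmanager v2 silence format, which should be checked against the Grafana version in use.

```shell
# Sketch: build an Alertmanager-style silence payload for PodNotReady.
# silence_payload is a hypothetical helper. Timestamps are RFC 3339 strings
# supplied by the caller. Values are not JSON-escaped, so keep them simple.
silence_payload() {
  local starts_at="$1" ends_at="$2" namespace="${3:-}"
  local matchers='{"name":"alertname","value":"PodNotReady","isRegex":false,"isEqual":true}'
  # The namespace matcher is optional, as in the manual steps.
  if [ -n "$namespace" ]; then
    matchers="$matchers,{\"name\":\"namespace\",\"value\":\"$namespace\",\"isRegex\":false,\"isEqual\":true}"
  fi
  printf '{"matchers":[%s],"startsAt":"%s","endsAt":"%s","createdBy":"runbook","comment":"PodNotReady silenced via runbook"}\n' \
    "$matchers" "$starts_at" "$ends_at"
}
```

For example, `silence_payload 2026-03-22T22:00:00Z 2026-03-23T00:00:00Z <namespace>` prints a payload that could be sent with `curl -X POST -H 'Content-Type: application/json'` to the Grafana Alertmanager-compatible silences endpoint. The exact URL and authentication depend on the deployment and are not specified in this runbook.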

Related

deploy-infra-alerting — Alerting pipeline overview
