## Summary

Mikado chain to replace `mise run services-check` with Grafana Unified Alerting backed by ntfy push notifications.

**Design:**

- Grafana Unified Alerting evaluates rules against Prometheus/Loki
- ntfy webhook contact point delivers iOS notifications
- Anti-noise policy: page once per 24h per alert group
- Every alert links to a runbook in `docs/how-to/alerts/`
- services-check eventually queries the alerting API instead of doing its own probes

**Chain (bottom-up):**

1. `configure-grafana-alerting-pipeline`: enable alerting, ntfy contact point, notification policy
2. `first-alert-and-runbook`: end-to-end proof of concept with blackbox probe failure
3. `port-services-check-alerts`: migrate all services-check probes to alert rules + runbooks
4. `refactor-services-check-to-query-alerts`: rewrite services-check to query Grafana API
5. `deploy-infra-alerting`: goal card

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #303
| title | modified | tags |
|---|---|---|
| Runbook: Pod Not Ready | 2026-03-22 | |
Runbook: Pod Not Ready
Alert name: PodNotReady
A Kubernetes pod has been in a not-ready state for 5+ minutes.
Diagnostic Steps
- Identify the pod from the alert labels (`pod`, `namespace`):
  `kubectl describe pod <pod> -n <namespace> --context=minikube-indri`
- Check events. Look for scheduling failures, image pull errors, or probe failures:
  `kubectl get events -n <namespace> --context=minikube-indri --sort-by='.lastTimestamp' | tail -20`
- Check logs:
  `kubectl logs <pod> -n <namespace> --context=minikube-indri --tail=50`
- Check node resources:
  `kubectl top nodes --context=minikube-indri`
  `kubectl top pods -n <namespace> --context=minikube-indri`
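The diagnostic steps above could be bundled into one helper so they run in the same order every time. This is a minimal sketch, not part of the repository: the names `diagnose_pod`, `run`, `DRY_RUN`, and `KUBE_CONTEXT` are hypothetical, and only the `kubectl` commands themselves come from this runbook.

```shell
# Sketch: run the runbook's diagnostic commands for one pod.
# KUBE_CONTEXT and DRY_RUN are hypothetical knobs introduced for this example.
KUBE_CONTEXT="${KUBE_CONTEXT:-minikube-indri}"

# Run a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Usage: diagnose_pod <pod> <namespace>
diagnose_pod() {
  local pod="$1" ns="$2"
  # Pod status, conditions, and recent pod-level events
  run kubectl describe pod "$pod" -n "$ns" --context="$KUBE_CONTEXT"
  # Namespace events sorted by timestamp; tail keeps the last 20 lines
  run kubectl get events -n "$ns" --context="$KUBE_CONTEXT" --sort-by=.lastTimestamp | tail -20
  # Last 50 log lines from the pod
  run kubectl logs "$pod" -n "$ns" --context="$KUBE_CONTEXT" --tail=50
  # Node and pod resource usage
  run kubectl top nodes --context="$KUBE_CONTEXT"
  run kubectl top pods -n "$ns" --context="$KUBE_CONTEXT"
}
```

Calling `DRY_RUN=1 diagnose_pod <pod> <namespace>` prints the five commands without touching the cluster, which is a cheap way to check the label values copied from the alert before running anything.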
Common Causes
- CrashLoopBackOff — app is crashing on startup, check logs
- ImagePullBackOff — container image not found or registry unreachable
- Pending — insufficient resources (CPU/memory), or PVC not bound
- Readiness probe failing — service is running but not healthy
- NFS mount issue — services depending on sifaka (kiwix, transmission, navidrome, jellyfin) will fail if NFS is down
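To connect a pod's reported state to the list above, a small lookup could translate the reason string into the matching hint. This is an illustrative sketch: the function name `cause_hint` is hypothetical, and `ErrImagePull` is included only because the kubelet reports it for the same image-pull failure before backing off.

```shell
# Sketch: map a pod state/reason string to the hint from "Common Causes".
# Usage: cause_hint <reason>
cause_hint() {
  case "$1" in
    CrashLoopBackOff)
      echo "app is crashing on startup, check logs" ;;
    ImagePullBackOff|ErrImagePull)
      echo "container image not found or registry unreachable" ;;
    Pending)
      echo "insufficient resources (CPU/memory), or PVC not bound" ;;
    *)
      # Readiness-probe and NFS problems do not show up as a distinct reason
      echo "no direct match; check readiness probes and NFS mounts" ;;
  esac
}

# One way to obtain the waiting reason for a pod's containers:
#   kubectl get pod <pod> -n <namespace> --context=minikube-indri \
#     -o jsonpath='{.status.containerStatuses[*].state.waiting.reason}'
# Pending is a pod phase rather than a waiting reason; read it from
# '{.status.phase}' instead.
```

The fallback branch is deliberate: a failing readiness probe or a missing NFS mount usually leaves the container running, so those two causes still need the event and log checks from the diagnostic steps.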
Silencing
- Grafana → Alerting → Silences → Create Silence
- Match `alertname = PodNotReady`
- Optionally match `namespace = <namespace>` to silence a specific service
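The same silence could probably be created without the UI by posting to Grafana's built-in Alertmanager. The sketch below only builds the JSON body with the two matchers described above; the function name `silence_payload`, the `createdBy` and `comment` values, and the `GRAFANA_URL` and `GRAFANA_TOKEN` variables are assumptions for illustration, and the endpoint path should be checked against the Grafana version in use.

```shell
# Sketch: build an Alertmanager-style silence body for PodNotReady.
# Usage: silence_payload <startsAt> <endsAt> [namespace]
# Timestamps are RFC 3339 strings, e.g. 2026-03-22T10:00:00Z.
silence_payload() {
  local starts="$1" ends="$2" ns="${3:-}"
  # Always match on the alert name
  local matchers='{"name":"alertname","value":"PodNotReady","isRegex":false,"isEqual":true}'
  # Optionally narrow the silence to one namespace
  if [ -n "$ns" ]; then
    matchers="$matchers,{\"name\":\"namespace\",\"value\":\"$ns\",\"isRegex\":false,\"isEqual\":true}"
  fi
  printf '{"matchers":[%s],"startsAt":"%s","endsAt":"%s","createdBy":"runbook","comment":"PodNotReady silenced via runbook"}\n' \
    "$matchers" "$starts" "$ends"
}

# Hypothetical submission (GRAFANA_URL and GRAFANA_TOKEN are placeholders):
#   curl -X POST \
#     -H "Authorization: Bearer $GRAFANA_TOKEN" \
#     -H "Content-Type: application/json" \
#     -d "$(silence_payload 2026-03-22T10:00:00Z 2026-03-22T12:00:00Z <namespace>)" \
#     "$GRAFANA_URL/api/alertmanager/grafana/api/v2/silences"
```

Taking the start and end times as arguments keeps the function independent of `date` flag differences between GNU and BSD systems, and it forces every silence to have an explicit expiry.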
Related
- deploy-infra-alerting — Alerting pipeline overview