## Summary Mikado chain to replace `mise run services-check` with Grafana Unified Alerting backed by ntfy push notifications. **Design:** - Grafana Unified Alerting evaluates rules against Prometheus/Loki - ntfy webhook contact point delivers iOS notifications - Anti-noise policy: page once per 24h per alert group - Every alert links to a runbook in `docs/how-to/alerts/` - services-check eventually queries the alerting API instead of doing its own probes **Chain (bottom-up):** 1. `configure-grafana-alerting-pipeline` — enable alerting, ntfy contact point, notification policy 2. `first-alert-and-runbook` — end-to-end proof of concept with blackbox probe failure 3. `port-services-check-alerts` — migrate all services-check probes to alert rules + runbooks 4. `refactor-services-check-to-query-alerts` — rewrite services-check to query Grafana API 5. `deploy-infra-alerting` — goal card 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #303
2.2 KiB
2.2 KiB
| title | modified | tags | |||
|---|---|---|---|---|---|
| Runbook: Service Probe Failure | 2026-03-22 |
|
Runbook: Service Probe Failure
Alert name: ServiceProbeFailure
A blackbox HTTP health check has failed for 2+ minutes, meaning a service is not responding to its health endpoint.
Affected Services
This alert covers services probed by the Alloy blackbox exporter on indri's minikube cluster:
| Service | Health Endpoint |
|---|---|
| miniflux | /healthcheck |
| kiwix | / |
| transmission | /transmission/web/ |
| devpi | /+api |
| argocd | /healthz |
The failing service is identified by the service label in the alert (extracted from the job label).
Diagnostic Steps
-
Check which service is down — the alert label
servicetells you. You can also run:kubectl get pods -n <namespace> --context=minikube-indri -
Check pod status — look for CrashLoopBackOff, OOMKilled, or pending pods:
kubectl describe pod -n <namespace> <pod-name> --context=minikube-indri -
Check pod logs:
kubectl logs -n <namespace> <pod-name> --context=minikube-indri --tail=50 -
Check if minikube itself is healthy:
ssh indri 'minikube status' -
Check NFS mounts (kiwix, transmission depend on sifaka NFS):
ssh indri 'df -h | grep Volumes'
Common Causes
- Pod crashed — check logs, restart with
kubectl delete pod - NFS mount lost — sifaka offline or AutoMounter not running. SSH to indri and check
/Volumes/ - Resource exhaustion — check
kubectl top pods -n <namespace>for memory/CPU pressure - Minikube paused/stopped —
ssh indri 'minikube status', restart if needed
Silencing
For planned maintenance, silence this alert in Grafana:
- Go to Alerting → Silences → Create Silence
- Match label
alertname = ServiceProbeFailure - Optionally match
service = <specific-service>to silence only one - Set duration for your maintenance window
Related
- deploy-infra-alerting — Alerting pipeline overview
- configure-grafana-alerting-pipeline — Pipeline configuration