Erich Blume 6d65e6928c C2: Deploy infrastructure alerting pipeline (#303 )

## Summary

Mikado chain to replace `mise run services-check` with Grafana Unified Alerting backed by ntfy push notifications.

**Design:**
- Grafana Unified Alerting evaluates rules against Prometheus/Loki
- ntfy webhook contact point delivers iOS notifications
- Anti-noise policy: page once per 24h per alert group
- Every alert links to a runbook in `docs/how-to/alerts/`
- services-check eventually queries the alerting API instead of doing its own probes

**Chain (bottom-up):**
1. `configure-grafana-alerting-pipeline` — enable alerting, ntfy contact point, notification policy
2. `first-alert-and-runbook` — end-to-end proof of concept with blackbox probe failure
3. `port-services-check-alerts` — migrate all services-check probes to alert rules + runbooks
4. `refactor-services-check-to-query-alerts` — rewrite services-check to query Grafana API
5. `deploy-infra-alerting` — goal card

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #303

2026-03-22 14:52:56 -07:00

2.2 KiB

Raw Permalink Blame History

title

modified

Runbook: Service Probe Failure

Alert name: ServiceProbeFailure

A blackbox HTTP health check has failed for 2+ minutes, meaning a service is not responding to its health endpoint.

Affected Services

This alert covers services probed by the Alloy blackbox exporter on indri's minikube cluster:

Service	Health Endpoint
miniflux	`/healthcheck`
kiwix	`/`
transmission	`/transmission/web/`
devpi	`/+api`
argocd	`/healthz`

The failing service is identified by the service label in the alert (extracted from the job label).

Diagnostic Steps

Check which service is down — the alert label service tells you. You can also run:
```
kubectl get pods -n <namespace> --context=minikube-indri
```

Check pod status — look for CrashLoopBackOff, OOMKilled, or pending pods:

kubectl describe pod -n <namespace> <pod-name> --context=minikube-indri

Check pod logs:

kubectl logs -n <namespace> <pod-name> --context=minikube-indri --tail=50

Check if minikube itself is healthy:
```
ssh indri 'minikube status'
```
Check NFS mounts (kiwix, transmission depend on sifaka NFS):
```
ssh indri 'df -h | grep Volumes'
```

Common Causes

Pod crashed — check logs, restart with kubectl delete pod
NFS mount lost — sifaka offline or AutoMounter not running. SSH to indri and check /Volumes/
Resource exhaustion — check kubectl top pods -n <namespace> for memory/CPU pressure
Minikube paused/stopped — ssh indri 'minikube status', restart if needed

Silencing

For planned maintenance, silence this alert in Grafana:

Go to Alerting → Silences → Create Silence
Match label alertname = ServiceProbeFailure
Optionally match service = <specific-service> to silence only one
Set duration for your maintenance window

deploy-infra-alerting — Alerting pipeline overview
configure-grafana-alerting-pipeline — Pipeline configuration

2.2 KiB Raw Permalink Blame History