## Summary

Mikado chain to replace `mise run services-check` with Grafana Unified Alerting backed by ntfy push notifications.

**Design:**

- Grafana Unified Alerting evaluates rules against Prometheus/Loki
- ntfy webhook contact point delivers iOS notifications
- Anti-noise policy: page once per 24h per alert group
- Every alert links to a runbook in `docs/how-to/alerts/`
- services-check eventually queries the alerting API instead of doing its own probes

**Chain (bottom-up):**

1. `configure-grafana-alerting-pipeline`: enable alerting, ntfy contact point, notification policy
2. `first-alert-and-runbook`: end-to-end proof of concept with blackbox probe failure
3. `port-services-check-alerts`: migrate all services-check probes to alert rules + runbooks
4. `refactor-services-check-to-query-alerts`: rewrite services-check to query Grafana API
5. `deploy-infra-alerting`: goal card

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #303
| title | modified | tags |
|---|---|---|
| Runbook: PostgreSQL Cluster Unhealthy | 2026-03-22 | |
# Runbook: PostgreSQL Cluster Unhealthy
Alert name: PostgresClusterUnhealthy
The CNPG collector metrics endpoint is down, indicating the PostgreSQL cluster is not responding.
## Affected Services
The blumeops-pg CNPG cluster on indri's minikube runs databases for:
- TeslaMate
- Authentik (cross-cluster from ringtail)
- Immich
- Grafana dashboards (TeslaMate datasource)
## Diagnostic Steps
- Check CNPG cluster status:

      kubectl get cluster blumeops-pg -n databases --context=minikube-indri
      kubectl get pods -n databases -l cnpg.io/cluster=blumeops-pg --context=minikube-indri

- Check pod logs:

      kubectl logs -n databases -l cnpg.io/cluster=blumeops-pg --context=minikube-indri --tail=30

- Check whether PostgreSQL accepts connections, using `pg_isready`:

      pg_isready -h pg.ops.eblu.me -p 5432

- Check PVC storage:

      kubectl get pvc -n databases --context=minikube-indri
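The diagnostic steps above can be collected into one helper so they run in the same order every time. This is a minimal sketch, not part of the repository: the function names (`run`, `pg_diagnose`, `pg_isready_meaning`) and the `DRY_RUN` switch are illustrative assumptions, while the cluster name, namespace, context, host, and port are taken from the commands in this runbook. The exit-status meanings follow the documented behaviour of `pg_isready`.

```shell
# Values copied from the runbook commands.
CONTEXT="minikube-indri"
NAMESPACE="databases"
CLUSTER="blumeops-pg"
PG_HOST="pg.ops.eblu.me"
PG_PORT="5432"

# Print a command, then execute it unless DRY_RUN=1 (hypothetical switch).
# A failing command is reported but does not stop the remaining checks.
run() {
  printf '+ %s\n' "$*"
  if [ "${DRY_RUN:-0}" != "1" ]; then
    "$@" || printf 'command exited with status %s\n' "$?"
  fi
}

# Translate a pg_isready exit status into its documented meaning.
pg_isready_meaning() {
  case "$1" in
    0) echo "accepting connections" ;;
    1) echo "rejecting connections (for example during startup)" ;;
    2) echo "no response to the connection attempt" ;;
    3) echo "no attempt made (for example invalid parameters)" ;;
    *) echo "unknown status $1" ;;
  esac
}

# Run the four runbook checks in order.
pg_diagnose() {
  run kubectl get cluster "$CLUSTER" -n "$NAMESPACE" --context="$CONTEXT"
  run kubectl get pods -n "$NAMESPACE" -l "cnpg.io/cluster=$CLUSTER" --context="$CONTEXT"
  run kubectl logs -n "$NAMESPACE" -l "cnpg.io/cluster=$CLUSTER" --context="$CONTEXT" --tail=30
  run pg_isready -h "$PG_HOST" -p "$PG_PORT"
  run kubectl get pvc -n "$NAMESPACE" --context="$CONTEXT"
}
```

Calling `DRY_RUN=1 pg_diagnose` only prints the commands, which is a cheap way to review them before running `pg_diagnose` against the cluster. If `pg_isready` reports a non-zero status, `pg_isready_meaning 2` (for example) explains what that status means.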
## Common Causes
- Pod crash: OOM, disk full, or configuration error
- PVC storage full: check with `kubectl exec` into the pod and `df -h`
- Minikube issue: if the node is under memory pressure, CNPG pods may be evicted
- Network: Caddy L4 proxy (`pg.ops.eblu.me`) may be misconfigured
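For the "PVC storage full" cause, the `df` check can be turned into a simple threshold filter. The sketch below is an assumption-laden illustration: `df_over_threshold` is a hypothetical helper, the 90% default is arbitrary, and the pod name `blumeops-pg-1` in the usage comment is a guess at the instance name and should be replaced with a name from `kubectl get pods`. It reads `df -P` output (the POSIX single-line format) rather than `df -h` so the columns are predictable.

```shell
# Read `df -P` output on stdin and print "<mount point> <capacity>" for every
# filesystem at or above the given usage percentage (default 90).
# Mount points containing spaces are not handled in this sketch.
df_over_threshold() {
  awk -v limit="${1:-90}" '
    NR > 1 {                    # skip the header line
      use = $5                  # Capacity column, e.g. "95%"
      sub(/%/, "", use)
      if (use + 0 >= limit) print $6, $5
    }'
}

# Possible usage (pod name is hypothetical):
#   kubectl exec -n databases blumeops-pg-1 --context=minikube-indri -- df -P \
#     | df_over_threshold 90
```

An empty result means no filesystem in the pod crossed the threshold, which points away from storage as the cause.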
## Silencing
For planned database maintenance:
- Grafana → Alerting → Silences → Create Silence
- Match `alertname = PostgresClusterUnhealthy`
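The same silence could in principle be created from a script instead of the UI. The sketch below only builds the request body in the Alertmanager silence format; `silence_payload` is a hypothetical helper, and the `GRAFANA_URL` and `GRAFANA_TOKEN` variables and the endpoint path in the usage comment are assumptions that should be verified against the Grafana version in use before relying on them.

```shell
# Build an Alertmanager-style silence body matching one alert name exactly.
# Args: alert name, start time (RFC 3339), end time (RFC 3339), comment.
# The comment is inserted verbatim, so it must not contain double quotes.
silence_payload() {
  printf '{"matchers":[{"name":"alertname","value":"%s","isRegex":false,"isEqual":true}],"startsAt":"%s","endsAt":"%s","createdBy":"runbook","comment":"%s"}\n' \
    "$1" "$2" "$3" "$4"
}

# Possible usage (URL, token, and endpoint path are assumptions):
#   silence_payload PostgresClusterUnhealthy \
#       2026-03-22T10:00:00Z 2026-03-22T12:00:00Z "planned maintenance" \
#     | curl -sS -X POST \
#         -H "Authorization: Bearer $GRAFANA_TOKEN" \
#         -H "Content-Type: application/json" \
#         --data @- \
#         "$GRAFANA_URL/api/alertmanager/grafana/api/v2/silences"
```

Passing explicit start and end times keeps the helper portable, since relative date arithmetic differs between GNU and BSD `date`.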
## Related
- postgresql — CNPG cluster reference
- deploy-infra-alerting — Alerting pipeline overview