## Summary

Mikado chain to replace `mise run services-check` with Grafana Unified Alerting backed by ntfy push notifications.

**Design:**

- Grafana Unified Alerting evaluates rules against Prometheus/Loki
- ntfy webhook contact point delivers iOS notifications
- Anti-noise policy: page once per 24h per alert group
- Every alert links to a runbook in `docs/how-to/alerts/`
- services-check eventually queries the alerting API instead of doing its own probes

**Chain (bottom-up):**

1. `configure-grafana-alerting-pipeline`: enable alerting, ntfy contact point, notification policy
2. `first-alert-and-runbook`: end-to-end proof of concept with blackbox probe failure
3. `port-services-check-alerts`: migrate all services-check probes to alert rules + runbooks
4. `refactor-services-check-to-query-alerts`: rewrite services-check to query Grafana API
5. `deploy-infra-alerting`: goal card

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #303

---
title: "Runbook: PostgreSQL Cluster Unhealthy"
modified: 2026-03-22
tags:
  - how-to
  - alerting
  - runbook
---

# Runbook: PostgreSQL Cluster Unhealthy

**Alert name:** `PostgresClusterUnhealthy`

The CNPG collector metrics endpoint is down, indicating the PostgreSQL cluster is not responding.

## Affected Services

The `blumeops-pg` CNPG cluster on indri's minikube runs databases for:
- TeslaMate
- Authentik (cross-cluster from ringtail)
- Immich
- Grafana dashboards (TeslaMate datasource)

## Diagnostic Steps

1. **Check CNPG cluster status**:
   ```fish
   kubectl get cluster blumeops-pg -n databases --context=minikube-indri
   kubectl get pods -n databases -l cnpg.io/cluster=blumeops-pg --context=minikube-indri
   ```

2. **Check pod logs**:
   ```fish
   kubectl logs -n databases -l cnpg.io/cluster=blumeops-pg --context=minikube-indri --tail=30
   ```

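3. **Check whether PostgreSQL accepts connections** (`pg_isready`):
   ```fish
   # Exit status: 0 = accepting connections, 1 = rejecting, 2 = no response, 3 = no attempt made
   pg_isready -h pg.ops.eblu.me -p 5432
   ```
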
4. **Check PVC storage**:
   ```fish
   kubectl get pvc -n databases --context=minikube-indri
   ```

## Common Causes

- **Pod crash**: OOM, disk full, or configuration error
- **PVC storage full**: run `df -h` inside the pod via `kubectl exec` to confirm
- **Minikube issue**: if the node is under memory pressure, CNPG pods may be evicted
- **Network**: the Caddy L4 proxy (`pg.ops.eblu.me`) may be misconfigured
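
The causes above can be narrowed down with a few read-only commands. This is only a sketch and not part of the original runbook: the pod name `blumeops-pg-1` is an assumption based on CNPG's usual `<cluster>-<n>` instance naming, so substitute whatever name the `kubectl get pods` diagnostic step reports.

```fish
# ASSUMPTION: blumeops-pg-1 is a placeholder instance pod name;
# replace it with the real name reported by `kubectl get pods`.

# PVC storage full: show filesystem usage from inside the instance pod.
kubectl exec -n databases blumeops-pg-1 --context=minikube-indri -- df -h

# Pod crash: look for "OOMKilled" under "Last State" and review the events at the end.
kubectl describe pod blumeops-pg-1 -n databases --context=minikube-indri

# Minikube issue: check the node's Conditions for MemoryPressure or DiskPressure.
kubectl describe node --context=minikube-indri
```

All three commands only read state, so they should be safe to run while the alert is firing.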

## Silencing

For planned database maintenance:
1. Grafana → Alerting → Silences → Create Silence
2. Match `alertname = PostgresClusterUnhealthy`

## Related

- [[postgresql]]: CNPG cluster reference
- [[deploy-infra-alerting]]: Alerting pipeline overview