blumeops/docs/how-to/runbooks/runbook-postgres-unhealthy.md
Erich Blume 6d65e6928c C2: Deploy infrastructure alerting pipeline (#303)
## Summary

Mikado chain to replace `mise run services-check` with Grafana Unified Alerting backed by ntfy push notifications.

**Design:**
- Grafana Unified Alerting evaluates rules against Prometheus/Loki
- ntfy webhook contact point delivers iOS notifications
- Anti-noise policy: page once per 24h per alert group
- Every alert links to a runbook in `docs/how-to/alerts/`
- services-check eventually queries the alerting API instead of doing its own probes

**Chain (bottom-up):**
1. `configure-grafana-alerting-pipeline` — enable alerting, ntfy contact point, notification policy
2. `first-alert-and-runbook` — end-to-end proof of concept with blackbox probe failure
3. `port-services-check-alerts` — migrate all services-check probes to alert rules + runbooks
4. `refactor-services-check-to-query-alerts` — rewrite services-check to query Grafana API
5. `deploy-infra-alerting` — goal card

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #303
2026-03-22 14:52:56 -07:00

63 lines
1.6 KiB
Markdown

---
title: "Runbook: PostgreSQL Cluster Unhealthy"
modified: 2026-03-22
tags:
- how-to
- alerting
- runbook
---
# Runbook: PostgreSQL Cluster Unhealthy
**Alert name:** `PostgresClusterUnhealthy`
The CNPG collector metrics endpoint is down, indicating the PostgreSQL cluster is not responding.
## Affected Services
The `blumeops-pg` CNPG cluster on indri's minikube runs databases for:
- TeslaMate
- Authentik (cross-cluster from ringtail)
- Immich
- Grafana dashboards (TeslaMate datasource)
## Diagnostic Steps
1. **Check CNPG cluster status**:
```fish
kubectl get cluster blumeops-pg -n databases --context=minikube-indri
kubectl get pods -n databases -l cnpg.io/cluster=blumeops-pg --context=minikube-indri
```
2. **Check pod logs**:
```fish
kubectl logs -n databases -l cnpg.io/cluster=blumeops-pg --context=minikube-indri --tail=30
```
3. **Check if pg_isready**:
```fish
pg_isready -h pg.ops.eblu.me -p 5432
```
4. **Check PVC storage**:
```fish
kubectl get pvc -n databases --context=minikube-indri
```
## Common Causes
- **Pod crash** — OOM, disk full, or configuration error
- **PVC storage full** — check with `kubectl exec` into the pod and `df -h`
- **Minikube issue** — if the node is under memory pressure, CNPG pods may be evicted
- **Network** — Caddy L4 proxy (`pg.ops.eblu.me`) may be misconfigured
## Silencing
For planned database maintenance:
1. Grafana → Alerting → Silences → Create Silence
2. Match `alertname = PostgresClusterUnhealthy`
## Related
- [[postgresql]] — CNPG cluster reference
- [[deploy-infra-alerting]] — Alerting pipeline overview