C2(deploy-infra-alerting): impl add probes and alert rules for services-check coverage

Extend Alloy blackbox probes:
- Add prometheus, loki, grafana, teslamate, immich, navidrome
- Now probing 11 services (was 5), covering most HTTP checks from services-check

Add alert rules:
- PostgresClusterUnhealthy: cnpg_collector_up < 1 for 3m (critical)
- PodNotReady: kube_pod_status_ready{condition="true"} == 0 for 5m

Add runbooks:
- runbook-postgres-unhealthy.md
- runbook-pod-not-ready.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-22 12:11:12 -07:00
commit 8e6a803076
parent e33b0bc184
5 changed files with 271 additions and 0 deletions
--- a/argocd/manifests/alloy-k8s/config.alloy
+++ b/argocd/manifests/alloy-k8s/config.alloy
@@ -169,6 +169,43 @@ prometheus.exporter.blackbox "services" {
    address = "http://argocd-server.argocd.svc.cluster.local:80/healthz"
    module  = "http_2xx"
  }
+
+  target {
+    name    = "prometheus"
+    address = "http://prometheus.monitoring.svc.cluster.local:9090/-/healthy"
+    module  = "http_2xx"
+  }
+
+  target {
+    name    = "loki"
+    address = "http://loki.monitoring.svc.cluster.local:3100/ready"
+    module  = "http_2xx"
+  }
+
+  target {
+    name    = "grafana"
+    address = "http://grafana.monitoring.svc.cluster.local:3000/api/health"
+    module  = "http_2xx"
+  }
+
+  target {
+    name    = "teslamate"
+    address = "http://teslamate.teslamate.svc.cluster.local:4000/"
+    module  = "http_2xx"
+  }
+
+  target {
+    name    = "immich"
+    address = "http://immich-server.immich.svc.cluster.local:2283/api/server/ping"
+    module  = "http_2xx"
+  }
+
+  target {
+    name    = "navidrome"
+    address = "http://navidrome.navidrome.svc.cluster.local:4533/"
+    module  = "http_2xx"
+  }
+
 }

 // Scrape blackbox probe results
--- a/argocd/manifests/grafana/alerting.yaml
+++ b/argocd/manifests/grafana/alerting.yaml
@@ -84,6 +84,120 @@ groups:
                    type: and
              refId: C

+  - orgId: 1
+    name: database-health
+    folder: Infrastructure Alerts
+    interval: 60s
+    rules:
+      - uid: postgres-cluster-unhealthy
+        title: PostgresClusterUnhealthy
+        condition: C
+        for: 3m
+        noDataState: Alerting
+        execErrState: Alerting
+        annotations:
+          summary: >-
+            PostgreSQL cluster {{ index $labels "cluster" }} is unhealthy
+          runbook_url: https://docs.eblu.me/how-to/alerts/runbook-postgres-unhealthy
+        labels:
+          severity: critical
+          service: postgresql
+        data:
+          - refId: A
+            datasourceUid: prometheus
+            relativeTimeRange:
+              from: 300
+              to: 0
+            model:
+              expr: cnpg_collector_up
+              interval: ""
+              refId: A
+          - refId: B
+            datasourceUid: "__expr__"
+            relativeTimeRange:
+              from: 0
+              to: 0
+            model:
+              type: reduce
+              expression: A
+              reducer: last
+              settings:
+                mode: dropNN
+              refId: B
+          - refId: C
+            datasourceUid: "__expr__"
+            relativeTimeRange:
+              from: 0
+              to: 0
+            model:
+              type: threshold
+              expression: B
+              conditions:
+                - evaluator:
+                    type: lt
+                    params:
+                      - 1
+                  operator:
+                    type: and
+              refId: C
+
+  - orgId: 1
+    name: pod-health
+    folder: Infrastructure Alerts
+    interval: 60s
+    rules:
+      - uid: pod-not-ready
+        title: PodNotReady
+        condition: C
+        for: 5m
+        noDataState: NoData
+        execErrState: Alerting
+        annotations:
+          summary: >-
+            Pod {{ index $labels "pod" }} in {{ index $labels "namespace" }} is not ready
+          runbook_url: https://docs.eblu.me/how-to/alerts/runbook-pod-not-ready
+        labels:
+          severity: warning
+        data:
+          - refId: A
+            datasourceUid: prometheus
+            relativeTimeRange:
+              from: 300
+              to: 0
+            model:
+              expr: >-
+                kube_pod_status_ready{condition="true"} == 0
+              interval: ""
+              refId: A
+          - refId: B
+            datasourceUid: "__expr__"
+            relativeTimeRange:
+              from: 0
+              to: 0
+            model:
+              type: reduce
+              expression: A
+              reducer: last
+              settings:
+                mode: dropNN
+              refId: B
+          - refId: C
+            datasourceUid: "__expr__"
+            relativeTimeRange:
+              from: 0
+              to: 0
+            model:
+              type: threshold
+              expression: B
+              conditions:
+                - evaluator:
+                    type: lt
+                    params:
+                      - 1
+                  operator:
+                    type: and
+              refId: C
+
 templates:
  - orgId: 1
    name: ntfy-infra
--- a/docs/how-to/alerts/runbook-pod-not-ready.md
+++ b/docs/how-to/alerts/runbook-pod-not-ready.md
@@ -0,0 +1,55 @@
+---
+title: "Runbook: Pod Not Ready"
+modified: 2026-03-22
+tags:
+  - how-to
+  - alerting
+  - runbook
+---
+
+# Runbook: Pod Not Ready
+
+**Alert name:** `PodNotReady`
+
+A Kubernetes pod has been in a not-ready state for 5+ minutes.
+
+## Diagnostic Steps
+
+1. **Identify the pod** from the alert labels (`pod`, `namespace`):
+   ```fish
+   kubectl describe pod <pod> -n <namespace> --context=minikube-indri
+   ```
+
+2. **Check events** — look for scheduling failures, image pull errors, or probe failures:
+   ```fish
+   kubectl get events -n <namespace> --context=minikube-indri --sort-by='.lastTimestamp' | tail -20
+   ```
+
+3. **Check logs**:
+   ```fish
+   kubectl logs <pod> -n <namespace> --context=minikube-indri --tail=50
+   ```
+
+4. **Check node resources**:
+   ```fish
+   kubectl top nodes --context=minikube-indri
+   kubectl top pods -n <namespace> --context=minikube-indri
+   ```
+
+## Common Causes
+
+- **CrashLoopBackOff** — app is crashing on startup, check logs
+- **ImagePullBackOff** — container image not found or registry unreachable
+- **Pending** — insufficient resources (CPU/memory), or PVC not bound
+- **Readiness probe failing** — service is running but not healthy
+- **NFS mount issue** — services depending on sifaka (kiwix, transmission, navidrome, jellyfin) will fail if NFS is down
+
+## Silencing
+
+1. Grafana → Alerting → Silences → Create Silence
+2. Match `alertname = PodNotReady`
+3. Optionally match `namespace = <namespace>` to silence a specific service
+
+## Related
+
+- [[deploy-infra-alerting]] — Alerting pipeline overview
--- a/docs/how-to/alerts/runbook-postgres-unhealthy.md
+++ b/docs/how-to/alerts/runbook-postgres-unhealthy.md
@@ -0,0 +1,63 @@
+---
+title: "Runbook: PostgreSQL Cluster Unhealthy"
+modified: 2026-03-22
+tags:
+  - how-to
+  - alerting
+  - runbook
+---
+
+# Runbook: PostgreSQL Cluster Unhealthy
+
+**Alert name:** `PostgresClusterUnhealthy`
+
+The CNPG collector has reported `cnpg_collector_up < 1` (or no data at all) for 3+ minutes, which usually means the PostgreSQL cluster is not responding.
+
+## Affected Services
+
+The `blumeops-pg` CNPG cluster on indri's minikube runs databases for:
+- TeslaMate
+- Authentik (cross-cluster from ringtail)
+- Immich
+- Grafana dashboards (TeslaMate datasource)
+
+## Diagnostic Steps
+
+1. **Check CNPG cluster status**:
+   ```fish
+   kubectl get cluster blumeops-pg -n databases --context=minikube-indri
+   kubectl get pods -n databases -l cnpg.io/cluster=blumeops-pg --context=minikube-indri
+   ```
+
+2. **Check pod logs**:
+   ```fish
+   kubectl logs -n databases -l cnpg.io/cluster=blumeops-pg --context=minikube-indri --tail=30
+   ```
+
+3. **Check connectivity with `pg_isready`**:
+   ```fish
+   pg_isready -h pg.ops.eblu.me -p 5432
+   ```
+
+4. **Check PVC storage**:
+   ```fish
+   kubectl get pvc -n databases --context=minikube-indri
+   ```
+
+## Common Causes
+
+- **Pod crash** — OOM, disk full, or configuration error
+- **PVC storage full** — check with `kubectl exec` into the pod and `df -h`
+- **Minikube issue** — if the node is under memory pressure, CNPG pods may be evicted
+- **Network** — Caddy L4 proxy (`pg.ops.eblu.me`) may be misconfigured
+
+## Silencing
+
+For planned database maintenance:
+1. Grafana → Alerting → Silences → Create Silence
+2. Match `alertname = PostgresClusterUnhealthy`
+
+## Related
+
+- [[postgresql]] — CNPG cluster reference
+- [[deploy-infra-alerting]] — Alerting pipeline overview
--- a/docs/reference/operations/observability.md
+++ b/docs/reference/operations/observability.md
@@ -21,3 +21,5 @@ Metrics, logs, traces, and dashboards for BlumeOps infrastructure.

 - [[deploy-infra-alerting]] - Alerting pipeline (Grafana Unified Alerting → ntfy)
 - [[runbook-service-probe-failure]] - Service health check failure runbook
+- [[runbook-postgres-unhealthy]] - PostgreSQL cluster health runbook
+- [[runbook-pod-not-ready]] - Pod not ready runbook
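
Review note: both rules added to `alerting.yaml` share the same three-step shape: query A, reduce B (`last`), threshold C (`lt 1`), held for the rule's `for` period. The Python sketch below models that evaluation loop under stated assumptions; it is not Grafana's implementation. `RuleModel`, `evaluate`, and `pending_evals` are hypothetical names introduced for illustration. The sketch assumes a rule moves from Pending to Alerting once the condition has held for the full `for` duration, so with `interval: 60s` and `for: 3m` the fourth consecutive breaching evaluation fires. It also assumes that `noDataState: Alerting` treats an empty result like a breach that still respects the pending period.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class RuleModel:
    """Simplified, hypothetical stand-in for one rule in alerting.yaml."""
    threshold: float     # condition C: reduced value < threshold
    pending_evals: int   # `for` divided by the group interval (assumption)
    no_data_state: str   # "Alerting" or "NoData", as set in the rule


def evaluate(rule: RuleModel, samples: Sequence[Optional[float]]) -> List[str]:
    """Return the modeled rule state after each evaluation tick.

    samples[i] is the reduced (last) value at tick i, or None when the
    query returned no series at all.
    """
    states: List[str] = []
    breaching = 0  # consecutive evaluations in which the condition held
    for value in samples:
        if value is None and rule.no_data_state != "Alerting":
            # noDataState: NoData -> report NoData and reset the streak
            breaching = 0
            states.append("NoData")
            continue
        if value is None or value < rule.threshold:
            breaching += 1
        else:
            breaching = 0
        if breaching == 0:
            states.append("Normal")
        elif breaching > rule.pending_evals:
            states.append("Alerting")
        else:
            states.append("Pending")
    return states


if __name__ == "__main__":
    # PostgresClusterUnhealthy: lt 1, for 3m at a 60s interval
    postgres = RuleModel(threshold=1, pending_evals=3, no_data_state="Alerting")
    print(evaluate(postgres, [1, 0, 0, 0, 0, 1]))

    # PodNotReady: lt 1, for 5m at a 60s interval
    pod = RuleModel(threshold=1, pending_evals=5, no_data_state="NoData")
    print(evaluate(pod, [None, 0, 0]))
```

In this model the PostgreSQL rule turns Alerting on the fourth consecutive evaluation below 1 and returns to Normal as soon as the collector reports 1 again. The pod rule reports NoData whenever the `== 0` filter matches no series, which is what the query returns when every pod is ready.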