C2(deploy-infra-alerting): impl add textfile staleness and Frigate alerts

- TextfileStale: fires when a .prom textfile on indri hasn't been updated in 1 hour (node_textfile_mtime_seconds). Covers borgmatic, zot, minikube, jellyfin exporters.
- FrigateCameraDown: fires when frigate_camera_fps drops to 0 for 5m.
- Add runbooks for both alerts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-22 13:43:16 -07:00 · 2fa536e547
commit 2fa536e547
parent b2b0d6efa7
4 changed files with 214 additions and 0 deletions
--- a/argocd/manifests/grafana/alerting.yaml
+++ b/argocd/manifests/grafana/alerting.yaml
@@ -84,6 +84,121 @@ groups:
                    type: and
              refId: C

+  - orgId: 1
+    name: textfile-freshness
+    folder: Infrastructure Alerts
+    interval: 60s
+    rules:
+      - uid: textfile-stale
+        title: TextfileStale
+        condition: C
+        for: 15m
+        noDataState: Alerting
+        execErrState: Alerting
+        annotations:
+          summary: >-
+            Metrics textfile {{ index $labels "file" }} has not been updated in over 1 hour
+          runbook_url: https://docs.eblu.me/how-to/alerts/runbook-textfile-stale
+        labels:
+          severity: warning
+          service: indri-metrics
+        data:
+          - refId: A
+            datasourceUid: prometheus
+            relativeTimeRange:
+              from: 300
+              to: 0
+            model:
+              expr: >-
+                time() - node_textfile_mtime_seconds
+              interval: ""
+              refId: A
+          - refId: B
+            datasourceUid: "__expr__"
+            relativeTimeRange:
+              from: 0
+              to: 0
+            model:
+              type: reduce
+              expression: A
+              reducer: last
+              settings:
+                mode: dropNN
+              refId: B
+          - refId: C
+            datasourceUid: "__expr__"
+            relativeTimeRange:
+              from: 0
+              to: 0
+            model:
+              type: threshold
+              expression: B
+              conditions:
+                - evaluator:
+                    type: gt
+                    params:
+                      - 3600
+                  operator:
+                    type: and
+              refId: C
+
+  - orgId: 1
+    name: frigate-health
+    folder: Infrastructure Alerts
+    interval: 60s
+    rules:
+      - uid: frigate-camera-down
+        title: FrigateCameraDown
+        condition: C
+        for: 5m
+        noDataState: Alerting
+        execErrState: Alerting
+        annotations:
+          summary: >-
+            Frigate camera {{ index $labels "camera_name" }} has 0 FPS
+          runbook_url: https://docs.eblu.me/how-to/alerts/runbook-frigate-camera-down
+        labels:
+          severity: warning
+          service: frigate
+        data:
+          - refId: A
+            datasourceUid: prometheus
+            relativeTimeRange:
+              from: 300
+              to: 0
+            model:
+              expr: frigate_camera_fps
+              interval: ""
+              refId: A
+          - refId: B
+            datasourceUid: "__expr__"
+            relativeTimeRange:
+              from: 0
+              to: 0
+            model:
+              type: reduce
+              expression: A
+              reducer: last
+              settings:
+                mode: dropNN
+              refId: B
+          - refId: C
+            datasourceUid: "__expr__"
+            relativeTimeRange:
+              from: 0
+              to: 0
+            model:
+              type: threshold
+              expression: B
+              conditions:
+                - evaluator:
+                    type: lt
+                    params:
+                      - 1
+                  operator:
+                    type: and
+              refId: C
+
  - orgId: 1
    name: database-health
    folder: Infrastructure Alerts
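
Note on the two rule groups added in this hunk: both chain a `reduce` expression (`reducer: last`, `mode: dropNN`) into a `threshold` expression. The sketch below models one evaluation of that chain in Python, as an aid to reading the YAML. It is not Grafana's implementation. The function names are hypothetical, the `for:` pending period is ignored, and treating an empty reduction as "no data" (which then maps to `noDataState`) is an assumption of the sketch.

```python
import math
from typing import Optional, Sequence


def reduce_last_drop_nn(samples: Sequence[Optional[float]]) -> Optional[float]:
    """Model `reducer: last` with `mode: dropNN`.

    Non-numeric samples (None, NaN, +/-Inf) are dropped first. Returns None
    when nothing numeric remains (assumed here to mean "no data").
    """
    numeric = [s for s in samples if s is not None and math.isfinite(s)]
    return numeric[-1] if numeric else None


def evaluate(samples: Sequence[Optional[float]], evaluator: str, param: float,
             no_data_state: str = "Alerting") -> str:
    """Return the state one evaluation would produce (hypothetical helper)."""
    value = reduce_last_drop_nn(samples)
    if value is None:
        # Both rules in this commit set noDataState: Alerting.
        return no_data_state
    if evaluator == "gt":
        breached = value > param
    elif evaluator == "lt":
        breached = value < param
    else:
        raise ValueError(f"unsupported evaluator: {evaluator}")
    return "Alerting" if breached else "Normal"
```

For example, `evaluate([5.0, 5.0, 0.0], "lt", 1)` models FrigateCameraDown seeing a camera whose latest sample is 0 FPS, and `evaluate([], "lt", 1)` models the same rule when the query returns no series at all.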
--- a/docs/how-to/alerts/runbook-frigate-camera-down.md
+++ b/docs/how-to/alerts/runbook-frigate-camera-down.md
@@ -0,0 +1,39 @@
+---
+title: "Runbook: Frigate Camera Down"
+modified: 2026-03-22
+tags:
+  - how-to
+  - alerting
+  - runbook
+---
+
+# Runbook: Frigate Camera Down
+
+**Alert name:** `FrigateCameraDown`
+
+A Frigate camera has reported 0 FPS for 5+ minutes, meaning the camera feed is not being received.
+
+## Diagnostic Steps
+
+1. **Check Frigate UI** — https://nvr.ops.eblu.me — look at the camera thumbnail and status
+2. **Check Frigate API stats**:
+   ```fish
+   curl -s https://nvr.ops.eblu.me/api/stats | python3 -m json.tool
+   ```
+3. **Check Frigate pod logs** on ringtail:
+   ```fish
+   kubectl logs -n frigate -l app=frigate --context=k3s-ringtail --tail=30
+   ```
+4. **Check the camera itself** — verify it's powered on and network-connected. Try accessing the RTSP stream directly.
+
+## Common Causes
+
+- **Camera offline** — power outage, network issue, or camera crash
+- **NFS mount lost** — Frigate storage on sifaka; if the NFS mount drops, recording stops and FPS may drop
+- **Frigate pod restart** — during restart, camera FPS briefly drops to 0
+- **RTSP stream timeout** — camera firmware issue; power cycle the camera
+
+## Related
+
+- [[frigate]] — Frigate NVR reference
+- [[deploy-infra-alerting]] — Alerting pipeline overview
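
Note on diagnostic step 2 of the Frigate runbook: the stats output can be filtered down to the cameras that would trip the alert. The sketch below is hedged. It assumes the `/api/stats` response has a top-level `cameras` object keyed by camera name, each entry with a numeric `camera_fps` field; that shape is an assumption and may differ between Frigate versions. The function name is hypothetical.

```python
from typing import Any, Dict, List


def cameras_below_fps(stats: Dict[str, Any], min_fps: float = 1.0) -> List[str]:
    """List cameras whose reported FPS is below min_fps.

    ASSUMPTION: stats looks like {"cameras": {"<name>": {"camera_fps": <num>}}}.
    A camera with no camera_fps field is also reported, since a missing value
    is worth a look during diagnosis.
    """
    down = []
    for name, info in stats.get("cameras", {}).items():
        fps = info.get("camera_fps")
        if fps is None or fps < min_fps:
            down.append(name)
    return sorted(down)
```

To use it, parse the output of the `curl` command from step 2 with `json.loads` and pass the resulting dict in. The default `min_fps=1.0` matches the rule's `lt 1` threshold.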
--- a/docs/how-to/alerts/runbook-textfile-stale.md
+++ b/docs/how-to/alerts/runbook-textfile-stale.md
@@ -0,0 +1,58 @@
+---
+title: "Runbook: Textfile Stale"
+modified: 2026-03-22
+tags:
+  - how-to
+  - alerting
+  - runbook
+---
+
+# Runbook: Textfile Stale
+
+**Alert name:** `TextfileStale`
+
+A Prometheus textfile collector `.prom` file on indri has not been updated for over 1 hour, indicating the metrics exporter script has stopped running.
+
+## Affected Textfiles
+
+| File | LaunchAgent | What it monitors |
+|------|-------------|------------------|
+| `borgmatic.prom` | `mcquack.eblume.borgmatic` | Backup status |
+| `zot.prom` | `mcquack.eblume.zot` | Container registry |
+| `minikube.prom` | `mcquack.minikube-metrics` | Minikube cluster status |
+| `jellyfin.prom` | `mcquack.eblume.jellyfin-metrics` | Media server |
+
+## Diagnostic Steps
+
+1. **Check which file is stale** — the `file` label in the alert tells you. Verify on indri:
+   ```fish
+   ssh indri 'ls -la /opt/homebrew/var/node_exporter/textfile/'
+   ```
+
+2. **Check if the LaunchAgent is running**:
+   ```fish
+   ssh indri 'launchctl list | grep mcquack'
+   ```
+
+3. **Check LaunchAgent logs** (plist defines stdout/stderr paths):
+   ```fish
+   ssh indri 'cat ~/Library/Logs/mcquack/<agent-name>.log'
+   ```
+
+4. **Try running the exporter manually**:
+   ```fish
+   ssh indri 'cat ~/Library/LaunchAgents/mcquack.<agent>.plist'
+   # Find the ProgramArguments, run them manually
+   ```
+
+## Common Causes
+
+- **LaunchAgent not loaded** — `launchctl load ~/Library/LaunchAgents/mcquack.<agent>.plist`
+- **Script error** — the exporter script crashed; check logs
+- **Permissions** — the textfile directory is not writable
+- **Indri reboot** — some LaunchAgents may not auto-start
+
+## Related
+
+- [[alloy]] — Collects textfile metrics via `prometheus.exporter.unix`
+- [[deploy-infra-alerting]] — Alerting pipeline overview
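
Note on diagnostic step 1 of the textfile runbook: the alert compares the current time against each file's modification time, and the same check can be reproduced directly against the textfile directory. A minimal sketch, assuming it runs on the host that holds the `.prom` files; the function name and the `now` parameter (there only to make the check testable) are hypothetical.

```python
import time
from pathlib import Path
from typing import Dict, Optional


def stale_textfiles(directory: str, max_age_s: float = 3600.0,
                    now: Optional[float] = None) -> Dict[str, float]:
    """Map each stale *.prom file name to its age in seconds.

    Mirrors the alert condition: now - mtime > max_age_s.
    """
    now = time.time() if now is None else now
    stale = {}
    for path in sorted(Path(directory).glob("*.prom")):
        age = now - path.stat().st_mtime
        if age > max_age_s:
            stale[path.name] = age
    return stale
```

On indri this would be called with the directory from step 1, `/opt/homebrew/var/node_exporter/textfile/`. An empty result means every textfile was written within the last hour.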
--- a/docs/reference/operations/observability.md
+++ b/docs/reference/operations/observability.md
@@ -23,3 +23,5 @@ Metrics, logs, traces, and dashboards for BlumeOps infrastructure.
 - [[runbook-service-probe-failure]] - Service health check failure runbook
 - [[runbook-postgres-unhealthy]] - PostgreSQL cluster health runbook
 - [[runbook-pod-not-ready]] - Pod not ready runbook
+- [[runbook-textfile-stale]] - Metrics textfile freshness runbook
+- [[runbook-frigate-camera-down]] - Frigate camera health runbook