diff --git a/argocd/manifests/grafana/alerting.yaml b/argocd/manifests/grafana/alerting.yaml
index 3714bf1..dfcc5a3 100644
--- a/argocd/manifests/grafana/alerting.yaml
+++ b/argocd/manifests/grafana/alerting.yaml
@@ -84,6 +84,121 @@ groups:
                     type: and
               refId: C
 
+  - orgId: 1
+    name: textfile-freshness
+    folder: Infrastructure Alerts
+    interval: 60s
+    rules:
+      - uid: textfile-stale
+        title: TextfileStale
+        condition: C
+        for: 15m
+        noDataState: Alerting
+        execErrState: Alerting
+        annotations:
+          summary: >-
+            Metrics textfile {{ index $labels "file" }} has not been updated in over 1 hour
+          runbook_url: https://docs.eblu.me/how-to/alerts/runbook-textfile-stale
+        labels:
+          severity: warning
+          service: indri-metrics
+        data:
+          - refId: A
+            datasourceUid: prometheus
+            relativeTimeRange:
+              from: 300
+              to: 0
+            model:
+              expr: >-
+                time() - node_textfile_mtime_seconds > 3600
+              interval: ""
+              refId: A
+          - refId: B
+            datasourceUid: "__expr__"
+            relativeTimeRange:
+              from: 0
+              to: 0
+            model:
+              type: reduce
+              expression: A
+              reducer: last
+              settings:
+                mode: dropNN
+              refId: B
+          - refId: C
+            datasourceUid: "__expr__"
+            relativeTimeRange:
+              from: 0
+              to: 0
+            model:
+              type: threshold
+              expression: B
+              conditions:
+                - evaluator:
+                    type: gt
+                    params:
+                      - 0
+                  operator:
+                    type: and
+              refId: C
+
+  - orgId: 1
+    name: frigate-health
+    folder: Infrastructure Alerts
+    interval: 60s
+    rules:
+      - uid: frigate-camera-down
+        title: FrigateCameraDown
+        condition: C
+        for: 5m
+        noDataState: Alerting
+        execErrState: Alerting
+        annotations:
+          summary: >-
+            Frigate camera {{ index $labels "camera_name" }} has 0 FPS
+          runbook_url: https://docs.eblu.me/how-to/alerts/runbook-frigate-camera-down
+        labels:
+          severity: warning
+          service: frigate
+        data:
+          - refId: A
+            datasourceUid: prometheus
+            relativeTimeRange:
+              from: 300
+              to: 0
+            model:
+              expr: frigate_camera_fps
+              interval: ""
+              refId: A
+          - refId: B
+            datasourceUid: "__expr__"
+            relativeTimeRange:
+              from: 0
+              to: 0
+            model:
+              type: reduce
+              expression: A
+              reducer: last
+              settings:
+                mode: dropNN
+              refId: B
+          - refId: C
+            datasourceUid: "__expr__"
+            relativeTimeRange:
+              from: 0
+              to: 0
+            model:
+              type: threshold
+              expression: B
+              conditions:
+                - evaluator:
+                    type: lt
+                    params:
+                      - 1
+                  operator:
+                    type: and
+              refId: C
+
   - orgId: 1
     name: database-health
     folder: Infrastructure Alerts
diff --git a/docs/how-to/alerts/runbook-frigate-camera-down.md b/docs/how-to/alerts/runbook-frigate-camera-down.md
new file mode 100644
index 0000000..ea04e79
--- /dev/null
+++ b/docs/how-to/alerts/runbook-frigate-camera-down.md
@@ -0,0 +1,39 @@
+---
+title: "Runbook: Frigate Camera Down"
+modified: 2026-03-22
+tags:
+  - how-to
+  - alerting
+  - runbook
+---
+
+# Runbook: Frigate Camera Down
+
+**Alert name:** `FrigateCameraDown`
+
+A Frigate camera has reported 0 FPS for 5+ minutes, meaning the camera feed is not being received.
+
+## Diagnostic Steps
+
+1. **Check Frigate UI** — https://nvr.ops.eblu.me — look at the camera thumbnail and status
+2. **Check Frigate API stats**:
+   ```fish
+   curl -s https://nvr.ops.eblu.me/api/stats | python3 -m json.tool
+   ```
+3. **Check Frigate pod logs** on ringtail:
+   ```fish
+   kubectl logs -n frigate -l app=frigate --context=k3s-ringtail --tail=30
+   ```
+4. **Check the camera itself** — verify it's powered on and network-connected. Try accessing the RTSP stream directly.
+
+## Common Causes
+
+- **Camera offline** — power outage, network issue, or camera crash
+- **NFS mount lost** — Frigate storage on sifaka; if the NFS mount drops, recording stops and FPS may drop
+- **Frigate pod restart** — during restart, camera FPS briefly drops to 0
+- **RTSP stream timeout** — camera firmware issue; power cycle the camera
+
+## Related
+
+- [[frigate]] — Frigate NVR reference
+- [[deploy-infra-alerting]] — Alerting pipeline overview
diff --git a/docs/how-to/alerts/runbook-textfile-stale.md b/docs/how-to/alerts/runbook-textfile-stale.md
new file mode 100644
index 0000000..2a70adf
--- /dev/null
+++ b/docs/how-to/alerts/runbook-textfile-stale.md
@@ -0,0 +1,58 @@
+---
+title: "Runbook: Textfile Stale"
+modified: 2026-03-22
+tags:
+  - how-to
+  - alerting
+  - runbook
+---
+
+# Runbook: Textfile Stale
+
+**Alert name:** `TextfileStale`
+
+A Prometheus textfile collector `.prom` file on indri has not been updated for over 1 hour, indicating the metrics exporter script has stopped running.
+
+## Affected Textfiles
+
+| File | LaunchAgent | What it monitors |
+|------|-------------|------------------|
+| `borgmatic.prom` | `mcquack.eblume.borgmatic` | Backup status |
+| `zot.prom` | `mcquack.eblume.zot` | Container registry |
+| `minikube.prom` | `mcquack.minikube-metrics` | Minikube cluster status |
+| `jellyfin.prom` | `mcquack.eblume.jellyfin-metrics` | Media server |
+
+## Diagnostic Steps
+
+1. **Check which file is stale** — the `file` label in the alert tells you. Verify on indri:
+   ```fish
+   ssh indri 'ls -la /opt/homebrew/var/node_exporter/textfile/'
+   ```
+
+2. **Check if the LaunchAgent is running**:
+   ```fish
+   ssh indri 'launchctl list | grep mcquack'
+   ```
+
+3. **Check LaunchAgent logs** (plist defines stdout/stderr paths):
+   ```fish
+   ssh indri 'cat ~/Library/Logs/mcquack/.log'
+   ```
+
+4. **Try running the exporter manually**:
+   ```fish
+   ssh indri 'cat ~/Library/LaunchAgents/mcquack..plist'
+   # Find the ProgramArguments, run them manually
+   ```
+
+## Common Causes
+
+- **LaunchAgent not loaded** — `launchctl load ~/Library/LaunchAgents/mcquack..plist`
+- **Script error** — the exporter script crashed; check logs
+- **Permissions** — the textfile directory is not writable
+- **Indri reboot** — some LaunchAgents may not auto-start
+
+## Related
+
+- [[alloy]] — Collects textfile metrics via `prometheus.exporter.unix`
+- [[deploy-infra-alerting]] — Alerting pipeline overview
diff --git a/docs/reference/operations/observability.md b/docs/reference/operations/observability.md
index 1aae7b9..9d4a7a0 100644
--- a/docs/reference/operations/observability.md
+++ b/docs/reference/operations/observability.md
@@ -23,3 +23,5 @@ Metrics, logs, traces, and dashboards for BlumeOps infrastructure.
 - [[runbook-service-probe-failure]] - Service health check failure runbook
 - [[runbook-postgres-unhealthy]] - PostgreSQL cluster health runbook
 - [[runbook-pod-not-ready]] - Pod not ready runbook
+- [[runbook-textfile-stale]] - Metrics textfile freshness runbook
+- [[runbook-frigate-camera-down]] - Frigate camera health runbook
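
Reviewer sketch (not part of the change): the `TextfileStale` rule compares `time() - node_textfile_mtime_seconds` against 3600 seconds. The same check can be approximated locally when working through the runbook's first diagnostic step. The function name `stale_textfiles` and its parameters are hypothetical, introduced here only for illustration. The snippet assumes the textfile directory is readable from wherever it runs and that only `*.prom` files matter.

```python
import time
from pathlib import Path


def stale_textfiles(directory, max_age_seconds=3600, now=None):
    """Return {filename: age_in_seconds} for .prom files older than max_age_seconds.

    Hypothetical helper mirroring the alert's PromQL comparison
    (time() - mtime > threshold); it is not part of the repository.
    """
    # Allow the caller to pin "now" so results are reproducible.
    now = time.time() if now is None else now
    stale = {}
    for path in sorted(Path(directory).glob("*.prom")):
        age = now - path.stat().st_mtime
        if age > max_age_seconds:
            stale[path.name] = age
    return stale
```

Calling `stale_textfiles("/opt/homebrew/var/node_exporter/textfile/")` on indri (the directory named in the runbook) would list any file the alert should currently be flagging; an empty result suggests every exporter has written within the last hour.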