C2(deploy-infra-alerting): impl add textfile staleness and Frigate alerts

- TextfileStale: fires when a .prom textfile on indri hasn't been
  updated in 1 hour (node_textfile_mtime_seconds). Covers borgmatic,
  zot, minikube, jellyfin exporters.
- FrigateCameraDown: fires when frigate_camera_fps drops to 0 for 5m.
- Add runbooks for both alerts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Erich Blume 2026-03-22 13:43:16 -07:00
commit 2fa536e547
4 changed files with 214 additions and 0 deletions

@@ -0,0 +1,39 @@
---
title: "Runbook: Frigate Camera Down"
modified: 2026-03-22
tags:
- how-to
- alerting
- runbook
---
# Runbook: Frigate Camera Down
**Alert name:** `FrigateCameraDown`
A Frigate camera has reported 0 FPS for 5+ minutes, meaning the camera feed is not being received.
## Diagnostic Steps
1. **Check Frigate UI** — https://nvr.ops.eblu.me — look at the camera thumbnail and status
2. **Check Frigate API stats**:
```fish
curl -s https://nvr.ops.eblu.me/api/stats | python3 -m json.tool
```
3. **Check Frigate pod logs** on ringtail:
```fish
kubectl logs -n frigate -l app=frigate --context=k3s-ringtail --tail=30
```
4. **Check the camera itself** — verify it's powered on and network-connected. Try accessing the RTSP stream directly.
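The stats output from step 2 can be long, so a small filter helps pick out the affected camera. The following is a minimal sketch, not part of the deployed tooling. It assumes the `/api/stats` response nests per-camera data under a top-level `cameras` key with a `camera_fps` field; verify this against the Frigate version actually running. The function name `stalled_cameras` and the sample camera names are hypothetical.

```python
from typing import List


def stalled_cameras(stats: dict) -> List[str]:
    """Return the names of cameras reporting 0 FPS (or no FPS value at all)."""
    # Assumption: cameras live under a top-level "cameras" key, each entry
    # carrying a "camera_fps" number. Adjust if the response shape differs.
    cameras = stats.get("cameras", {})
    return sorted(
        name for name, info in cameras.items()
        if not info.get("camera_fps")  # 0, 0.0, and missing all count as stalled
    )


# Hypothetical sample shaped like the assumed response, for illustration only.
sample = {
    "cameras": {
        "front_door": {"camera_fps": 5.0},
        "driveway": {"camera_fps": 0.0},
    }
}
print(stalled_cameras(sample))  # ['driveway']
```

To use it against the live endpoint, parse the output of the `curl` command from step 2 with `json.loads` and pass the resulting dict to `stalled_cameras`.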
## Common Causes
- **Camera offline** — power outage, network issue, or camera crash
- **NFS mount lost** — Frigate storage on sifaka; if the NFS mount drops, recording stops and FPS may drop
- **Frigate pod restart** — during restart, camera FPS briefly drops to 0
- **RTSP stream timeout** — camera firmware issue; power cycle the camera
## Related
- [[frigate]] — Frigate NVR reference
- [[deploy-infra-alerting]] — Alerting pipeline overview

@@ -0,0 +1,58 @@
---
title: "Runbook: Textfile Stale"
modified: 2026-03-22
tags:
- how-to
- alerting
- runbook
---
# Runbook: Textfile Stale
**Alert name:** `TextfileStale`
A Prometheus textfile collector `.prom` file on indri has not been updated for over 1 hour, which usually means the exporter script that writes it has stopped running or is failing before it can write.
## Affected Textfiles
| File | LaunchAgent | What it monitors |
|------|-------------|------------------|
| `borgmatic.prom` | `mcquack.eblume.borgmatic` | Backup status |
| `zot.prom` | `mcquack.eblume.zot` | Container registry |
| `minikube.prom` | `mcquack.minikube-metrics` | Minikube cluster status |
| `jellyfin.prom` | `mcquack.eblume.jellyfin-metrics` | Media server |
## Diagnostic Steps
1. **Check which file is stale** — the `file` label in the alert tells you. Verify on indri:
```fish
ssh indri 'ls -la /opt/homebrew/var/node_exporter/textfile/'
```
2. **Check if the LaunchAgent is running**:
```fish
ssh indri 'launchctl list | grep mcquack'
```
3. **Check LaunchAgent logs** (plist defines stdout/stderr paths):
```fish
ssh indri 'cat ~/Library/Logs/mcquack/<agent-name>.log'
```
4. **Try running the exporter manually**:
```fish
ssh indri 'cat ~/Library/LaunchAgents/mcquack.<agent>.plist'
# Find ProgramArguments in the output, then run that exact command by hand on indri and watch for errors
```
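The check in step 1 can also be scripted, so the stale files are listed directly instead of being read off the `ls` timestamps. The following is a minimal sketch, not the alert's actual implementation. It assumes it runs locally on indri, and it reuses the 1-hour threshold from the alert description. The function name `stale_textfiles` and its parameters are hypothetical.

```python
import os
import time
from typing import List, Optional


def stale_textfiles(
    directory: str,
    max_age_seconds: float = 3600,
    now: Optional[float] = None,
) -> List[str]:
    """Return sorted names of .prom files not modified within max_age_seconds."""
    # Allow the caller to pin "now" so the check is reproducible.
    now = time.time() if now is None else now
    stale = []
    for entry in os.scandir(directory):
        # Only regular files with the .prom suffix are read by the collector.
        if entry.is_file() and entry.name.endswith(".prom"):
            age = now - entry.stat().st_mtime
            if age > max_age_seconds:
                stale.append(entry.name)
    return sorted(stale)
```

On indri this could be called as `stale_textfiles("/opt/homebrew/var/node_exporter/textfile")`. An empty list means every textfile was written within the last hour.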
## Common Causes
- **LaunchAgent not loaded**: load it with `launchctl load ~/Library/LaunchAgents/mcquack.<agent>.plist`
- **Script error** — the exporter script crashed; check logs
- **Permissions** — the textfile directory is not writable
- **Indri reboot** — some LaunchAgents may not auto-start
## Related
- [[alloy]] — Collects textfile metrics via `prometheus.exporter.unix`
- [[deploy-infra-alerting]] — Alerting pipeline overview

@@ -23,3 +23,5 @@ Metrics, logs, traces, and dashboards for BlumeOps infrastructure.
- [[runbook-service-probe-failure]] - Service health check failure runbook
- [[runbook-postgres-unhealthy]] - PostgreSQL cluster health runbook
- [[runbook-pod-not-ready]] - Pod not ready runbook
- [[runbook-textfile-stale]] - Metrics textfile freshness runbook
- [[runbook-frigate-camera-down]] - Frigate camera health runbook