C2: Deploy infrastructure alerting pipeline #303
4 changed files with 214 additions and 0 deletions
C2(deploy-infra-alerting): impl add textfile staleness and Frigate alerts
- TextfileStale: fires when a .prom textfile on indri hasn't been updated in 1 hour (node_textfile_mtime_seconds). Covers borgmatic, zot, minikube, jellyfin exporters.
- FrigateCameraDown: fires when frigate_camera_fps drops to 0 for 5m.
- Add runbooks for both alerts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
commit
2fa536e547
@@ -84,6 +84,121 @@ groups:
                    type: and
              refId: C

  - orgId: 1
    name: textfile-freshness
    folder: Infrastructure Alerts
    interval: 60s
    rules:
      - uid: textfile-stale
        title: TextfileStale
        condition: C
        for: 15m
        noDataState: Alerting
        execErrState: Alerting
        annotations:
          summary: >-
            Metrics textfile {{ index $labels "file" }} has not been updated in over 1 hour
          runbook_url: https://docs.eblu.me/how-to/alerts/runbook-textfile-stale
        labels:
          severity: warning
          service: indri-metrics
        data:
          - refId: A
            datasourceUid: prometheus
            relativeTimeRange:
              from: 300
              to: 0
            model:
              expr: >-
                time() - node_textfile_mtime_seconds > 3600
              interval: ""
              refId: A
          - refId: B
            datasourceUid: "__expr__"
            relativeTimeRange:
              from: 0
              to: 0
            model:
              type: reduce
              expression: A
              reducer: last
              settings:
                mode: dropNN
              refId: B
          - refId: C
            datasourceUid: "__expr__"
            relativeTimeRange:
              from: 0
              to: 0
            model:
              type: threshold
              expression: B
              conditions:
                - evaluator:
                    type: gt
                    params:
                      - 0
                  operator:
                    type: and
              refId: C

  - orgId: 1
    name: frigate-health
    folder: Infrastructure Alerts
    interval: 60s
    rules:
      - uid: frigate-camera-down
        title: FrigateCameraDown
        condition: C
        for: 5m
        noDataState: Alerting
        execErrState: Alerting
        annotations:
          summary: >-
            Frigate camera {{ index $labels "camera_name" }} has 0 FPS
          runbook_url: https://docs.eblu.me/how-to/alerts/runbook-frigate-camera-down
        labels:
          severity: warning
          service: frigate
        data:
          - refId: A
            datasourceUid: prometheus
            relativeTimeRange:
              from: 300
              to: 0
            model:
              expr: frigate_camera_fps
              interval: ""
              refId: A
          - refId: B
            datasourceUid: "__expr__"
            relativeTimeRange:
              from: 0
              to: 0
            model:
              type: reduce
              expression: A
              reducer: last
              settings:
                mode: dropNN
              refId: B
          - refId: C
            datasourceUid: "__expr__"
            relativeTimeRange:
              from: 0
              to: 0
            model:
              type: threshold
              expression: B
              conditions:
                - evaluator:
                    type: lt
                    params:
                      - 1
                  operator:
                    type: and
              refId: C

  - orgId: 1
    name: database-health
    folder: Infrastructure Alerts
39
docs/how-to/alerts/runbook-frigate-camera-down.md
Normal file
@@ -0,0 +1,39 @@
---
title: "Runbook: Frigate Camera Down"
modified: 2026-03-22
tags:
  - how-to
  - alerting
  - runbook
---

# Runbook: Frigate Camera Down

**Alert name:** `FrigateCameraDown`

A Frigate camera has reported 0 FPS for 5+ minutes, meaning the camera feed is not being received.

## Diagnostic Steps

1. **Check Frigate UI** — https://nvr.ops.eblu.me — look at the camera thumbnail and status
2. **Check Frigate API stats**:
   ```fish
   curl -s https://nvr.ops.eblu.me/api/stats | python3 -m json.tool
   ```
3. **Check Frigate pod logs** on ringtail:
   ```fish
   kubectl logs -n frigate -l app=frigate --context=k3s-ringtail --tail=30
   ```
4. **Check the camera itself** — verify it's powered on and network-connected. Try accessing the RTSP stream directly.

## Common Causes

- **Camera offline** — power outage, network issue, or camera crash
- **NFS mount lost** — Frigate storage on sifaka; if the NFS mount drops, recording stops and FPS may drop
- **Frigate pod restart** — during restart, camera FPS briefly drops to 0
- **RTSP stream timeout** — camera firmware issue; power cycle the camera

## Related

- [[frigate]] — Frigate NVR reference
- [[deploy-infra-alerting]] — Alerting pipeline overview
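
As a rough illustration of the condition behind `FrigateCameraDown`, the Python sketch below flags cameras whose reported FPS is below 1, the same threshold the rule applies to the last value of `frigate_camera_fps`. It assumes the `/api/stats` response has a top-level `cameras` object whose entries carry a `camera_fps` field. That layout, the helper name `cameras_below_fps`, and the sample payload are assumptions for illustration rather than anything taken from this change, so check them against the actual payload first.

```python
import json


def cameras_below_fps(stats: dict, min_fps: float = 1.0) -> list:
    """Return the names of cameras reporting fewer than min_fps frames per second.

    Assumes stats["cameras"][name]["camera_fps"] holds the current FPS
    (hypothetical layout; verify it against your Frigate version).
    """
    cameras = stats.get("cameras", {})
    return sorted(
        name
        for name, camera in cameras.items()
        # A missing or null field is treated as 0 FPS so the camera is still flagged.
        if float(camera.get("camera_fps") or 0.0) < min_fps
    )


# Made-up sample payload standing in for the JSON returned by the stats endpoint.
sample = json.loads(
    '{"cameras": {"front_door": {"camera_fps": 5.0}, "garage": {"camera_fps": 0.0}}}'
)
down = cameras_below_fps(sample)
print(down)
```

To try it on live data, the output of the `curl` command from step 2 could be saved to a file and loaded with `json.load` in place of the sample payload.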
58
docs/how-to/alerts/runbook-textfile-stale.md
Normal file
@@ -0,0 +1,58 @@
---
title: "Runbook: Textfile Stale"
modified: 2026-03-22
tags:
  - how-to
  - alerting
  - runbook
---

# Runbook: Textfile Stale

**Alert name:** `TextfileStale`

A Prometheus textfile collector `.prom` file on indri has not been updated for over 1 hour, indicating the metrics exporter script has stopped running.

## Affected Textfiles

| File | LaunchAgent | What it monitors |
|------|-------------|------------------|
| `borgmatic.prom` | `mcquack.eblume.borgmatic` | Backup status |
| `zot.prom` | `mcquack.eblume.zot` | Container registry |
| `minikube.prom` | `mcquack.minikube-metrics` | Minikube cluster status |
| `jellyfin.prom` | `mcquack.eblume.jellyfin-metrics` | Media server |

## Diagnostic Steps

1. **Check which file is stale** — the `file` label in the alert tells you. Verify on indri:
   ```fish
   ssh indri 'ls -la /opt/homebrew/var/node_exporter/textfile/'
   ```

2. **Check if the LaunchAgent is running**:
   ```fish
   ssh indri 'launchctl list | grep mcquack'
   ```

3. **Check LaunchAgent logs** (plist defines stdout/stderr paths):
   ```fish
   ssh indri 'cat ~/Library/Logs/mcquack/<agent-name>.log'
   ```

4. **Try running the exporter manually**:
   ```fish
   ssh indri 'cat ~/Library/LaunchAgents/mcquack.<agent>.plist'
   # Find the ProgramArguments, run them manually
   ```

## Common Causes

- **LaunchAgent not loaded** — `launchctl load ~/Library/LaunchAgents/mcquack.<agent>.plist`
- **Script error** — the exporter script crashed; check logs
- **Permissions** — the textfile directory is not writable
- **Indri reboot** — some LaunchAgents may not auto-start

## Related

- [[alloy]] — Collects textfile metrics via `prometheus.exporter.unix`
- [[deploy-infra-alerting]] — Alerting pipeline overview
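
The staleness condition can also be checked directly on the host. The Python sketch below mirrors the rule's `time() - node_textfile_mtime_seconds > 3600` expression by comparing each `.prom` file's modification time with the current time. Like the rule's filter, it returns only offenders, so an empty result means every file is fresh. The function name `stale_textfiles` and the temporary demo directory are illustrative assumptions; on indri the directory would be the textfile path from step 1.

```python
import os
import tempfile
import time
from pathlib import Path


def stale_textfiles(directory, max_age_seconds=3600, now=None):
    """Map each stale .prom file name in directory to its age in seconds.

    A file counts as stale when (now - mtime) exceeds max_age_seconds.
    """
    now = time.time() if now is None else now
    stale = {}
    for path in sorted(Path(directory).glob("*.prom")):
        age = now - path.stat().st_mtime
        if age > max_age_seconds:
            stale[path.name] = age
    return stale


# Demo in a throwaway directory: one fresh file, one backdated by two hours.
with tempfile.TemporaryDirectory() as tmp:
    fresh = Path(tmp) / "fresh.prom"
    old = Path(tmp) / "old.prom"
    fresh.write_text("")
    old.write_text("")
    two_hours_ago = time.time() - 7200
    os.utime(old, (two_hours_ago, two_hours_ago))
    demo_result = stale_textfiles(tmp)

print(sorted(demo_result))
```

Passing `now` explicitly keeps the function easy to test without touching the clock; the one-hour default matches the threshold in the rule, not the additional 15-minute `for` period.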
@@ -23,3 +23,5 @@ Metrics, logs, traces, and dashboards for BlumeOps infrastructure.
- [[runbook-service-probe-failure]] - Service health check failure runbook
- [[runbook-postgres-unhealthy]] - PostgreSQL cluster health runbook
- [[runbook-pod-not-ready]] - Pod not ready runbook
- [[runbook-textfile-stale]] - Metrics textfile freshness runbook
- [[runbook-frigate-camera-down]] - Frigate camera health runbook