C2(deploy-infra-alerting): impl add first alert rule and runbook

- Add ServiceProbeFailure alert rule to Grafana alerting provisioning
  - Queries probe_success metric from Alloy blackbox exporter
  - Extracts service name from job label via label_replace
  - Fires after 2 minutes of failure, noDataState=Alerting
  - Annotations include summary with service name and runbook URL
- Add runbook at docs/how-to/alerts/runbook-service-probe-failure.md
  - Covers all 5 probed services (miniflux, kiwix, transmission, devpi, argocd)
  - Diagnostic steps, common causes, silencing instructions
- Add alerting section to observability.md reference doc

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Erich Blume 2026-03-22 10:57:23 -07:00
commit 549c57ab82
3 changed files with 130 additions and 1 deletions

@@ -26,6 +26,55 @@ policies:
group_interval: 12h
repeat_interval: 24h
groups:
- orgId: 1
name: service-health
folder: Infrastructure Alerts
interval: 30s
rules:
- uid: service-probe-failure
title: ServiceProbeFailure
condition: B
for: 2m
noDataState: Alerting
execErrState: Alerting
annotations:
summary: >-
{{ index $labels "service" }} health check is failing
runbook_url: https://docs.eblu.me/how-to/alerts/runbook-service-probe-failure
labels:
severity: warning
data:
- refId: A
datasourceUid: prometheus
relativeTimeRange:
from: 300
to: 0
model:
expr: >-
label_replace(probe_success, "service",
"$1", "job", "integrations/blackbox/(.*)")
interval: ""
refId: A
- refId: B
datasourceUid: "__expr__"
relativeTimeRange:
from: 0
to: 0
model:
type: threshold
expression: A
conditions:
- evaluator:
type: lt
params:
- 1
operator:
type: and
reducer:
type: last
refId: B
templates:
- orgId: 1
name: ntfy-infra

@@ -0,0 +1,75 @@
---
title: "Runbook: Service Probe Failure"
modified: 2026-03-22
tags:
- how-to
- alerting
- runbook
---
# Runbook: Service Probe Failure
**Alert name:** `ServiceProbeFailure`
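A blackbox HTTP health check has failed for 2+ minutes, meaning a service is not responding on its health endpoint. Because the rule sets `noDataState: Alerting` and `execErrState: Alerting`, missing probe data or a query error also triggers this alert.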
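To make the 2+ minute window concrete, here is a minimal Python sketch of how a `for: 2m` hold period interacts with the 30s evaluation interval of the rule group. It is a simplified model and not Grafana's actual state machine: the function name `first_firing_time` and the sample lists are hypothetical, and no-data and error states are ignored.

```python
def first_firing_time(samples, interval_s=30, for_s=120):
    """Return the time in seconds at which the alert would fire, or None.

    samples holds one probe_success value (1 = up, 0 = down) per evaluation.
    """
    pending_since = None
    for i, value in enumerate(samples):
        now = i * interval_s
        if value < 1:
            # Condition B holds (value < 1): start or continue the pending period.
            if pending_since is None:
                pending_since = now
            if now - pending_since >= for_s:
                return now
        else:
            # A single healthy evaluation resets the pending period.
            pending_since = None
    return None


# Healthy at t=0, failing from t=30 onward: fires once 120s of failure have elapsed.
print(first_firing_time([1, 0, 0, 0, 0, 0]))  # prints 150
# A brief blip that recovers never fires.
print(first_firing_time([0, 0, 0, 1, 0]))  # prints None
```

Under this model a service has to fail five consecutive evaluations before the alert fires, so short restarts should not page.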
## Affected Services
This alert covers services probed by the Alloy blackbox exporter on indri's minikube cluster:
| Service | Health Endpoint |
|---------|----------------|
| miniflux | `/healthcheck` |
| kiwix | `/` |
| transmission | `/transmission/web/` |
| devpi | `/+api` |
| argocd | `/healthz` |
The failing service is identified by the `service` label in the alert (extracted from the `job` label).
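As a rough illustration of that extraction, the Python sketch below mimics what `label_replace` does to a single series. It is a simplified model rather than Prometheus's implementation, and the job value `integrations/blackbox/miniflux` is an assumed example of what the exporter emits; only the regex `integrations/blackbox/(.*)` comes from the alert rule.

```python
import re


def label_replace(labels, dst, replacement, src, regex):
    """Simplified model of PromQL label_replace for one series."""
    # Prometheus anchors the regex, so it must match the whole source value.
    match = re.fullmatch(regex, labels.get(src, ""))
    if match is None:
        # No match: the series is returned unchanged, without the new label.
        return dict(labels)
    # Translate Prometheus-style $1 references into Python's \g<1> form.
    template = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
    result = dict(labels)
    result[dst] = match.expand(template)
    return result


# Assumed job label value, for illustration only.
series = {"__name__": "probe_success", "job": "integrations/blackbox/miniflux"}
labeled = label_replace(series, "service", "$1", "job", "integrations/blackbox/(.*)")
print(labeled["service"])  # prints "miniflux"
```

A series whose `job` does not match the pattern gets no `service` label, which would leave the alert summary without a service name.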
## Diagnostic Steps
1. **Check which service is down** — the alert label `service` tells you. You can also run:
```fish
kubectl get pods -n <namespace> --context=minikube-indri
```
2. **Check pod status** — look for CrashLoopBackOff, OOMKilled, or pending pods:
```fish
kubectl describe pod -n <namespace> <pod-name> --context=minikube-indri
```
3. **Check pod logs**:
```fish
kubectl logs -n <namespace> <pod-name> --context=minikube-indri --tail=50
```
4. **Check if minikube itself is healthy**:
```fish
ssh indri 'minikube status'
```
5. **Check NFS mounts** (kiwix, transmission depend on sifaka NFS):
```fish
ssh indri 'df -h | grep Volumes'
```
## Common Causes
- **Pod crashed** — check logs, restart with `kubectl delete pod`
- **NFS mount lost** — sifaka offline or AutoMounter not running. SSH to indri and check `/Volumes/`
- **Resource exhaustion** — check `kubectl top pods -n <namespace>` for memory/CPU pressure
- **Minikube paused/stopped**: check `ssh indri 'minikube status'`, restart if needed
## Silencing
For planned maintenance, silence this alert in Grafana:
1. Go to Alerting → Silences → Create Silence
2. Match label `alertname = ServiceProbeFailure`
3. Optionally match `service = <specific-service>` to silence only one
4. Set duration for your maintenance window
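If you would rather script a silence than click through the UI, the sketch below builds a payload with the same matchers as the steps above. The field names follow the Alertmanager v2 silence schema. The helper name `build_silence`, the comment text, and the idea of posting it to Grafana's Alertmanager-compatible endpoint (commonly `/api/alertmanager/grafana/api/v2/silences`) are assumptions to verify against your Grafana version before use.

```python
import json
from datetime import datetime, timedelta, timezone


def build_silence(service=None, hours=2, created_by="runbook", now=None):
    """Build a silence payload for ServiceProbeFailure (hypothetical helper)."""
    start = now or datetime.now(timezone.utc)
    matchers = [
        {"name": "alertname", "value": "ServiceProbeFailure",
         "isRegex": False, "isEqual": True},
    ]
    if service is not None:
        # Optional second matcher restricts the silence to one service.
        matchers.append(
            {"name": "service", "value": service,
             "isRegex": False, "isEqual": True}
        )
    return {
        "matchers": matchers,
        "startsAt": start.isoformat(),
        "endsAt": (start + timedelta(hours=hours)).isoformat(),
        "createdBy": created_by,
        "comment": "Planned maintenance",
    }


# Print a two-hour silence for a single service as JSON.
print(json.dumps(build_silence(service="kiwix"), indent=2))
```

The sketch only constructs the JSON body; sending it, including authentication, is left out on purpose.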
## Related
- [[deploy-infra-alerting]] — Alerting pipeline overview
- [[configure-grafana-alerting-pipeline]] — Pipeline configuration

@@ -1,6 +1,6 @@
---
title: Observability
-modified: 2026-02-07
+modified: 2026-03-22
tags:
- operations
---
@@ -16,3 +16,8 @@ Metrics, logs, traces, and dashboards for BlumeOps infrastructure.
- [[tempo]] - Distributed tracing
- [[alloy|Alloy]] - Metrics, log, and trace collection
- [[grafana]] - Dashboards and visualization
## Alerting
- [[deploy-infra-alerting]] - Alerting pipeline (Grafana Unified Alerting → ntfy)
- [[runbook-service-probe-failure]] - Service health check failure runbook