Remove all Mikado frontmatter (status, branch, requires) from chain
cards. Rename docs/how-to/alerts/ to docs/how-to/runbooks/ and update
all runbook_url references. Add changelog fragment.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Query the age of every textfile (time() - node_textfile_mtime_seconds)
and apply the > 3600s limit in the threshold step, instead of filtering
with > 3600 in the query itself, which returns empty results when
everything is fresh.
This means:
- Fresh textfiles: query returns low values, threshold not met → OK
- Stale textfiles: query returns high values, threshold met → Alerting
- Missing textfiles: series vanishes, noDataState=Alerting → Alerting
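The three cases above can be sketched in Python as a toy model of how a
threshold step and noDataState interact. This is not Grafana's API: the
function name, the dict input, and the example file names are
illustrative assumptions.

```python
STALE_AFTER_SECONDS = 3600  # limit taken from the commit message


def evaluate_textfile_rule(mtimes, now, no_data_state="Alerting"):
    """Toy model of the rule: one alert state per textfile series.

    mtimes maps a (hypothetical) textfile name to its
    node_textfile_mtime_seconds value. An empty mapping models the
    series vanishing entirely.
    """
    if not mtimes:
        # No series at all: the rule falls back to the configured noDataState.
        return {"<no data>": no_data_state}
    states = {}
    for name, mtime in mtimes.items():
        age = now - mtime  # time() - node_textfile_mtime_seconds
        states[name] = "Alerting" if age > STALE_AFTER_SECONDS else "OK"
    return states
```

Because the query always returns a value while the file exists, "no data"
can only mean the file is gone, which is why Alerting is a safe
noDataState here.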
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Same pattern as PodNotReady: when no textfiles are stale, the query
returns no data. noDataState=Alerting incorrectly treats this as a
problem. Changed to OK.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Grafana's k8s Service maps port 80 → container port 3000. The
blackbox probe was targeting port 3000 directly on the Service
ClusterIP, which the Service does not expose, so the connection was
refused.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add ArgoCD metrics scrape target to Prometheus (argocd-metrics:8082)
- Add ArgoCDAppOutOfSync alert: fires when argocd_app_info has
  sync_status != Synced for 30 minutes
- Add runbook with diagnostic steps and common fixes
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- TextfileStale: fires when a .prom textfile on indri hasn't been
  updated in 1 hour (node_textfile_mtime_seconds). Covers borgmatic,
  zot, minikube, jellyfin exporters.
- FrigateCameraDown: fires when frigate_camera_fps drops to 0 for 5m.
- Add runbooks for both alerts.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When no pods are unhealthy, the query returns no results and the rule
enters the noData state. With noDataState set to NoData, Grafana fires
an alert with empty labels ("Pod in is not ready"). Change to OK, since
no results means everything is healthy.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CronJob pods (e.g., zim-watcher) are expected to complete and become
not-ready. Exclude them with `unless on (namespace, pod) kube_pod_owner{owner_kind="Job"}`.
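As a rough illustration of what the `unless on (namespace, pod)` clause
does, the Python sketch below models it as a set difference on the two
matching labels. The function name, the label dicts, and the namespaces
are hypothetical; only the matching behaviour is the point.

```python
def exclude_job_pods(not_ready, job_owned):
    """Model `not_ready unless on (namespace, pod) job_owned`.

    Both arguments are lists of label dicts. A not-ready series is dropped
    when any job-owned series carries the same (namespace, pod) pair;
    all other labels are ignored for matching.
    """
    job_keys = {(s["namespace"], s["pod"]) for s in job_owned}
    return [s for s in not_ready if (s["namespace"], s["pod"]) not in job_keys]


# Hypothetical series: one completed CronJob pod and one stuck Deployment pod.
not_ready = [
    {"namespace": "kiwix", "pod": "zim-watcher-123"},
    {"namespace": "miniflux", "pod": "miniflux-abc"},
]
job_owned = [
    {"namespace": "kiwix", "pod": "zim-watcher-123", "owner_kind": "Job"},
]
remaining = exclude_job_pods(not_ready, job_owned)
```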
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use the correct provisioning field name for Grafana webhook custom
payloads: settings.payload.template (not payloadTemplate).
Found by reading the Go source (grafana/alerting receivers/webhook/v1/config.go):
    Payload *CustomPayload `json:"payload,omitempty"`
    CustomPayload.Template string `json:"template,omitempty"`
The template uses coll.Dict, coll.Append, and data.ToJSON to produce
ntfy-native JSON with topic, title, message, priority, and action
buttons linking to runbooks.
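For reference, the JSON shape the template is aiming for can be sketched
in Python. This is not the Go template itself; the function name, topic,
and URL below are placeholders, and the field names follow ntfy's JSON
publishing format as I understand it (topic, title, message, priority,
actions with a "view" action).

```python
import json


def build_ntfy_payload(topic, alert_name, summary, runbook_url, priority=4):
    """Return an ntfy-style JSON body with a runbook action button."""
    payload = {
        "topic": topic,
        "title": alert_name,
        "message": summary,
        "priority": priority,  # ntfy priorities run from 1 (min) to 5 (max)
        "actions": [
            # A "view" action opens the URL when the button is tapped.
            {"action": "view", "label": "Runbook", "url": runbook_url},
        ],
    }
    return json.dumps(payload)


# Placeholder values, for illustration only.
body = build_ntfy_payload(
    topic="alerts",
    alert_name="ServiceProbeFailure",
    summary="miniflux probe is failing",
    runbook_url="https://example.invalid/runbooks/service-probe-failure",
)
```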
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add reduce step between Prometheus query and threshold to preserve
per-service labels. Without it, Grafana can't distinguish the 5
probe_success series and errors with "duplicate results with labels {}".
Chain: A (prometheus query) → B (reduce last) → C (threshold < 1)
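The chain can be modelled in a few lines of Python to show why the reduce
step matters: each labelled series is collapsed to one value before the
threshold is applied, so every service keeps its own result. The function
names and sample data are illustrative assumptions, not Grafana
internals.

```python
def reduce_last(series):
    """Step B: collapse each labelled series to its most recent sample."""
    return {label: samples[-1] for label, samples in series.items() if samples}


def threshold_below(reduced, limit=1):
    """Step C: a series fires when its reduced value is below the limit."""
    return {label: value < limit for label, value in reduced.items()}


# Step A (hypothetical query result): probe_success samples per service.
probe_success = {
    "miniflux": [1, 1, 0],
    "kiwix": [1, 1, 1],
}
firing = threshold_below(reduce_last(probe_success))
```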
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add ServiceProbeFailure alert rule to Grafana alerting provisioning
  - Queries probe_success metric from Alloy blackbox exporter
  - Extracts service name from job label via label_replace
  - Fires after 2 minutes of failure, noDataState=Alerting
  - Annotations include summary with service name and runbook URL
- Add runbook at docs/how-to/alerts/runbook-service-probe-failure.md
  - Covers all 5 probed services (miniflux, kiwix, transmission, devpi, argocd)
  - Diagnostic steps, common causes, silencing instructions
- Add alerting section to observability.md reference doc
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>