C2: Deploy infrastructure alerting pipeline

eblume commented

2026-03-22 10:29:16 -07:00

Owner

Summary

Mikado chain to replace `mise run services-check` with Grafana Unified Alerting backed by ntfy push notifications.

Design:

- Grafana Unified Alerting evaluates rules against Prometheus/Loki
- ntfy webhook contact point delivers iOS notifications
- Anti-noise policy: page once per 24h per alert group
- Every alert links to a runbook in `docs/how-to/alerts/`
- services-check eventually queries the alerting API instead of doing its own probes

Chain (bottom-up):

1. `configure-grafana-alerting-pipeline`: enable alerting, ntfy contact point, notification policy
2. `first-alert-and-runbook`: end-to-end proof of concept with blackbox probe failure
3. `port-services-check-alerts`: migrate all services-check probes to alert rules + runbooks
4. `refactor-services-check-to-query-alerts`: rewrite services-check to query Grafana API
5. `deploy-infra-alerting`: goal card

🤖 Generated with Claude Code


eblume added 1 commit

2026-03-22 10:29:17 -07:00

C2(deploy-infra-alerting): plan add alerting pipeline cards 1d5990a2f7

Mikado chain for deploying Grafana Unified Alerting with ntfy
notifications, replacing manual services-check probes.

Chain: configure-grafana-alerting-pipeline
     → first-alert-and-runbook
     → port-services-check-alerts
     → refactor-services-check-to-query-alerts
     → deploy-infra-alerting (goal)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

eblume added 1 commit

2026-03-22 10:35:59 -07:00

C2(deploy-infra-alerting): impl configure grafana alerting pipeline 261f20601a

- Enable unified alerting in grafana.ini
- Create alerting.yaml provisioning file with:
  - ntfy-infra webhook contact point (POST to ntfy.ops.eblu.me/infra-alerts)
  - Notification policy: group_wait 1m, group_interval 12h, repeat_interval 24h
  - Message templates for title and runbook links
- Mount alerting provisioning into Grafana deployment
- Add alerting.yaml to kustomization configMapGenerator
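
For illustration, the contact point and notification policy described above can be sketched as a Python dict in the shape of a Grafana alerting provisioning file. This is a minimal sketch, not the actual `alerting.yaml` from this commit: the `https://` scheme, the receiver `uid`, `orgId`, and the `group_by` label are assumptions; only the contact point name, the target host/topic, and the three intervals come from the commit message.

```python
import json


def build_alerting_provisioning(ntfy_url: str) -> dict:
    """Return a provisioning document with one webhook contact point and one policy."""
    return {
        "apiVersion": 1,
        "contactPoints": [
            {
                "orgId": 1,  # assumed default organisation
                "name": "ntfy-infra",
                "receivers": [
                    {
                        "uid": "ntfy-infra",  # hypothetical uid
                        "type": "webhook",
                        "settings": {"url": ntfy_url, "httpMethod": "POST"},
                    }
                ],
            }
        ],
        "policies": [
            {
                "orgId": 1,
                "receiver": "ntfy-infra",
                "group_by": ["alertname"],  # assumed grouping label
                "group_wait": "1m",         # wait before the first notification of a group
                "group_interval": "12h",    # wait before notifying about changes to a group
                "repeat_interval": "24h",   # re-send a still-firing alert at most once a day
            }
        ],
    }


# The URL scheme is an assumption; the commit only names the host and topic.
doc = build_alerting_provisioning("https://ntfy.ops.eblu.me/infra-alerts")
print(json.dumps(doc, indent=2))
```

The `repeat_interval` of 24h is what implements the "page once per 24h per alert group" policy from the summary.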

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

eblume added 1 commit

2026-03-22 10:51:13 -07:00

C2(deploy-infra-alerting): close configure-grafana-alerting-pipeline c1acc808d5

Pipeline verified:
- Grafana unified alerting enabled with provisioned contact point and policy
- ntfy webhook contact point delivering to infra-alerts topic
- Notification policy: group_wait 1m, group_interval 12h, repeat_interval 24h
- iOS push notifications confirmed working

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

eblume added 1 commit

2026-03-22 10:57:32 -07:00

C2(deploy-infra-alerting): impl add first alert rule and runbook 549c57ab82

- Add ServiceProbeFailure alert rule to Grafana alerting provisioning
  - Queries probe_success metric from Alloy blackbox exporter
  - Extracts service name from job label via label_replace
  - Fires after 2 minutes of failure, noDataState=Alerting
  - Annotations include summary with service name and runbook URL
- Add runbook at docs/how-to/alerts/runbook-service-probe-failure.md
  - Covers all 5 probed services (miniflux, kiwix, transmission, devpi, argocd)
  - Diagnostic steps, common causes, silencing instructions
- Add alerting section to observability.md reference doc

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

eblume added 1 commit

2026-03-22 11:00:32 -07:00

C2(deploy-infra-alerting): impl fix alert rule multi-series evaluation 94413f73ba

Add reduce step between Prometheus query and threshold to preserve
per-service labels. Without it, Grafana can't distinguish the 5
probe_success series and errors with "duplicate results with labels {}".

Chain: A (prometheus query) → B (reduce last) → C (threshold < 1)
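
The A → B → C chain can be modelled in a few lines of Python. This is a simplified sketch of the idea, not Grafana's expression engine: each series keeps its label set as the dictionary key, so the reduce step yields one value per service and the threshold step yields one alert instance per service. The sample values are made up.

```python
from typing import Dict, List, Tuple

# A label set is represented as a sorted tuple of (name, value) pairs.
Labels = Tuple[Tuple[str, str], ...]


def reduce_last(series: Dict[Labels, List[float]]) -> Dict[Labels, float]:
    """Step B: collapse each time series to its last sample, keeping its labels."""
    return {labels: values[-1] for labels, values in series.items() if values}


def threshold_below(reduced: Dict[Labels, float], limit: float) -> Dict[Labels, bool]:
    """Step C: an instance fires when its reduced value is below the limit."""
    return {labels: value < limit for labels, value in reduced.items()}


# Step A: hypothetical probe_success samples for two of the probed services.
probe_success = {
    (("service", "miniflux"),): [1.0, 1.0, 1.0],
    (("service", "kiwix"),): [1.0, 0.0, 0.0],
}

firing = threshold_below(reduce_last(probe_success), 1.0)
for labels, is_firing in firing.items():
    print(dict(labels), "firing" if is_firing else "ok")
```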

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

eblume added 1 commit

2026-03-22 11:18:59 -07:00

C2(deploy-infra-alerting): impl use custom payload for ntfy-native JSON e7f50bc4fd

Replace Grafana's default webhook JSON with ntfy-native JSON via
payloadTemplate. The template produces:
  {"topic":"infra-alerts","title":"[FIRING] ...","message":"...","actions":[...]}

This gives clean notifications instead of raw Grafana JSON blobs.
Uses coll.Dict/data.ToJSON template functions (Grafana 12+).
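
As a reference for the target shape, here is a minimal Python sketch that builds the same kind of ntfy JSON publish body outside of Grafana. It is not the Grafana template itself. The `view` action type and the top-level keys follow ntfy's JSON publishing format; the function name, the summary text, and the runbook URL are hypothetical.

```python
import json


def ntfy_payload(status: str, alert_name: str, summary: str,
                 runbook_url: str, topic: str = "infra-alerts") -> dict:
    """Build an ntfy JSON publish body with a single 'view' action button."""
    return {
        "topic": topic,
        "title": f"[{status.upper()}] {alert_name}",
        "message": summary,
        "actions": [
            {"action": "view", "label": "Open Runbook", "url": runbook_url},
        ],
    }


payload = ntfy_payload(
    status="firing",
    alert_name="ServiceProbeFailure",
    summary="kiwix probe is failing",                 # hypothetical summary text
    runbook_url="https://docs.example.invalid/runbook",  # placeholder URL
)
print(json.dumps(payload))
```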

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

eblume added 1 commit

2026-03-22 11:26:55 -07:00

C2(deploy-infra-alerting): impl fix ntfy payload template syntax ae67899bcf

Use confirmed Grafana 12 template functions (coll.Dict, coll.Slice,
coll.Append, data.ToJSON) to build ntfy-native JSON payload.
Previous attempt hit 400 from ntfy — likely malformed output from
incorrect template syntax. This version uses the documented patterns
from Grafana's notification template reference.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

eblume added 1 commit

2026-03-22 11:34:02 -07:00

TEMP: debug webhook URL 03a87a40bb

eblume added 1 commit

2026-03-22 11:46:02 -07:00

TEMP: debug intervals 64ae12ad71

eblume force-pushed mikado/deploy-infra-alerting from 64ae12ad71 to 4c0bd0055f

2026-03-22 11:59:40 -07:00


eblume added 1 commit

2026-03-22 12:05:52 -07:00

C2(deploy-infra-alerting): close first-alert-and-runbook e33b0bc184

End-to-end alerting pipeline verified:
- ServiceProbeFailure alert rule evaluates 5 blackbox probes
- Grafana custom payload produces ntfy-native JSON (topic, title,
  message, priority, actions)
- Firing notification arrives on iOS with clean formatting
- "Open Runbook" action button links to docs.eblu.me runbook
- Resolved notification delivered on recovery

Key discovery: Grafana webhook custom payload provisioning field is
settings.payload.template (nested object), not payloadTemplate
(flat string). Found by reading grafana/alerting Go source.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

eblume added 1 commit

2026-03-22 12:11:23 -07:00

C2(deploy-infra-alerting): impl add probes and alert rules for services-check coverage 8e6a803076

Extend Alloy blackbox probes:
- Add prometheus, loki, grafana, teslamate, immich, navidrome
- Now probing 11 services (was 5), covering most HTTP checks from
  services-check

Add alert rules:
- PostgresClusterUnhealthy: cnpg_collector_up < 1 for 3m (critical)
- PodNotReady: kube_pod_status_ready{condition="true"} == 0 for 5m

Add runbooks:
- runbook-postgres-unhealthy.md
- runbook-pod-not-ready.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

eblume added 1 commit

2026-03-22 12:19:39 -07:00

C2(deploy-infra-alerting): impl exclude Job-owned pods from PodNotReady 02e07aeb41

CronJob pods (e.g., zim-watcher) are expected to complete and become
not-ready. Exclude them with `unless on (namespace, pod) kube_pod_owner{owner_kind="Job"}`.
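
The effect of `unless on (namespace, pod)` can be sketched in Python as a set difference keyed on the listed labels: a left-hand series is dropped whenever any right-hand series has the same `namespace` and `pod`. The pod and namespace names below are hypothetical sample data, and the helper is an illustration of the matching rule, not PromQL's implementation.

```python
from typing import Dict, List, Sequence


def unless_on(left: List[Dict[str, str]], right: List[Dict[str, str]],
              on: Sequence[str]) -> List[Dict[str, str]]:
    """Keep left-hand series that have no right-hand match on the given labels."""
    right_keys = {tuple(series.get(label) for label in on) for series in right}
    return [
        series for series in left
        if tuple(series.get(label) for label in on) not in right_keys
    ]


# Pods currently reporting not-ready (hypothetical).
not_ready = [
    {"namespace": "kiwix", "pod": "zim-watcher-123"},
    {"namespace": "default", "pod": "miniflux-abc"},
]
# Pods owned by a Job, as kube_pod_owner{owner_kind="Job"} would report them.
job_owned = [
    {"namespace": "kiwix", "pod": "zim-watcher-123", "owner_kind": "Job"},
]

alerting_pods = unless_on(not_ready, job_owned, on=("namespace", "pod"))
print(alerting_pods)
```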

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

eblume added 1 commit

2026-03-22 13:35:33 -07:00

C2(deploy-infra-alerting): impl fix PodNotReady noDataState to OK b2b0d6efa7

No unhealthy pods = no query results = noData state. With noDataState
set to NoData, Grafana fires an alert with empty labels ("Pod in is
not ready"). Change to OK since no results means everything is healthy.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

eblume added 1 commit

2026-03-22 13:43:23 -07:00

C2(deploy-infra-alerting): impl add textfile staleness and Frigate alerts 2fa536e547

- TextfileStale: fires when a .prom textfile on indri hasn't been
  updated in 1 hour (node_textfile_mtime_seconds). Covers borgmatic,
  zot, minikube, jellyfin exporters.
- FrigateCameraDown: fires when frigate_camera_fps drops to 0 for 5m.
- Add runbooks for both alerts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

eblume added 1 commit

2026-03-22 13:45:42 -07:00

C2(deploy-infra-alerting): impl add ArgoCD scrape and sync alert 957ee90fa2

- Add ArgoCD metrics scrape target to Prometheus (argocd-metrics:8082)
- Add ArgoCDAppOutOfSync alert: fires when argocd_app_info has
  sync_status != Synced for 30 minutes
- Add runbook with diagnostic steps and common fixes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

eblume added 1 commit

2026-03-22 14:05:37 -07:00

C2(deploy-infra-alerting): impl fix Grafana probe port (80 not 3000) d9ab004479

Grafana's k8s Service maps port 80 → container port 3000. The
blackbox probe was targeting port 3000 directly on the Service
ClusterIP, which doesn't work — connection refused.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

eblume added 1 commit

2026-03-22 14:10:10 -07:00

C2(deploy-infra-alerting): impl fix TextfileStale noDataState to OK da452e2bf5

Same pattern as PodNotReady: when no textfiles are stale, the query
returns no data. noDataState=Alerting incorrectly treats this as a
problem. Changed to OK.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

eblume added 1 commit

2026-03-22 14:13:26 -07:00

C2(deploy-infra-alerting): impl fix TextfileStale to always return data cdd85c7ac9

Query all textfile mtimes (time() - node_textfile_mtime_seconds) and
threshold at > 3600s, instead of filtering with > 3600 which returns
empty results when everything is fresh.

This means:
- Fresh textfiles: query returns low values, threshold not met → OK
- Stale textfiles: query returns high values, threshold met → Alerting
- Missing textfiles: series vanishes, noDataState=Alerting → Alerting
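
The three cases can be checked against a simplified Python model of rule evaluation. This is a sketch under stated assumptions: it ignores the pending ("for") period, and it models "missing" as the query returning no series at all, which is when the rule's noDataState applies. The function name and file names are hypothetical.

```python
from typing import Dict


def evaluate_rule(ages_seconds: Dict[str, float], threshold: float = 3600.0,
                  no_data_state: str = "Alerting") -> Dict[str, str]:
    """Map each textfile's age (time() - mtime) to a state.

    An empty query result falls back to the configured noDataState.
    """
    if not ages_seconds:
        return {"<no data>": no_data_state}
    return {
        name: "Alerting" if age > threshold else "OK"
        for name, age in ages_seconds.items()
    }


fresh = evaluate_rule({"borgmatic.prom": 120.0})
stale = evaluate_rule({"borgmatic.prom": 7200.0})
missing = evaluate_rule({})
print(fresh, stale, missing)
```

Because the query now returns a value for every textfile, an empty result only occurs when the series themselves are gone, so treating no data as Alerting no longer produces false positives.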

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

eblume added 1 commit

2026-03-22 14:21:51 -07:00

C2(deploy-infra-alerting): impl refactor services-check to query alerts 52eed44542

Replace covered checks with Grafana alerting API queries:
- ServiceProbeFailure: 11 HTTP endpoints
- TextfileStale: metrics textfile freshness
- FrigateCameraDown: camera FPS
- PodNotReady: pod readiness (both clusters)
- PostgresClusterUnhealthy: database health
- ArgoCDAppOutOfSync: ArgoCD sync status

Uncovered checks remain as direct probes (SSH, launchctl, public
endpoints, k8s API, frigate storage, some HTTP endpoints).

Firing alerts display summary and clickable runbook link.
Grafana credentials fetched from 1Password; graceful fallback
if unavailable.
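
A minimal sketch of the display step, assuming the alerts arrive as a list in the Alertmanager v2 alert shape (`labels`, `annotations`, `status.state`). The HTTP call, the 1Password lookup, and the actual services-check code are omitted; the function name and the `runbook_url` annotation key are assumptions, not taken from the implementation.

```python
from typing import Any, Dict, List


def firing_alert_lines(alerts: List[Dict[str, Any]]) -> List[str]:
    """Format active alerts as 'name: summary (runbook)' lines."""
    lines = []
    for alert in alerts:
        # Skip suppressed or unprocessed alerts; only active ones are firing.
        if alert.get("status", {}).get("state") != "active":
            continue
        name = alert.get("labels", {}).get("alertname", "unknown")
        annotations = alert.get("annotations", {})
        summary = annotations.get("summary", "(no summary)")
        runbook = annotations.get("runbook_url", "")  # assumed annotation key
        line = f"{name}: {summary}"
        if runbook:
            line += f" ({runbook})"
        lines.append(line)
    return lines


# Hypothetical response body, already decoded from JSON.
sample = [
    {
        "labels": {"alertname": "ServiceProbeFailure", "service": "kiwix"},
        "annotations": {
            "summary": "kiwix probe is failing",
            "runbook_url": "https://docs.example.invalid/runbook",
        },
        "status": {"state": "active"},
    },
    {
        "labels": {"alertname": "PodNotReady"},
        "annotations": {"summary": "silenced pod"},
        "status": {"state": "suppressed"},
    },
]

report = firing_alert_lines(sample)
for line in report:
    print(line)
```

An empty list from this function would correspond to "no covered check is failing"; the graceful fallback mentioned above would wrap the fetch step, not this formatter.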

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

eblume added 2 commits

2026-03-22 14:24:50 -07:00

C2(deploy-infra-alerting): close port-services-check-alerts 2e2a33d7ca

6 alert rules covering services-check probes:
- ServiceProbeFailure (11 HTTP probes via Alloy blackbox)
- PodNotReady (kube-state-metrics, both clusters)
- PostgresClusterUnhealthy (CNPG collector)
- TextfileStale (node_textfile_mtime_seconds)
- FrigateCameraDown (frigate_camera_fps)
- ArgoCDAppOutOfSync (argocd_app_info)

6 runbooks in docs/how-to/alerts/.

Remaining uncovered: local indri services (brew/launchctl), ringtail
SSH/tailscale, public Fly.io endpoints, k8s API health, frigate
storage. These are effectively covered by downstream alerts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

C2(deploy-infra-alerting): close refactor-services-check-to-query-alerts d53e6244a0

services-check now queries Grafana alerting API for covered checks,
displaying firing alerts with summary and runbook links. Uncovered
checks remain as direct probes. Grafana credentials from 1Password
with graceful fallback.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

eblume added 1 commit

2026-03-22 14:29:33 -07:00

C2(deploy-infra-alerting): finalize rewrite cards as historical docs c22f9db1c8

Remove all Mikado frontmatter (status, branch, requires) from chain
cards. Add changelog fragment.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

eblume force-pushed mikado/deploy-infra-alerting from c22f9db1c8 to 67883950c3

2026-03-22 14:41:03 -07:00


eblume merged commit 6d65e6928c into main

2026-03-22 14:52:56 -07:00

eblume referenced this pull request from a commit

2026-03-22 14:52:58 -07:00

C2: Deploy infrastructure alerting pipeline (#303)
