From 1d5990a2f704370351b7dcb7e8e188a398be782a Mon Sep 17 00:00:00 2001
From: Erich Blume
Date: Sun, 22 Mar 2026 10:28:31 -0700
Subject: [PATCH] C2(deploy-infra-alerting): plan add alerting pipeline cards
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Mikado chain for deploying Grafana Unified Alerting with ntfy notifications, replacing manual services-check probes.

Chain: configure-grafana-alerting-pipeline → first-alert-and-runbook → port-services-check-alerts → refactor-services-check-to-query-alerts → deploy-infra-alerting (goal)

Co-Authored-By: Claude Opus 4.6 (1M context)
---
 .../configure-grafana-alerting-pipeline.md    | 60 ++++++++++++++
 docs/how-to/alerts/deploy-infra-alerting.md   | 81 +++++++++++++++++++
 docs/how-to/alerts/first-alert-and-runbook.md | 70 ++++++++++++++++
 .../alerts/port-services-check-alerts.md      | 77 ++++++++++++++++++
 ...refactor-services-check-to-query-alerts.md | 56 +++++++++++++
 5 files changed, 344 insertions(+)
 create mode 100644 docs/how-to/alerts/configure-grafana-alerting-pipeline.md
 create mode 100644 docs/how-to/alerts/deploy-infra-alerting.md
 create mode 100644 docs/how-to/alerts/first-alert-and-runbook.md
 create mode 100644 docs/how-to/alerts/port-services-check-alerts.md
 create mode 100644 docs/how-to/alerts/refactor-services-check-to-query-alerts.md

diff --git a/docs/how-to/alerts/configure-grafana-alerting-pipeline.md b/docs/how-to/alerts/configure-grafana-alerting-pipeline.md
new file mode 100644
index 0000000..2c6999b
--- /dev/null
+++ b/docs/how-to/alerts/configure-grafana-alerting-pipeline.md
@@ -0,0 +1,60 @@
+---
+title: Configure Grafana Alerting Pipeline
+modified: 2026-03-22
+status: active
+tags:
+  - how-to
+  - alerting
+  - grafana
+---
+
+# Configure Grafana Alerting Pipeline
+
+Enable Grafana Unified Alerting, create an ntfy webhook contact point, configure the notification policy with anti-noise settings, and set up a message template with runbook links.
+
+## What to Do
+
+### 1. Enable Unified Alerting in grafana.ini
+
+Add the `[unified_alerting]` section to the Grafana ConfigMap. Grafana 11+ has unified alerting enabled by default, but we should be explicit and configure the evaluation interval.
+
+### 2. Create Alerting Provisioning Files
+
+Grafana supports provisioning alert resources via YAML files in `/etc/grafana/provisioning/alerting/`. Create:
+
+- **Contact point** — ntfy webhook targeting `https://ntfy.ops.eblu.me/infra-alerts` via Caddy (not the cluster-internal `http://ntfy.ntfy.svc.cluster.local:80/infra-alerts`, which is unreachable because Grafana and ntfy run on different clusters)
+- **Notification policy** — root policy with `group_wait: 1m`, `group_interval: 12h`, `repeat_interval: 24h`, grouped by `alertname` and `service`
+- **Message template** — format that includes alert name, summary, and a clickable runbook URL as an ntfy action button
+
+### 3. Mount Provisioning into Grafana
+
+Add the alerting provisioning ConfigMap to the Grafana deployment, mounted at `/etc/grafana/provisioning/alerting/`.
+
+### 4. Create the `infra-alerts` Topic
+
+ntfy topics are created on first publish, so no explicit setup is needed. Still, send a test notification to verify that the topic works.
+
+### 5. Verify End-to-End
+
+- Grafana UI shows the ntfy contact point under Alerting → Contact Points
+- Notification policy shows the anti-noise settings
+- Test notification from Grafana reaches the ntfy iOS app
+
+## Key Details
+
+- Grafana runs on minikube (indri), ntfy runs on k3s (ringtail). The contact point URL must go through Caddy: `https://ntfy.ops.eblu.me/infra-alerts`
+- ntfy action buttons use the `X-Actions` header or JSON body format: `view, Open Runbook, `
+- Grafana provisioning files are applied on startup and cannot be edited from the UI (which is what we want for GitOps)
+
+## Verification
+
+- [ ] Grafana starts with unified alerting enabled
+- [ ] Contact point `ntfy-infra` visible in Grafana UI
+- [ ] Notification policy shows correct group/repeat intervals
+- [ ] Test notification arrives on iOS via ntfy app
+- [ ] Test notification includes a clickable runbook link
+
+## Related
+
+- [[deploy-infra-alerting]] — Parent goal
+- [[first-alert-and-runbook]] — Next: create the first real alert
diff --git a/docs/how-to/alerts/deploy-infra-alerting.md b/docs/how-to/alerts/deploy-infra-alerting.md
new file mode 100644
index 0000000..7c2e7f0
--- /dev/null
+++ b/docs/how-to/alerts/deploy-infra-alerting.md
@@ -0,0 +1,81 @@
+---
+title: Deploy Infrastructure Alerting Pipeline
+modified: 2026-03-22
+status: active
+branch: mikado/deploy-infra-alerting
+requires:
+  - refactor-services-check-to-query-alerts
+tags:
+  - how-to
+  - alerting
+  - observability
+---
+
+# Deploy Infrastructure Alerting Pipeline
+
+Replace the manual `mise run services-check` approach with Grafana Unified Alerting backed by ntfy push notifications, so infrastructure problems page once and include actionable runbook links.
+
+## Architecture
+
+```
+Prometheus (metrics) ──┐
+                       ├──▶ Grafana Alert Rules ──▶ ntfy webhook ──▶ iOS push
+Loki (logs) ──────────┘              │
+                                     │
+                            Notification Policy
+                            (group_wait: 1m,
+                             group_interval: 12h,
+                             repeat_interval: 24h)
+```
+
+## Design Decisions
+
+| Decision | Choice | Rationale |
+|----------|--------|-----------|
+| **Alert engine** | Grafana Unified Alerting | Already deployed, no new service needed |
+| **Notification** | ntfy webhook contact point | Already deployed on ringtail, iOS app works |
+| **Anti-noise** | 24h repeat interval | Page once per day max per alert group |
+| **Runbooks** | `docs/how-to/alerts/.md` | Clickable link in every notification |
+| **Provisioning** | Grafana provisioning YAML (GitOps) | Alerts defined in repo, not just UI |
+| **Topic** | `infra-alerts` (separate from `frigate-alerts`) | Different severity/audience |
+
+## Alerting Policy
+
+- Each alert fires **once** and does not re-notify for 24 hours
+- A "resolved" notification is sent when the condition clears
+- Every alert annotation includes `runbook_url` linking to its how-to doc
+- The ntfy message template renders the runbook URL as a clickable action button
+- Alerts are grouped by service to avoid notification storms
+
+## Migration Path
+
+1. Stand up the pipeline: Grafana alerting config, ntfy contact point, notification policy, message template
+2. Create the first alert + runbook as proof of concept (e.g., a blackbox probe failure)
+3. Port services-check health checks to Grafana alert rules, one by one, each with a runbook
+4. Refactor services-check to query the Grafana alerting API instead of doing its own probes
+
+## What services-check Covers Today
+
+These checks will be migrated to alert rules:
+
+| Category | Checks | Data Source |
+|----------|--------|-------------|
+| Local services (indri) | forgejo, alloy, borgmatic, zot via brew/launchctl | Need new probes or textfile metrics |
+| Metrics textfiles | freshness of `.prom` files | Existing node_textfile metrics |
+| K8s cluster health | minikube API, k3s API | kube-state-metrics |
+| HTTP endpoints | ~12 services via Caddy | Alloy blackbox exporter (already exists) |
+| Ringtail | SSH, tailscale, k3s health | Need new probes |
+| K3s pods | ntfy, authentik, frigate, etc. | kube-state-metrics on ringtail |
+| Public services | docs, cv, forge via Fly.io | Alloy on Fly.io or external probe |
+| PostgreSQL | CNPG readiness | CNPG metrics (already scraped) |
+| ArgoCD sync | app sync/health status | ArgoCD metrics or API |
+
+## Related
+
+- [[configure-grafana-alerting-pipeline]] — Foundation: contact point, policy, template
+- [[first-alert-and-runbook]] — Proof of concept alert
+- [[port-services-check-alerts]] — Systematic migration
+- [[refactor-services-check-to-query-alerts]] — Final integration
+- [[observability]] — Current observability stack
+- [[ntfy]] — Push notification service
+- [[grafana]] — Dashboard and alerting platform
diff --git a/docs/how-to/alerts/first-alert-and-runbook.md b/docs/how-to/alerts/first-alert-and-runbook.md
new file mode 100644
index 0000000..bec5aaa
--- /dev/null
+++ b/docs/how-to/alerts/first-alert-and-runbook.md
@@ -0,0 +1,70 @@
+---
+title: First Alert and Runbook
+modified: 2026-03-22
+status: active
+requires:
+  - configure-grafana-alerting-pipeline
+tags:
+  - how-to
+  - alerting
+---
+
+# First Alert and Runbook
+
+Create one end-to-end alert as proof of concept — an alert rule that fires, delivers a notification to ntfy with a runbook link, and has a corresponding runbook doc.
+
+## What to Do
+
+### 1. Choose the First Alert
+
+The best candidate is a **blackbox probe failure** because:
+- Alloy's blackbox exporter already probes 5 services (miniflux, kiwix, transmission, devpi, argocd) at 30s intervals
+- The metric `probe_success` is already in Prometheus
+- It maps directly to what services-check does (HTTP health checks)
+- A single alert rule with a `service` label can cover all probed services
+
+### 2. Create the Alert Rule
+
+Provision via YAML in the alerting provisioning ConfigMap. The rule should:
+- Query `probe_success == 0` from Prometheus
+- Fire after the condition persists for 2 minutes (avoid flapping)
+- Include labels: `severity: warning`, `service: {{ $labels.instance }}`
+- Include annotations: `summary`, `runbook_url` pointing to the runbook doc
+
+### 3. Create the Runbook
+
+Write `docs/how-to/alerts/runbook-service-probe-failure.md` as a how-to doc explaining:
+- What the alert means
+- How to check which service is down
+- Common causes and resolution steps
+- How to silence the alert if the downtime is planned
+
+### 4. Verify End-to-End
+
+- Stop one of the probed services (e.g., scale miniflux to 0)
+- Wait for the alert to fire (~2 minutes)
+- Confirm ntfy notification arrives with correct summary and runbook link
+- Click the runbook link and verify it reaches docs.eblu.me
+- Scale the service back up
+- Confirm "resolved" notification arrives
+- Confirm no repeat notification during the 24h window
+
+## Key Details
+
+- Grafana alert rules can be provisioned as YAML files alongside contact points and notification policies
+- The blackbox probe metrics from Alloy use the job name `blackbox` and include an `instance` label with the service name
+- The runbook URL format: `https://docs.eblu.me/how-to/alerts/runbook-service-probe-failure`
+
+## Verification
+
+- [ ] Alert rule appears in Grafana UI under Alerting → Alert Rules
+- [ ] Simulated failure triggers ntfy notification within ~3 minutes
+- [ ] Notification includes service name, summary, and clickable runbook link
+- [ ] Resolution triggers a "resolved" notification
+- [ ] No repeat notification within 24h window
+
+## Related
+
+- [[configure-grafana-alerting-pipeline]] — Prerequisite: pipeline must be working
+- [[deploy-infra-alerting]] — Parent goal
+- [[port-services-check-alerts]] — Next: port remaining checks
diff --git a/docs/how-to/alerts/port-services-check-alerts.md b/docs/how-to/alerts/port-services-check-alerts.md
new file mode 100644
index 0000000..807c340
--- /dev/null
+++ b/docs/how-to/alerts/port-services-check-alerts.md
@@ -0,0 +1,77 @@
+---
+title: Port services-check Alerts to Grafana
+modified: 2026-03-22
+status: active
+requires:
+  - first-alert-and-runbook
+tags:
+  - how-to
+  - alerting
+---
+
+# Port services-check Alerts to Grafana
+
+Systematically migrate the health checks from `mise run services-check` to Grafana alert rules, each with a corresponding runbook. After this card, the alerting system covers everything services-check does today.
+
+## What to Do
+
+### 1. Inventory and Prioritize
+
+Map each services-check probe to a data source and alert rule. Some checks already have metrics in Prometheus; others need new instrumentation.
+
+**Already have metrics (easy):**
+- HTTP endpoint probes → Alloy blackbox exporter (`probe_success`)
+- PostgreSQL health → CNPG metrics (`cnpg_pg_replication_streaming`, `cnpg_collector_up`)
+- K8s pod health → kube-state-metrics (`kube_pod_status_phase`)
+- ArgoCD sync status → ArgoCD metrics (`argocd_app_info` with sync/health labels)
+
+**Need new probes or metrics:**
+- Local indri services (forgejo, alloy, borgmatic, zot via brew/launchctl) → Alloy host textfile or new probes
+- Metrics textfile freshness → `node_textfile_mtime_seconds` (already collected by Alloy on indri)
+- Ringtail SSH/tailscale health → Alloy blackbox on ringtail or cross-cluster probe
+- Public services (docs, cv, forge via Fly.io) → Alloy on Fly.io or Grafana synthetic monitoring
+
+### 2. Add Missing Probes
+
+Extend Alloy configurations where needed:
+- **Alloy on indri:** Add blackbox targets for forgejo, zot (local HTTP endpoints)
+- **Alloy on ringtail:** Add blackbox targets for ringtail-local services
+- **Consider:** Whether public endpoint probing belongs in Fly.io Alloy or a separate prober
+
+### 3. Create Alert Rules
+
+For each check category, create provisioned Grafana alert rules. Group related checks into alert rule groups (e.g., "indri-services", "k8s-health", "public-endpoints").
+
+### 4. Create Runbooks
+
+One runbook per alert type in `docs/how-to/alerts/runbook-.md`. Each runbook should cover:
+- What the alert means
+- Diagnostic steps
+- Common fixes
+- How to silence for planned maintenance
+
+### 5. Remove from services-check
+
+As each check is ported, remove it from the services-check script (or mark it as "now handled by alerting"). The goal is that services-check shrinks as alerting grows.
+
+## Key Details
+
+- Don't try to port everything in one session — this card may span multiple work cycles within the C2 chain
+- Prioritize checks that have caught real problems in the past
+- Some checks (like ArgoCD sync status table) may remain in services-check as a human-readable summary even after alerting covers the failure cases
+- The Alloy blackbox exporter on k8s already covers 5 services; extending it to more is straightforward
+
+## Verification
+
+- [ ] All HTTP endpoint checks from services-check have corresponding alert rules
+- [ ] Pod health checks have corresponding alert rules
+- [ ] PostgreSQL health has a corresponding alert rule
+- [ ] Each alert rule has a runbook doc in `docs/how-to/alerts/`
+- [ ] Test at least 2-3 failure scenarios end-to-end
+- [ ] services-check script has been updated to reflect ported checks
+
+## Related
+
+- [[first-alert-and-runbook]] — Prerequisite: established the pattern
+- [[deploy-infra-alerting]] — Parent goal
+- [[refactor-services-check-to-query-alerts]] — Next: make services-check query alerts
diff --git a/docs/how-to/alerts/refactor-services-check-to-query-alerts.md b/docs/how-to/alerts/refactor-services-check-to-query-alerts.md
new file mode 100644
index 0000000..640bcff
--- /dev/null
+++ b/docs/how-to/alerts/refactor-services-check-to-query-alerts.md
@@ -0,0 +1,56 @@
+---
+title: Refactor services-check to Query Alerts
+modified: 2026-03-22
+status: active
+requires:
+  - port-services-check-alerts
+tags:
+  - how-to
+  - alerting
+---
+
+# Refactor services-check to Query Alerts
+
+Change `mise run services-check` from doing its own health probes to querying the Grafana alerting API for currently firing alerts. The script becomes a CLI view into the same alerting system that sends ntfy notifications.
+
+## What to Do
+
+### 1. Query the Grafana Alerting API
+
+Grafana exposes alert state via:
+- `GET /api/v1/provisioning/alert-rules` — all configured rules
+- `GET /api/prometheus/grafana/api/v1/alerts` — currently firing alerts (Prometheus-compatible format)
+
+The second endpoint is simpler — it returns only active alerts with labels and annotations, similar to Prometheus's `/api/v1/alerts`.
+
+### 2. Rewrite services-check
+
+The new services-check should:
+1. Query the Grafana alerting API for firing alerts
+2. Display them in a table with service name, alert name, duration, and runbook link
+3. If no alerts are firing, print a green "all clear" message
+4. Exit 0 if no alerts, exit 1 if any are firing
+5. Optionally keep a few checks that don't map to alerting (e.g., the ArgoCD sync status table as a summary view)
+
+### 3. Handle Authentication
+
+services-check will need a Grafana API token or service account token. Options:
+- Use the existing Grafana admin credentials from 1Password (`op read`)
+- Create a dedicated read-only service account in Grafana
+
+### 4. Preserve the ArgoCD Summary
+
+The ArgoCD sync/health table in services-check is a useful quick view even when nothing is alerting. Consider keeping it as a separate section that always displays, independent of the alert query.
+
+## Verification
+
+- [ ] `mise run services-check` queries Grafana instead of doing direct probes
+- [ ] Firing alerts are displayed with service name, alert name, and runbook link
+- [ ] Exit code reflects alert state (0 = clear, 1 = firing)
+- [ ] Works when Grafana is unreachable (graceful error, not a crash)
+- [ ] ArgoCD summary table still works
+
+## Related
+
+- [[port-services-check-alerts]] — Prerequisite: alerts must exist to query
+- [[deploy-infra-alerting]] — Parent goal
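
The "Rewrite services-check" step in the last card (query for firing alerts, print a table, exit 0 or 1, fail gracefully when Grafana is unreachable) could be sketched roughly as follows. This is a minimal illustration, not the planned implementation. The patch does not say what language services-check is written in, so Python is an assumption. The `GRAFANA_URL` and `GRAFANA_TOKEN` environment variables, the function names, the exit code 2 for an unreachable Grafana, and the exact spelling of the `state` field are also assumptions. Only the endpoint path, the `service` label, and the `runbook_url` annotation come from the cards.

```python
import json
import os
import urllib.error
import urllib.request

# Endpoint named in the card; everything else here is an illustrative assumption.
ALERTS_PATH = "/api/prometheus/grafana/api/v1/alerts"
# Assumption: the state string may read "firing" or "Alerting" depending on the
# Grafana version, so match either prefix case-insensitively.
FIRING_PREFIXES = ("firing", "alerting")


def fetch_payload(base_url, token, timeout=10):
    """Fetch the alert list as parsed JSON, authenticating with a bearer token."""
    request = urllib.request.Request(
        base_url.rstrip("/") + ALERTS_PATH,
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def firing_alerts(payload):
    """Keep only the alerts whose state looks like a firing state."""
    alerts = payload.get("data", {}).get("alerts", [])
    return [
        alert for alert in alerts
        if str(alert.get("state", "")).lower().startswith(FIRING_PREFIXES)
    ]


def render(alerts):
    """Render firing alerts as a plain-text table, or an all-clear line."""
    if not alerts:
        return "All clear: no alerts firing"
    rows = [("SERVICE", "ALERT", "ACTIVE SINCE", "RUNBOOK")]
    for alert in alerts:
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        rows.append((
            labels.get("service", "-"),
            labels.get("alertname", "-"),
            str(alert.get("activeAt", "-")),
            annotations.get("runbook_url", "-"),
        ))
    widths = [max(len(row[i]) for row in rows) for i in range(len(rows[0]))]
    return "\n".join(
        "  ".join(cell.ljust(width) for cell, width in zip(row, widths)).rstrip()
        for row in rows
    )


def exit_code(alerts):
    """0 when nothing is firing, 1 otherwise, as the card specifies."""
    return 1 if alerts else 0


def main():
    """Hypothetical entry point: reads GRAFANA_URL and GRAFANA_TOKEN from the environment."""
    base_url = os.environ.get("GRAFANA_URL", "")
    token = os.environ.get("GRAFANA_TOKEN", "")
    try:
        alerts = firing_alerts(fetch_payload(base_url, token))
    except (urllib.error.URLError, ValueError, OSError) as error:
        # Graceful failure when Grafana is unreachable or misconfigured.
        print(f"Could not query Grafana: {error}")
        return 2  # assumption: a distinct code for "could not check"
    print(render(alerts))
    return exit_code(alerts)
```

A mise task could call `sys.exit(main())` after `import sys`. Keeping `firing_alerts`, `render`, and `exit_code` free of network access means they can be exercised against a canned payload without a running Grafana.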