C2(deploy-infra-alerting): plan add alerting pipeline cards

Mikado chain for deploying Grafana Unified Alerting with ntfy
notifications, replacing manual services-check probes.

Chain: configure-grafana-alerting-pipeline
     → first-alert-and-runbook
     → port-services-check-alerts
     → refactor-services-check-to-query-alerts
     → deploy-infra-alerting (goal)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Erich Blume 2026-03-22 10:28:31 -07:00
commit 1d5990a2f7
5 changed files with 344 additions and 0 deletions

@ -0,0 +1,60 @@
---
title: Configure Grafana Alerting Pipeline
modified: 2026-03-22
status: active
tags:
- how-to
- alerting
- grafana
---
# Configure Grafana Alerting Pipeline
Enable Grafana Unified Alerting, create an ntfy webhook contact point, configure the notification policy with anti-noise settings, and set up a message template with runbook links.
## What to Do
### 1. Enable Unified Alerting in grafana.ini
Add the `[unified_alerting]` section to the Grafana ConfigMap. Grafana 11+ enables unified alerting by default, but set `enabled = true` explicitly so the intent is visible in Git, and configure the evaluation interval in the same section.
### 2. Create Alerting Provisioning Files
Grafana supports provisioning alert resources via YAML files in `/etc/grafana/provisioning/alerting/`. Create:
- **Contact point** — ntfy webhook targeting `https://ntfy.ops.eblu.me/infra-alerts` via Caddy. The cluster-internal address `http://ntfy.ntfy.svc.cluster.local:80/infra-alerts` will not work here because Grafana and ntfy run on different clusters.
- **Notification policy** — root policy with `group_wait: 1m`, `group_interval: 12h`, `repeat_interval: 24h`, grouped by `alertname` and `service`
- **Message template** — format that includes alert name, summary, and a clickable runbook URL as an ntfy action button
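As a rough sketch of what the contact point and notification policy could look like, the snippet below builds both as one provisioning document and prints it as JSON (which avoids a third-party YAML library). The field names follow Grafana's file-provisioning format as best understood here and should be checked against the docs for the deployed version. The `orgId` value and the receiver `uid` are assumptions, and the message template is left out.

```python
import json

NTFY_URL = "https://ntfy.ops.eblu.me/infra-alerts"  # Caddy route from this card
RECEIVER = "ntfy-infra"  # contact point name used in the verification checklist


def build_provisioning() -> dict:
    """Return the contact point and root policy in file-provisioning shape."""
    return {
        "apiVersion": 1,
        "contactPoints": [
            {
                "orgId": 1,  # assumption: default organisation
                "name": RECEIVER,
                "receivers": [
                    {
                        "uid": RECEIVER,  # hypothetical uid
                        "type": "webhook",
                        "settings": {"url": NTFY_URL, "httpMethod": "POST"},
                    }
                ],
            }
        ],
        "policies": [
            {
                "orgId": 1,  # assumption: default organisation
                "receiver": RECEIVER,
                "group_by": ["alertname", "service"],
                "group_wait": "1m",
                "group_interval": "12h",
                "repeat_interval": "24h",
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_provisioning(), indent=2))
```

Keeping the timing values in one place makes it easier to compare them against what the Grafana UI shows during verification.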
### 3. Mount Provisioning into Grafana
Add the alerting provisioning ConfigMap to the Grafana deployment, mounted at `/etc/grafana/provisioning/alerting/`.
### 4. Create the `infra-alerts` Topic
ntfy topics are created on first publish — no explicit setup needed. But verify that the topic works by sending a test notification.
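A test publish can be sketched with the standard library alone. This assumes the topic accepts unauthenticated publishes through the Caddy route; if ntfy access control is enabled, an `Authorization` header would also be needed. The function name and message text are illustrative only.

```python
import urllib.request

NTFY_BASE = "https://ntfy.ops.eblu.me"  # Caddy route named in this card
TOPIC = "infra-alerts"


def build_test_publish(message: str, title: str = "Test alert") -> urllib.request.Request:
    """Build (but do not send) a POST that publishes `message` to the topic."""
    return urllib.request.Request(
        url=f"{NTFY_BASE}/{TOPIC}",
        data=message.encode("utf-8"),
        headers={"Title": title},
        method="POST",
    )


if __name__ == "__main__":
    req = build_test_publish("Pipeline smoke test from the alerting card")
    # Sending the request creates the topic if it does not exist yet.
    with urllib.request.urlopen(req, timeout=10) as resp:
        print(resp.status)
```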
### 5. Verify End-to-End
- Grafana UI shows the ntfy contact point under Alerting → Contact Points
- Notification policy shows the anti-noise settings
- Test notification from Grafana reaches the ntfy iOS app
## Key Details
- Grafana runs on minikube (indri), ntfy runs on k3s (ringtail). The contact point URL must go through Caddy: `https://ntfy.ops.eblu.me/infra-alerts`
- ntfy action buttons can be set in two ways: the `X-Actions` header in the short format `view, Open Runbook, <url>`, or an `actions` array when publishing as a JSON body
- Grafana provisioning files are applied on startup and cannot be edited from the UI (which is what we want for GitOps)
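The two ways of attaching a runbook button to an ntfy message can be sketched as follows. The helper names are hypothetical. The short header form shown here does no quoting, so it assumes the label and URL contain no commas.

```python
import json


def action_header(label: str, url: str) -> str:
    """Short form for the X-Actions header: '<action>, <label>, <url>'."""
    return f"view, {label}, {url}"


def action_json_body(topic: str, message: str, label: str, url: str) -> str:
    """JSON publishing form; this body is POSTed to the server root, not the topic URL."""
    return json.dumps(
        {
            "topic": topic,
            "message": message,
            "actions": [{"action": "view", "label": label, "url": url}],
        }
    )


if __name__ == "__main__":
    runbook = "https://docs.eblu.me/how-to/alerts/runbook-service-probe-failure"
    print(action_header("Open Runbook", runbook))
    print(action_json_body("infra-alerts", "Probe failed", "Open Runbook", runbook))
```

Which form fits better depends on what the Grafana message template can emit: a header value or a full request body.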
## Verification
- [ ] Grafana starts with unified alerting enabled
- [ ] Contact point `ntfy-infra` visible in Grafana UI
- [ ] Notification policy shows correct group/repeat intervals
- [ ] Test notification arrives on iOS via ntfy app
- [ ] Test notification includes a clickable runbook link
## Related
- [[deploy-infra-alerting]] — Parent goal
- [[first-alert-and-runbook]] — Next: create the first real alert

@ -0,0 +1,81 @@
---
title: Deploy Infrastructure Alerting Pipeline
modified: 2026-03-22
status: active
branch: mikado/deploy-infra-alerting
requires:
- refactor-services-check-to-query-alerts
tags:
- how-to
- alerting
- observability
---
# Deploy Infrastructure Alerting Pipeline
Replace the manual `mise run services-check` approach with Grafana Unified Alerting backed by ntfy push notifications, so infrastructure problems page once and include actionable runbook links.
## Architecture
```
Prometheus (metrics) ──┐
                       ├──▶ Grafana Alert Rules ──▶ ntfy webhook ──▶ iOS push
Loki (logs) ───────────┘             │
                            Notification Policy
                            (group_wait: 1m,
                             group_interval: 12h,
                             repeat_interval: 24h)
```
## Design Decisions
| Decision | Choice | Rationale |
|----------|--------|-----------|
| **Alert engine** | Grafana Unified Alerting | Already deployed, no new service needed |
| **Notification** | ntfy webhook contact point | Already deployed on ringtail, iOS app works |
| **Anti-noise** | 24h repeat interval | Page once per day max per alert group |
| **Runbooks** | `docs/how-to/alerts/runbook-<name>.md` | Clickable link in every notification |
| **Provisioning** | Grafana provisioning YAML (GitOps) | Alerts defined in repo, not just UI |
| **Topic** | `infra-alerts` (separate from `frigate-alerts`) | Different severity/audience |
## Alerting Policy
- Each alert notifies **once** when it starts firing and does not re-notify for 24 hours while it stays in that state
- A "resolved" notification is sent when the condition clears
- Every alert annotation includes `runbook_url` linking to its how-to doc
- The ntfy message template renders the runbook URL as a clickable action button
- Alerts are grouped by service to avoid notification storms
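To make the anti-noise behaviour concrete, here is a toy model of when a single, continuously firing alert group would notify. It is a simplification, not Grafana's scheduler: it applies only `group_wait` and `repeat_interval`, ignores `group_interval`, and leaves out the "resolved" message.

```python
from datetime import datetime, timedelta

GROUP_WAIT = timedelta(minutes=1)
REPEAT_INTERVAL = timedelta(hours=24)


def notification_times(fired_at: datetime, resolved_at: datetime) -> list[datetime]:
    """Times at which a continuously firing group would notify."""
    times = []
    t = fired_at + GROUP_WAIT  # first notification waits for group_wait
    while t < resolved_at:
        times.append(t)
        t += REPEAT_INTERVAL  # then at most one reminder per repeat_interval
    return times


if __name__ == "__main__":
    start = datetime(2026, 3, 22, 10, 0)
    for when in notification_times(start, start + timedelta(hours=30)):
        print(when.isoformat())
```

Under this model an outage lasting 30 hours produces two notifications, and one that clears within the first minute produces none.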
## Migration Path
1. Stand up the pipeline: Grafana alerting config, ntfy contact point, notification policy, message template
2. Create the first alert + runbook as proof of concept (e.g., a blackbox probe failure)
3. Port services-check health checks to Grafana alert rules, one by one, each with a runbook
4. Refactor services-check to query the Grafana alerting API instead of doing its own probes
## What services-check Covers Today
These checks will be migrated to alert rules:
| Category | Checks | Data Source |
|----------|--------|-------------|
| Local services (indri) | forgejo, alloy, borgmatic, zot via brew/launchctl | Need new probes or textfile metrics |
| Metrics textfiles | freshness of `.prom` files | Existing node_textfile metrics |
| K8s cluster health | minikube API, k3s API | kube-state-metrics |
| HTTP endpoints | ~12 services via Caddy | Alloy blackbox exporter (already exists) |
| Ringtail | SSH, tailscale, k3s health | Need new probes |
| K3s pods | ntfy, authentik, frigate, etc. | kube-state-metrics on ringtail |
| Public services | docs, cv, forge via Fly.io | Alloy on Fly.io or external probe |
| PostgreSQL | CNPG readiness | CNPG metrics (already scraped) |
| ArgoCD sync | app sync/health status | ArgoCD metrics or API |
## Related
- [[configure-grafana-alerting-pipeline]] — Foundation: contact point, policy, template
- [[first-alert-and-runbook]] — Proof of concept alert
- [[port-services-check-alerts]] — Systematic migration
- [[refactor-services-check-to-query-alerts]] — Final integration
- [[observability]] — Current observability stack
- [[ntfy]] — Push notification service
- [[grafana]] — Dashboard and alerting platform

@ -0,0 +1,70 @@
---
title: First Alert and Runbook
modified: 2026-03-22
status: active
requires:
- configure-grafana-alerting-pipeline
tags:
- how-to
- alerting
---
# First Alert and Runbook
Create one end-to-end alert as proof of concept — an alert rule that fires, delivers a notification to ntfy with a runbook link, and has a corresponding runbook doc.
## What to Do
### 1. Choose the First Alert
The best candidate is a **blackbox probe failure** because:
- Alloy's blackbox exporter already probes 5 services (miniflux, kiwix, transmission, devpi, argocd) at 30s intervals
- The metric `probe_success` is already in Prometheus
- It maps directly to what services-check does (HTTP health checks)
- A single alert rule with a `service` label can cover all probed services
### 2. Create the Alert Rule
Provision via YAML in the alerting provisioning ConfigMap. The rule should:
- Query `probe_success == 0` from Prometheus
- Fire after the condition persists for 2 minutes (avoid flapping)
- Include labels: `severity: warning`, `service: {{ $labels.instance }}`
- Include annotations: `summary`, `runbook_url` pointing to the runbook doc
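The effect of the 2-minute pending period can be illustrated with a small model. This is not how Grafana evaluates rules internally (it evaluates on the rule group's own interval). The sketch simply treats each 30-second `probe_success` sample as one evaluation, to show why a brief blip does not fire.

```python
from typing import Optional

SCRAPE_INTERVAL_S = 30  # probe interval stated in this card
FOR_DURATION_S = 120    # "persists for 2 minutes"


def first_firing_index(probe_success: list[int]) -> Optional[int]:
    """Index of the sample at which the alert would move from pending to firing.

    Each entry is one probe_success sample (1 = up, 0 = down), spaced
    SCRAPE_INTERVAL_S apart. Returns None if no failure lasts long enough.
    """
    failing_since = None
    for i, value in enumerate(probe_success):
        if value == 0:
            if failing_since is None:
                failing_since = i  # start of a new failing run
            if (i - failing_since) * SCRAPE_INTERVAL_S >= FOR_DURATION_S:
                return i
        else:
            failing_since = None  # recovery resets the pending timer
    return None


if __name__ == "__main__":
    print(first_firing_index([1, 0, 0, 1, 0, 0, 0, 0, 0]))
    print(first_firing_index([1, 0, 1, 0, 1, 0]))
```

A flapping service that recovers between samples never accumulates the full two minutes, which is the point of the pending period.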
### 3. Create the Runbook
Write `docs/how-to/alerts/runbook-service-probe-failure.md` as a how-to doc explaining:
- What the alert means
- How to check which service is down
- Common causes and resolution steps
- How to silence the alert if the downtime is planned
### 4. Verify End-to-End
- Stop one of the probed services (e.g., scale miniflux to 0)
- Wait for the alert to fire (~2 minutes), then allow up to another minute of `group_wait` before the notification is sent
- Confirm ntfy notification arrives with correct summary and runbook link
- Click the runbook link and verify it reaches docs.eblu.me
- Scale the service back up
- Confirm "resolved" notification arrives
- Confirm no repeat notification during the 24h window
## Key Details
- Grafana alert rules can be provisioned as YAML files alongside contact points and notification policies
- The blackbox probe metrics from Alloy use the job name `blackbox` and include an `instance` label with the service name
- The runbook URL format: `https://docs.eblu.me/how-to/alerts/runbook-service-probe-failure`
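The mapping from a runbook's repo path to its published URL can be written down as a small helper. This is a sketch inferred from the one example path and URL given in this card: it assumes every doc under `docs/` is published at the same relative path without the `.md` suffix. The function name is hypothetical.

```python
from pathlib import PurePosixPath

DOCS_ROOT = PurePosixPath("docs")
DOCS_BASE_URL = "https://docs.eblu.me"


def runbook_url(doc_path: str) -> str:
    """Map a repo path under docs/ to its published URL (no .md suffix)."""
    relative = PurePosixPath(doc_path).relative_to(DOCS_ROOT).with_suffix("")
    return f"{DOCS_BASE_URL}/{relative.as_posix()}"


if __name__ == "__main__":
    print(runbook_url("docs/how-to/alerts/runbook-service-probe-failure.md"))
```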
## Verification
- [ ] Alert rule appears in Grafana UI under Alerting → Alert Rules
- [ ] Simulated failure triggers ntfy notification within ~3 minutes
- [ ] Notification includes service name, summary, and clickable runbook link
- [ ] Resolution triggers a "resolved" notification
- [ ] No repeat notification within 24h window
## Related
- [[configure-grafana-alerting-pipeline]] — Prerequisite: pipeline must be working
- [[deploy-infra-alerting]] — Parent goal
- [[port-services-check-alerts]] — Next: port remaining checks

@ -0,0 +1,77 @@
---
title: Port services-check Alerts to Grafana
modified: 2026-03-22
status: active
requires:
- first-alert-and-runbook
tags:
- how-to
- alerting
---
# Port services-check Alerts to Grafana
Systematically migrate the health checks from `mise run services-check` to Grafana alert rules, each with a corresponding runbook. After this card, the alerting system covers everything services-check does today.
## What to Do
### 1. Inventory and Prioritize
Map each services-check probe to a data source and alert rule. Some checks already have metrics in Prometheus; others need new instrumentation.
**Already have metrics (easy):**
- HTTP endpoint probes → Alloy blackbox exporter (`probe_success`)
- PostgreSQL health → CNPG metrics (`cnpg_pg_replication_streaming`, `cnpg_collector_up`)
- K8s pod health → kube-state-metrics (`kube_pod_status_phase`)
- ArgoCD sync status → ArgoCD metrics (`argocd_app_info` with sync/health labels)
**Need new probes or metrics:**
- Local indri services (forgejo, alloy, borgmatic, zot via brew/launchctl) → Alloy host textfile or new probes
- Metrics textfile freshness → `node_textfile_mtime_seconds` (already collected by Alloy on indri)
- Ringtail SSH/tailscale health → Alloy blackbox on ringtail or cross-cluster probe
- Public services (docs, cv, forge via Fly.io) → Alloy on Fly.io or Grafana synthetic monitoring
### 2. Add Missing Probes
Extend Alloy configurations where needed:
- **Alloy on indri:** Add blackbox targets for forgejo, zot (local HTTP endpoints)
- **Alloy on ringtail:** Add blackbox targets for ringtail-local services
- **Consider:** Whether public endpoint probing belongs in Fly.io Alloy or a separate prober
### 3. Create Alert Rules
For each check category, create provisioned Grafana alert rules. Group related checks into alert rule groups (e.g., "indri-services", "k8s-health", "public-endpoints").
### 4. Create Runbooks
One runbook per alert type in `docs/how-to/alerts/runbook-<name>.md`. Each runbook should cover:
- What the alert means
- Diagnostic steps
- Common fixes
- How to silence for planned maintenance
### 5. Remove from services-check
As each check is ported, remove it from the services-check script (or mark it as "now handled by alerting"). The goal is that services-check shrinks as alerting grows.
## Key Details
- Don't try to port everything in one session — this card may span multiple work cycles within the C2 chain
- Prioritize checks that have caught real problems in the past
- Some checks (like ArgoCD sync status table) may remain in services-check as a human-readable summary even after alerting covers the failure cases
- The Alloy blackbox exporter on k8s already covers 5 services; extending it to more is straightforward
## Verification
- [ ] All HTTP endpoint checks from services-check have corresponding alert rules
- [ ] Pod health checks have corresponding alert rules
- [ ] PostgreSQL health has a corresponding alert rule
- [ ] Each alert rule has a runbook doc in `docs/how-to/alerts/`
- [ ] At least two failure scenarios tested end-to-end
- [ ] services-check script has been updated to reflect ported checks
## Related
- [[first-alert-and-runbook]] — Prerequisite: established the pattern
- [[deploy-infra-alerting]] — Parent goal
- [[refactor-services-check-to-query-alerts]] — Next: make services-check query alerts

@ -0,0 +1,56 @@
---
title: Refactor services-check to Query Alerts
modified: 2026-03-22
status: active
requires:
- port-services-check-alerts
tags:
- how-to
- alerting
---
# Refactor services-check to Query Alerts
Change `mise run services-check` from doing its own health probes to querying the Grafana alerting API for currently firing alerts. The script becomes a CLI view into the same alerting system that sends ntfy notifications.
## What to Do
### 1. Query the Grafana Alerting API
Grafana exposes alert state via:
- `GET /api/v1/provisioning/alert-rules` — all configured rules
- `GET /api/prometheus/grafana/api/v1/alerts` — currently firing alerts (Prometheus-compatible format)
The second endpoint is simpler: it returns only active alerts with their labels and annotations, in the same shape as Prometheus's `/api/v1/alerts`.
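A minimal sketch of querying the second endpoint and filtering the result is shown below. The response shape (`data.alerts[]` with `labels`, `annotations`, and `state`) is assumed from the Prometheus-compatible format. The exact state strings Grafana returns are also an assumption, so the filter accepts anything starting with "alerting" or "firing". The base URL and token in the usage section are placeholders.

```python
import json
import urllib.request

ALERTS_PATH = "/api/prometheus/grafana/api/v1/alerts"


def build_alerts_request(base_url: str, token: str) -> urllib.request.Request:
    """Build the GET request; the token is sent as a Bearer credential."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + ALERTS_PATH,
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


def firing_alerts(payload: dict) -> list[dict]:
    """Keep alerts whose state looks like firing.

    Assumption: the state string starts with 'alerting' or 'firing'
    (case-insensitive); verify against a real response.
    """
    alerts = payload.get("data", {}).get("alerts", [])
    return [
        a for a in alerts
        if str(a.get("state", "")).lower().startswith(("alerting", "firing"))
    ]


if __name__ == "__main__":
    # Placeholder base URL and token; replace with real values.
    req = build_alerts_request("https://grafana.example.invalid", "TOKEN")
    with urllib.request.urlopen(req, timeout=10) as resp:
        print(firing_alerts(json.load(resp)))
```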
### 2. Rewrite services-check
The new services-check should:
1. Query the Grafana alerting API for firing alerts
2. Display them in a table with service name, alert name, duration, and runbook link
3. If no alerts are firing, print a green "all clear" message
4. Exit 0 if no alerts, exit 1 if any are firing
5. Optionally keep a few checks that don't map to alerting (e.g., the ArgoCD sync status table as a summary view)
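The display and exit-code behaviour from the list above could look roughly like this. It is a sketch, not the existing script: it expects alert dicts with `labels`, `annotations`, and an `activeAt` timestamp (an assumed field from the Prometheus-compatible format), and it shows that timestamp in place of a computed duration.

```python
import sys


def render_report(alerts: list[dict]) -> tuple[str, int]:
    """Return (text, exit_code) for a list of firing alerts."""
    if not alerts:
        # ANSI green for the "all clear" case
        return "\033[32mAll clear: no alerts firing\033[0m", 0
    rows = [("SERVICE", "ALERT", "SINCE", "RUNBOOK")]
    for a in alerts:
        labels = a.get("labels", {})
        annotations = a.get("annotations", {})
        rows.append((
            str(labels.get("service", "-")),
            str(labels.get("alertname", "-")),
            str(a.get("activeAt", "-")),
            str(annotations.get("runbook_url", "-")),
        ))
    # Pad each column to its widest cell so the table lines up
    widths = [max(len(row[i]) for row in rows) for i in range(4)]
    lines = [
        "  ".join(cell.ljust(w) for cell, w in zip(row, widths)).rstrip()
        for row in rows
    ]
    return "\n".join(lines), 1


if __name__ == "__main__":
    text, code = render_report([])
    print(text)
    sys.exit(code)
```

Returning the text and exit code instead of printing and exiting inside the function keeps the logic easy to test without a live Grafana.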
### 3. Handle Authentication
services-check will need a Grafana API token or service account token. Options:
- Use the existing Grafana admin credentials from 1Password (`op read`)
- Create a dedicated read-only service account in Grafana
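One possible shape for token handling is sketched below, assuming the dedicated service account option is chosen. The `GRAFANA_TOKEN` environment variable and the `op://` secret reference are hypothetical names, since the real vault, item, and field are not given in this card. Only the `op read <reference>` invocation comes from the text above.

```python
import os
import subprocess

# Hypothetical secret reference; replace with the real vault/item/field.
OP_REFERENCE = "op://Example Vault/grafana-services-check/token"


def load_grafana_token() -> str:
    """Prefer GRAFANA_TOKEN from the environment, else read it with `op read`."""
    token = os.environ.get("GRAFANA_TOKEN")  # hypothetical variable name
    if token:
        return token.strip()
    result = subprocess.run(
        ["op", "read", OP_REFERENCE],
        check=True,          # raise if the 1Password CLI fails
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    print("token loaded" if load_grafana_token() else "no token")
```

The environment override is there so the script can run in contexts where the 1Password CLI is unavailable.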
### 4. Preserve the ArgoCD Summary
The ArgoCD sync/health table in services-check is a useful quick view even when nothing is alerting. Consider keeping it as a separate section that always displays, independent of the alert query.
## Verification
- [ ] `mise run services-check` queries Grafana instead of doing direct probes
- [ ] Firing alerts are displayed with service name, alert name, and runbook link
- [ ] Exit code reflects alert state (0 = clear, 1 = firing)
- [ ] Fails gracefully when Grafana is unreachable (clear error message, not a crash)
- [ ] ArgoCD summary table still works
## Related
- [[port-services-check-alerts]] — Prerequisite: alerts must exist to query
- [[deploy-infra-alerting]] — Parent goal