C2(deploy-infra-alerting): plan add alerting pipeline cards
Mikado chain for deploying Grafana Unified Alerting with ntfy
notifications, replacing manual services-check probes.
Chain: configure-grafana-alerting-pipeline
→ first-alert-and-runbook
→ port-services-check-alerts
→ refactor-services-check-to-query-alerts
→ deploy-infra-alerting (goal)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
parent f1620abb17
commit 1d5990a2f7
5 changed files with 344 additions and 0 deletions
docs/how-to/alerts/configure-grafana-alerting-pipeline.md
@@ -0,0 +1,60 @@
---
title: Configure Grafana Alerting Pipeline
modified: 2026-03-22
status: active
tags:
- how-to
- alerting
- grafana
---

# Configure Grafana Alerting Pipeline

Enable Grafana Unified Alerting, create an ntfy webhook contact point, configure the notification policy with anti-noise settings, and set up a message template with runbook links.

## What to Do

### 1. Enable Unified Alerting in grafana.ini

Add the `[unified_alerting]` section to the Grafana ConfigMap. Grafana 11+ has unified alerting enabled by default, but set it explicitly anyway and configure the evaluation interval.

### 2. Create Alerting Provisioning Files

Grafana supports provisioning alert resources via YAML files in `/etc/grafana/provisioning/alerting/`. Create:

- **Contact point**: ntfy webhook targeting `https://ntfy.ops.eblu.me/infra-alerts` via Caddy. The cluster-internal address `http://ntfy.ntfy.svc.cluster.local:80/infra-alerts` is not usable here because Grafana and ntfy run on different clusters.
- **Notification policy**: root policy with `group_wait: 1m`, `group_interval: 12h`, `repeat_interval: 24h`, grouped by `alertname` and `service`
- **Message template**: format that includes alert name, summary, and a clickable runbook URL as an ntfy action button
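
The contact point and root policy listed above could be sketched as one provisioning document. This is a minimal Python sketch that builds the structure and prints it as JSON. The field names follow my reading of Grafana's file-provisioning format and should be checked against the docs for the deployed version. The `uid` value and `orgId: 1` are assumptions, not values from this card.

```python
import json

# URL taken from this card; ntfy is reached through Caddy.
NTFY_URL = "https://ntfy.ops.eblu.me/infra-alerts"


def build_alerting_provisioning(contact_name: str = "ntfy-infra") -> dict:
    """Return one provisioning document holding a contact point and the root policy."""
    return {
        "apiVersion": 1,
        "contactPoints": [
            {
                "orgId": 1,  # assumption: default organisation
                "name": contact_name,
                "receivers": [
                    {
                        "uid": "ntfy-infra-webhook",  # hypothetical uid
                        "type": "webhook",
                        "settings": {"url": NTFY_URL},
                    }
                ],
            }
        ],
        "policies": [
            {
                "orgId": 1,  # assumption: default organisation
                "receiver": contact_name,
                "group_by": ["alertname", "service"],
                "group_wait": "1m",
                "group_interval": "12h",
                "repeat_interval": "24h",
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_alerting_provisioning(), indent=2))
```

Building the policy from the same `contact_name` keeps the root policy's `receiver` and the contact point name from drifting apart.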

### 3. Mount Provisioning into Grafana

Add the alerting provisioning ConfigMap to the Grafana deployment, mounted at `/etc/grafana/provisioning/alerting/`.

### 4. Create the `infra-alerts` Topic

ntfy topics are created on first publish, so no explicit setup is needed. Still, verify that the topic works by sending a test notification.
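
A test publish could look roughly like the sketch below. It builds the HTTP request with Python's standard library and only sends it when `send()` is called. The message text, title, and the runbook URL passed in are placeholders, and whether the topic needs authentication is not stated in this card, so none is included.

```python
import urllib.request


def build_test_notification(base_url: str, topic: str, runbook_url: str) -> urllib.request.Request:
    """Build (but do not send) a POST that publishes a test message to an ntfy topic."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{topic}",
        data=b"Test notification from the alerting pipeline",  # placeholder text
        method="POST",
        headers={
            "X-Title": "infra-alerts test",  # placeholder title
            # ntfy short action format: view, <label>, <url>
            "X-Actions": f"view, Open Runbook, {runbook_url}",
        },
    )


def send(request: urllib.request.Request, timeout: float = 10.0) -> int:
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Usage (performs a real publish, so it is left commented out):
# send(build_test_notification(
#     "https://ntfy.ops.eblu.me", "infra-alerts",
#     "https://docs.eblu.me/how-to/alerts/runbook-service-probe-failure"))
```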

### 5. Verify End-to-End

- Grafana UI shows the ntfy contact point under Alerting → Contact Points
- Notification policy shows the anti-noise settings
- Test notification from Grafana reaches the ntfy iOS app

## Key Details

- Grafana runs on minikube (indri), ntfy runs on k3s (ringtail). The contact point URL must go through Caddy: `https://ntfy.ops.eblu.me/infra-alerts`
- ntfy action buttons are set via the `X-Actions` header (short format: `view, Open Runbook, <url>`) or the `actions` array of a JSON body
- Grafana provisioning files are applied on startup and cannot be edited from the UI (which is what we want for GitOps)

## Verification

- [ ] Grafana starts with unified alerting enabled
- [ ] Contact point `ntfy-infra` visible in Grafana UI
- [ ] Notification policy shows correct group/repeat intervals
- [ ] Test notification arrives on iOS via ntfy app
- [ ] Test notification includes a clickable runbook link

## Related

- [[deploy-infra-alerting]]: Parent goal
- [[first-alert-and-runbook]]: Next: create the first real alert
docs/how-to/alerts/deploy-infra-alerting.md
@@ -0,0 +1,81 @@
---
title: Deploy Infrastructure Alerting Pipeline
modified: 2026-03-22
status: active
branch: mikado/deploy-infra-alerting
requires:
- refactor-services-check-to-query-alerts
tags:
- how-to
- alerting
- observability
---

# Deploy Infrastructure Alerting Pipeline

Replace the manual `mise run services-check` approach with Grafana Unified Alerting backed by ntfy push notifications, so infrastructure problems page once and include actionable runbook links.

## Architecture

```
Prometheus (metrics) ──┐
                       ├──▶ Grafana Alert Rules ──▶ ntfy webhook ──▶ iOS push
Loki (logs) ───────────┘             │
                                     │
                            Notification Policy
                            (group_wait: 1m,
                             group_interval: 12h,
                             repeat_interval: 24h)
```

## Design Decisions

| Decision | Choice | Rationale |
|----------|--------|-----------|
| **Alert engine** | Grafana Unified Alerting | Already deployed, no new service needed |
| **Notification** | ntfy webhook contact point | Already deployed on ringtail, iOS app works |
| **Anti-noise** | 24h repeat interval | Page once per day max per alert group |
| **Runbooks** | `docs/how-to/alerts/<name>.md` | Clickable link in every notification |
| **Provisioning** | Grafana provisioning YAML (GitOps) | Alerts defined in repo, not just UI |
| **Topic** | `infra-alerts` (separate from `frigate-alerts`) | Different severity/audience |

## Alerting Policy

- Each alert fires **once** and does not re-notify for 24 hours
- A "resolved" notification is sent when the condition clears
- Every alert annotation includes `runbook_url` linking to its how-to doc
- The ntfy message template renders the runbook URL as a clickable action button
- Alerts are grouped by service to avoid notification storms
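
To illustrate why grouping limits notification storms, here is a small Python sketch of the bucketing idea: alerts that share the same `alertname` and `service` labels fall into one group, and each group produces one notification. This is a simplified model, not Grafana's implementation, and the alert names in the usage example are made up.

```python
from collections import defaultdict


def group_alerts(alerts, group_by=("alertname", "service")):
    """Bucket alerts by the values of the group_by labels (missing labels become "")."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical alerts: two instances of the same alert for one service, plus one other.
    sample = [
        {"labels": {"alertname": "ServiceProbeFailed", "service": "miniflux", "instance": "a"}},
        {"labels": {"alertname": "ServiceProbeFailed", "service": "miniflux", "instance": "b"}},
        {"labels": {"alertname": "ServiceProbeFailed", "service": "kiwix", "instance": "c"}},
    ]
    for key, members in group_alerts(sample).items():
        print(key, len(members))
```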

## Migration Path

1. Stand up the pipeline: Grafana alerting config, ntfy contact point, notification policy, message template
2. Create the first alert + runbook as proof of concept (e.g., a blackbox probe failure)
3. Port services-check health checks to Grafana alert rules, one by one, each with a runbook
4. Refactor services-check to query the Grafana alerting API instead of doing its own probes

## What services-check Covers Today

These checks will be migrated to alert rules:

| Category | Checks | Data Source |
|----------|--------|-------------|
| Local services (indri) | forgejo, alloy, borgmatic, zot via brew/launchctl | Need new probes or textfile metrics |
| Metrics textfiles | freshness of `.prom` files | Existing node_textfile metrics |
| K8s cluster health | minikube API, k3s API | kube-state-metrics |
| HTTP endpoints | ~12 services via Caddy | Alloy blackbox exporter (already exists) |
| Ringtail | SSH, tailscale, k3s health | Need new probes |
| K3s pods | ntfy, authentik, frigate, etc. | kube-state-metrics on ringtail |
| Public services | docs, cv, forge via Fly.io | Alloy on Fly.io or external probe |
| PostgreSQL | CNPG readiness | CNPG metrics (already scraped) |
| ArgoCD sync | app sync/health status | ArgoCD metrics or API |

## Related

- [[configure-grafana-alerting-pipeline]]: Foundation: contact point, policy, template
- [[first-alert-and-runbook]]: Proof of concept alert
- [[port-services-check-alerts]]: Systematic migration
- [[refactor-services-check-to-query-alerts]]: Final integration
- [[observability]]: Current observability stack
- [[ntfy]]: Push notification service
- [[grafana]]: Dashboard and alerting platform
docs/how-to/alerts/first-alert-and-runbook.md
@@ -0,0 +1,70 @@
---
title: First Alert and Runbook
modified: 2026-03-22
status: active
requires:
- configure-grafana-alerting-pipeline
tags:
- how-to
- alerting
---

# First Alert and Runbook

Create one end-to-end alert as proof of concept: an alert rule that fires, delivers a notification to ntfy with a runbook link, and has a corresponding runbook doc.

## What to Do

### 1. Choose the First Alert

The best candidate is a **blackbox probe failure** because:
- Alloy's blackbox exporter already probes 5 services (miniflux, kiwix, transmission, devpi, argocd) at 30s intervals
- The metric `probe_success` is already in Prometheus
- It maps directly to what services-check does (HTTP health checks)
- A single alert rule with a `service` label can cover all probed services

### 2. Create the Alert Rule

Provision via YAML in the alerting provisioning ConfigMap. The rule should:
- Query `probe_success == 0` from Prometheus
- Fire after the condition persists for 2 minutes (to avoid flapping)
- Include labels: `severity: warning`, `service: {{ $labels.instance }}`
- Include annotations: `summary`, `runbook_url` pointing to the runbook doc
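
The "persists for 2 minutes" condition can be illustrated with a small simulation. The sketch below is a simplified model of a pending period applied to `probe_success` samples. It is not how Grafana evaluates rules internally (Grafana works on its own evaluation interval), and the function name and sample data are made up for illustration.

```python
def first_firing_time(samples, pending_seconds=120):
    """Return the timestamp at which the alert would fire, or None.

    samples: (timestamp_seconds, probe_success) pairs sorted by time.
    The alert fires once probe_success has been 0 continuously for
    at least pending_seconds; any successful probe resets the clock.
    """
    failing_since = None
    for timestamp, value in samples:
        if value == 0:
            if failing_since is None:
                failing_since = timestamp
            if timestamp - failing_since >= pending_seconds:
                return timestamp
        else:
            failing_since = None
    return None


if __name__ == "__main__":
    # Probes every 30s: a short blip at 30-60s, then a sustained failure from 120s.
    series = [(0, 1), (30, 0), (60, 0), (90, 1), (120, 0),
              (150, 0), (180, 0), (210, 0), (240, 0)]
    print(first_firing_time(series))
```

In this series the blip at 30-60s never reaches the 2-minute threshold, so only the sustained failure starting at 120s leads to a firing alert.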

### 3. Create the Runbook

Write `docs/how-to/alerts/runbook-service-probe-failure.md` as a how-to doc explaining:
- What the alert means
- How to check which service is down
- Common causes and resolution steps
- How to silence the alert if the downtime is planned

### 4. Verify End-to-End

- Stop one of the probed services (e.g., scale miniflux to 0)
- Wait for the alert to fire (~2 minutes)
- Confirm ntfy notification arrives with correct summary and runbook link
- Click the runbook link and verify it reaches docs.eblu.me
- Scale the service back up
- Confirm "resolved" notification arrives
- Confirm no repeat notification during the 24h window

## Key Details

- Grafana alert rules can be provisioned as YAML files alongside contact points and notification policies
- The blackbox probe metrics from Alloy use the job name `blackbox` and include an `instance` label with the service name
- The runbook URL format: `https://docs.eblu.me/how-to/alerts/runbook-service-probe-failure`
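
The runbook URL appears to be the repo path with the `docs/` prefix and `.md` suffix removed. That mapping is inferred from the single example in this card, so treat the helper below as a sketch under that assumption. The function name is hypothetical.

```python
from pathlib import PurePosixPath

DOCS_BASE_URL = "https://docs.eblu.me"


def runbook_url(doc_path: str) -> str:
    """Map a repo path such as docs/how-to/alerts/<name>.md to its published URL.

    Assumes the docs site mirrors the repo layout below docs/ without the .md suffix.
    """
    relative = PurePosixPath(doc_path).relative_to("docs").with_suffix("")
    return f"{DOCS_BASE_URL}/{relative.as_posix()}"


if __name__ == "__main__":
    print(runbook_url("docs/how-to/alerts/runbook-service-probe-failure.md"))
```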

## Verification

- [ ] Alert rule appears in Grafana UI under Alerting → Alert Rules
- [ ] Simulated failure triggers ntfy notification within ~3 minutes
- [ ] Notification includes service name, summary, and clickable runbook link
- [ ] Resolution triggers a "resolved" notification
- [ ] No repeat notification within 24h window

## Related

- [[configure-grafana-alerting-pipeline]]: Prerequisite: pipeline must be working
- [[deploy-infra-alerting]]: Parent goal
- [[port-services-check-alerts]]: Next: port remaining checks
docs/how-to/alerts/port-services-check-alerts.md
@@ -0,0 +1,77 @@
---
title: Port services-check Alerts to Grafana
modified: 2026-03-22
status: active
requires:
- first-alert-and-runbook
tags:
- how-to
- alerting
---

# Port services-check Alerts to Grafana

Systematically migrate the health checks from `mise run services-check` to Grafana alert rules, each with a corresponding runbook. After this card, the alerting system covers everything services-check does today.

## What to Do

### 1. Inventory and Prioritize

Map each services-check probe to a data source and alert rule. Some checks already have metrics in Prometheus; others need new instrumentation.

**Already have metrics (easy):**
- HTTP endpoint probes → Alloy blackbox exporter (`probe_success`)
- PostgreSQL health → CNPG metrics (`cnpg_pg_replication_streaming`, `cnpg_collector_up`)
- K8s pod health → kube-state-metrics (`kube_pod_status_phase`)
- ArgoCD sync status → ArgoCD metrics (`argocd_app_info` with sync/health labels)
- Metrics textfile freshness → `node_textfile_mtime_seconds` (already collected by Alloy on indri)

**Need new probes or metrics:**
- Local indri services (forgejo, alloy, borgmatic, zot via brew/launchctl) → Alloy host textfile or new probes
- Ringtail SSH/tailscale health → Alloy blackbox on ringtail or cross-cluster probe
- Public services (docs, cv, forge via Fly.io) → Alloy on Fly.io or Grafana synthetic monitoring
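
For the textfile freshness check in the inventory above, the alert condition boils down to "current time minus file mtime exceeds a threshold". The Python sketch below shows that comparison on plain numbers. The one-hour threshold and the file names are assumptions for illustration, and a real rule would express the same comparison as a query against `node_textfile_mtime_seconds`.

```python
def stale_textfiles(mtimes, now, max_age_seconds=3600):
    """Return the sorted names of textfiles whose age exceeds max_age_seconds.

    mtimes: mapping of file name to modification time (Unix seconds).
    now: current time (Unix seconds).
    """
    return sorted(
        name for name, mtime in mtimes.items() if now - mtime > max_age_seconds
    )


if __name__ == "__main__":
    # Hypothetical file names and timestamps.
    observed = {"borgmatic.prom": 1000, "zot.prom": 4000}
    print(stale_textfiles(observed, now=5000))
```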

### 2. Add Missing Probes

Extend Alloy configurations where needed:
- **Alloy on indri:** Add blackbox targets for forgejo, zot (local HTTP endpoints)
- **Alloy on ringtail:** Add blackbox targets for ringtail-local services
- **Consider:** Whether public endpoint probing belongs in Fly.io Alloy or a separate prober

### 3. Create Alert Rules

For each check category, create provisioned Grafana alert rules. Group related checks into alert rule groups (e.g., "indri-services", "k8s-health", "public-endpoints").

### 4. Create Runbooks

One runbook per alert type in `docs/how-to/alerts/runbook-<name>.md`. Each runbook should cover:
- What the alert means
- Diagnostic steps
- Common fixes
- How to silence for planned maintenance

### 5. Remove from services-check

As each check is ported, remove it from the services-check script (or mark it as "now handled by alerting"). The goal is that services-check shrinks as alerting grows.

## Key Details

- Don't try to port everything in one session; this card may span multiple work cycles within the C2 chain
- Prioritize checks that have caught real problems in the past
- Some checks (like ArgoCD sync status table) may remain in services-check as a human-readable summary even after alerting covers the failure cases
- The Alloy blackbox exporter on k8s already covers 5 services; extending it to more is straightforward

## Verification

- [ ] All HTTP endpoint checks from services-check have corresponding alert rules
- [ ] Pod health checks have corresponding alert rules
- [ ] PostgreSQL health has a corresponding alert rule
- [ ] Each alert rule has a runbook doc in `docs/how-to/alerts/`
- [ ] Test at least 2-3 failure scenarios end-to-end
- [ ] services-check script has been updated to reflect ported checks

## Related

- [[first-alert-and-runbook]]: Prerequisite: established the pattern
- [[deploy-infra-alerting]]: Parent goal
- [[refactor-services-check-to-query-alerts]]: Next: make services-check query alerts
@@ -0,0 +1,56 @@
---
title: Refactor services-check to Query Alerts
modified: 2026-03-22
status: active
requires:
- port-services-check-alerts
tags:
- how-to
- alerting
---

# Refactor services-check to Query Alerts

Change `mise run services-check` from doing its own health probes to querying the Grafana alerting API for currently firing alerts. The script becomes a CLI view into the same alerting system that sends ntfy notifications.

## What to Do

### 1. Query the Grafana Alerting API

Grafana exposes alert state via:
- `GET /api/v1/provisioning/alert-rules`: all configured rules
- `GET /api/prometheus/grafana/api/v1/alerts`: currently active alerts (Prometheus-compatible format)

The second endpoint is simpler: it returns only active alerts with labels and annotations, similar to Prometheus's `/api/v1/alerts`.

### 2. Rewrite services-check

The new services-check should:
1. Query the Grafana alerting API for firing alerts
2. Display them in a table with service name, alert name, duration, and runbook link
3. If no alerts are firing, print a green "all clear" message
4. Exit 0 if no alerts, exit 1 if any are firing
5. Optionally keep a few checks that don't map to alerting (e.g., the ArgoCD sync status table as a summary view)
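
The display and exit-code logic in the list above could be sketched as follows. The payload shape (`data.alerts[]` with `labels`, `annotations`, `state`, `activeAt`) assumes the Prometheus-compatible format. The exact state strings Grafana returns should be verified against a live instance, so the sketch accepts both `firing` and `alerting` as an assumption. All function names are hypothetical.

```python
import re
from datetime import datetime, timezone

FIRING_STATES = {"firing", "alerting"}  # assumption: verify against a real response


def firing_alerts(payload: dict) -> list:
    """Pick the alerts whose state counts as firing from an alerts payload."""
    alerts = payload.get("data", {}).get("alerts", [])
    return [a for a in alerts if str(a.get("state", "")).lower() in FIRING_STATES]


def format_duration(active_at: str, now: datetime) -> str:
    """Render the time since active_at (an RFC 3339 timestamp) as e.g. 2h30m."""
    cleaned = re.sub(r"\.\d+", "", active_at).replace("Z", "+00:00")
    started = datetime.fromisoformat(cleaned)
    minutes = int((now - started).total_seconds() // 60)
    return f"{minutes // 60}h{minutes % 60:02d}m"


def render(alerts: list, now: datetime) -> tuple:
    """Return (text, exit_code): a table and 1 if alerts fire, else a green all-clear and 0."""
    if not alerts:
        return "\033[32mAll clear: no alerts firing\033[0m", 0
    rows = [("SERVICE", "ALERT", "DURATION", "RUNBOOK")]
    for alert in alerts:
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        rows.append((
            labels.get("service", "-"),
            labels.get("alertname", "-"),
            format_duration(alert["activeAt"], now),
            annotations.get("runbook_url", "-"),
        ))
    widths = [max(len(row[i]) for row in rows) for i in range(4)]
    lines = ["  ".join(c.ljust(w) for c, w in zip(row, widths)).rstrip() for row in rows]
    return "\n".join(lines), 1


# Usage inside the script (payload comes from the alerts endpoint):
# text, code = render(firing_alerts(payload), datetime.now(timezone.utc))
# print(text); raise SystemExit(code)
```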

### 3. Handle Authentication

services-check will need a Grafana API token or service account token. Options:
- Use the existing Grafana admin credentials from 1Password (`op read`)
- Create a dedicated read-only service account in Grafana
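
With a service account token, the query could be sketched like this. The token is sent as a bearer token in the `Authorization` header. How the token is obtained (for example via `op read`) is left out, and the base URL in the usage comment is a placeholder. The function returns `None` instead of raising when Grafana cannot be reached, so the caller can print an error and exit cleanly.

```python
import json
import urllib.request

ALERTS_PATH = "/api/prometheus/grafana/api/v1/alerts"


def build_alerts_request(base_url: str, token: str) -> urllib.request.Request:
    """Build a GET request for the alerts endpoint with a bearer token."""
    return urllib.request.Request(
        f"{base_url.rstrip('/')}{ALERTS_PATH}",
        headers={"Authorization": f"Bearer {token}", "Accept": "application/json"},
    )


def fetch_alerts(base_url: str, token: str, timeout: float = 10.0):
    """Return the decoded payload, or None if Grafana is unreachable or replies with non-JSON."""
    try:
        request = build_alerts_request(base_url, token)
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return json.load(response)
    except (OSError, ValueError) as exc:
        # URLError, HTTPError and socket timeouts are OSError subclasses;
        # JSON decoding problems are ValueError subclasses.
        print(f"services-check: could not query Grafana: {exc}")
        return None


# Usage (hypothetical base URL and environment variable):
# import os
# payload = fetch_alerts("https://grafana.example.invalid", os.environ["GRAFANA_TOKEN"])
```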

### 4. Preserve the ArgoCD Summary

The ArgoCD sync/health table in services-check is a useful quick view even when nothing is alerting. Consider keeping it as a separate section that always displays, independent of the alert query.

## Verification

- [ ] `mise run services-check` queries Grafana instead of doing direct probes
- [ ] Firing alerts are displayed with service name, alert name, and runbook link
- [ ] Exit code reflects alert state (0 = clear, 1 = firing)
- [ ] Works when Grafana is unreachable (graceful error, not a crash)
- [ ] ArgoCD summary table still works

## Related

- [[port-services-check-alerts]]: Prerequisite: alerts must exist to query
- [[deploy-infra-alerting]]: Parent goal