C2(deploy-infra-alerting): finalize rewrite cards as historical docs

Remove all Mikado frontmatter (status, branch, requires) from chain
cards. Rename docs/how-to/alerts/ to docs/how-to/runbooks/ and update
all runbook_url references. Add changelog fragment.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
Erich Blume 2026-03-22 14:24:47 -07:00
commit 67883950c3
13 changed files with 12 additions and 22 deletions

View file

@ -40,7 +40,7 @@ groups:
annotations:
summary: >-
{{ index $labels "service" }} health check is failing
runbook_url: https://docs.eblu.me/how-to/alerts/runbook-service-probe-failure
runbook_url: https://docs.eblu.me/how-to/runbooks/runbook-service-probe-failure
labels:
severity: warning
data:
@ -98,7 +98,7 @@ groups:
annotations:
summary: >-
Metrics textfile {{ index $labels "file" }} has not been updated in over 1 hour
runbook_url: https://docs.eblu.me/how-to/alerts/runbook-textfile-stale
runbook_url: https://docs.eblu.me/how-to/runbooks/runbook-textfile-stale
labels:
severity: warning
service: indri-metrics
@ -156,7 +156,7 @@ groups:
annotations:
summary: >-
Frigate camera {{ index $labels "camera_name" }} has 0 FPS
runbook_url: https://docs.eblu.me/how-to/alerts/runbook-frigate-camera-down
runbook_url: https://docs.eblu.me/how-to/runbooks/runbook-frigate-camera-down
labels:
severity: warning
service: frigate
@ -213,7 +213,7 @@ groups:
annotations:
summary: >-
PostgreSQL cluster {{ index $labels "cluster" }} is unhealthy
runbook_url: https://docs.eblu.me/how-to/alerts/runbook-postgres-unhealthy
runbook_url: https://docs.eblu.me/how-to/runbooks/runbook-postgres-unhealthy
labels:
severity: critical
service: postgresql
@ -270,7 +270,7 @@ groups:
annotations:
summary: >-
Pod {{ index $labels "pod" }} in {{ index $labels "namespace" }} is not ready
runbook_url: https://docs.eblu.me/how-to/alerts/runbook-pod-not-ready
runbook_url: https://docs.eblu.me/how-to/runbooks/runbook-pod-not-ready
labels:
severity: warning
data:
@ -329,7 +329,7 @@ groups:
annotations:
summary: >-
ArgoCD app {{ index $labels "name" }} is {{ index $labels "sync_status" }}
runbook_url: https://docs.eblu.me/how-to/alerts/runbook-argocd-out-of-sync
runbook_url: https://docs.eblu.me/how-to/runbooks/runbook-argocd-out-of-sync
labels:
severity: warning
service: argocd

View file

@ -0,0 +1 @@
Deploy infrastructure alerting pipeline using Grafana Unified Alerting with ntfy push notifications. 7 alert rules with runbooks covering service health, pod readiness, PostgreSQL, textfile freshness, Frigate cameras, and ArgoCD sync status. services-check now queries the alerting API for covered checks.

View file

@ -1,10 +1,6 @@
---
title: Deploy Infrastructure Alerting Pipeline
modified: 2026-03-22
status: active
branch: mikado/deploy-infra-alerting
requires:
- refactor-services-check-to-query-alerts
tags:
- how-to
- alerting
@ -35,7 +31,7 @@ Loki (logs) ──────────┘ │
| **Alert engine** | Grafana Unified Alerting | Already deployed, no new service needed |
| **Notification** | ntfy webhook contact point | Already deployed on ringtail, iOS app works |
| **Anti-noise** | 24h repeat interval | Page once per day max per alert group |
| **Runbooks** | `docs/how-to/alerts/<name>.md` | Clickable link in every notification |
| **Runbooks** | `docs/how-to/runbooks/<name>.md` | Clickable link in every notification |
| **Provisioning** | Grafana provisioning YAML (GitOps) | Alerts defined in repo, not just UI |
| **Topic** | `infra-alerts` (separate from `frigate-alerts`) | Different severity/audience |

View file

@ -1,8 +1,6 @@
---
title: First Alert and Runbook
modified: 2026-03-22
requires:
- configure-grafana-alerting-pipeline
tags:
- how-to
- alerting
@ -32,7 +30,7 @@ Provision via YAML in the alerting provisioning ConfigMap. The rule should:
### 3. Create the Runbook
Write `docs/how-to/alerts/runbook-service-probe-failure.md` as a how-to doc explaining:
Write `docs/how-to/runbooks/runbook-service-probe-failure.md` as a how-to doc explaining:
- What the alert means
- How to check which service is down
- Common causes and resolution steps
@ -52,7 +50,7 @@ Write `docs/how-to/alerts/runbook-service-probe-failure.md` as a how-to doc expl
- Grafana alert rules can be provisioned as YAML files alongside contact points and notification policies
- The blackbox probe metrics from Alloy use the job name `blackbox` and include an `instance` label with the service name
- The runbook URL format: `https://docs.eblu.me/how-to/alerts/runbook-service-probe-failure`
- The runbook URL format: `https://docs.eblu.me/how-to/runbooks/runbook-service-probe-failure`
## Verification

View file

@ -1,8 +1,6 @@
---
title: Port services-check Alerts to Grafana
modified: 2026-03-22
requires:
- first-alert-and-runbook
tags:
- how-to
- alerting
@ -43,7 +41,7 @@ For each check category, create provisioned Grafana alert rules. Group related c
### 4. Create Runbooks
One runbook per alert type in `docs/how-to/alerts/runbook-<name>.md`. Each runbook should cover:
One runbook per alert type in `docs/how-to/runbooks/runbook-<name>.md`. Each runbook should cover:
- What the alert means
- Diagnostic steps
- Common fixes
@ -65,7 +63,7 @@ As each check is ported, remove it from the services-check script (or mark it as
- [ ] All HTTP endpoint checks from services-check have corresponding alert rules
- [ ] Pod health checks have corresponding alert rules
- [ ] PostgreSQL health has a corresponding alert rule
- [ ] Each alert rule has a runbook doc in `docs/how-to/alerts/`
- [ ] Each alert rule has a runbook doc in `docs/how-to/runbooks/`
- [ ] Test at least 2-3 failure scenarios end-to-end
- [ ] services-check script has been updated to reflect ported checks

View file

@ -1,9 +1,6 @@
---
title: Refactor services-check to Query Alerts
modified: 2026-03-22
status: active
requires:
- port-services-check-alerts
tags:
- how-to
- alerting