From 1d5990a2f704370351b7dcb7e8e188a398be782a Mon Sep 17 00:00:00 2001
From: Erich Blume <blume.erich@gmail.com>
Date: Sun, 22 Mar 2026 10:28:31 -0700
Subject: [PATCH 01/18] C2(deploy-infra-alerting): plan add alerting pipeline
 cards
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Mikado chain for deploying Grafana Unified Alerting with ntfy
notifications, replacing manual services-check probes.

Chain: configure-grafana-alerting-pipeline
     → first-alert-and-runbook
     → port-services-check-alerts
     → refactor-services-check-to-query-alerts
     → deploy-infra-alerting (goal)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---
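Reviewer note: the final card in this chain has services-check read
firing alerts from Grafana instead of probing services itself. The
sketch below shows one way the filtering, table rendering, and exit
code could fit together. It works on a canned response body so it runs
without network access. The response shape, the state spellings, and
the helper names (firing_alerts, render) are assumptions for
illustration, not taken from the repo or confirmed against Grafana.

```python
import json

# Hypothetical sample shaped like a Prometheus-compatible alerts response.
# The exact fields and state strings Grafana returns are an assumption.
SAMPLE = json.dumps({
    "status": "success",
    "data": {"alerts": [
        {
            "labels": {"alertname": "ServiceProbeFailure", "service": "miniflux"},
            "annotations": {
                "runbook_url": "https://docs.eblu.me/how-to/alerts/runbook-service-probe-failure"
            },
            "state": "firing",
        },
        {
            "labels": {"alertname": "ServiceProbeFailure", "service": "kiwix"},
            "annotations": {},
            "state": "pending",
        },
    ]},
})

# Assumed spellings of the "firing" state, compared case-insensitively.
FIRING_STATES = {"firing", "alerting"}


def firing_alerts(body: str) -> list[dict]:
    """Return only the alerts whose state counts as firing."""
    alerts = json.loads(body).get("data", {}).get("alerts", [])
    return [a for a in alerts if a.get("state", "").lower() in FIRING_STATES]


def render(alerts: list[dict]) -> tuple[str, int]:
    """Build the text to print and the exit code (0 = clear, 1 = firing)."""
    if not alerts:
        return "all clear", 0
    rows = []
    for alert in alerts:
        labels = alert.get("labels", {})
        runbook = alert.get("annotations", {}).get("runbook_url", "-")
        rows.append(
            f"{labels.get('service', '?'):<12} {labels.get('alertname', '?'):<22} {runbook}"
        )
    return "\n".join(rows), 1


if __name__ == "__main__":
    text, code = render(firing_alerts(SAMPLE))
    print(text)
    raise SystemExit(code)
```

A real script would fetch the body over HTTP with a read-only token and
treat a connection failure as its own exit path rather than as "all
clear".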
 .../configure-grafana-alerting-pipeline.md    | 60 ++++++++++++++
 docs/how-to/alerts/deploy-infra-alerting.md   | 81 +++++++++++++++++++
 docs/how-to/alerts/first-alert-and-runbook.md | 70 ++++++++++++++++
 .../alerts/port-services-check-alerts.md      | 77 ++++++++++++++++++
 ...refactor-services-check-to-query-alerts.md | 56 +++++++++++++
 5 files changed, 344 insertions(+)
 create mode 100644 docs/how-to/alerts/configure-grafana-alerting-pipeline.md
 create mode 100644 docs/how-to/alerts/deploy-infra-alerting.md
 create mode 100644 docs/how-to/alerts/first-alert-and-runbook.md
 create mode 100644 docs/how-to/alerts/port-services-check-alerts.md
 create mode 100644 docs/how-to/alerts/refactor-services-check-to-query-alerts.md
diff --git a/docs/how-to/alerts/configure-grafana-alerting-pipeline.md b/docs/how-to/alerts/configure-grafana-alerting-pipeline.md
new file mode 100644
index 0000000..2c6999b
--- /dev/null
+++ b/docs/how-to/alerts/configure-grafana-alerting-pipeline.md
@@ -0,0 +1,60 @@
+---
+title: Configure Grafana Alerting Pipeline
+modified: 2026-03-22
+status: active
+tags:
+  - how-to
+  - alerting
+  - grafana
+---
+
+# Configure Grafana Alerting Pipeline
+
+Enable Grafana Unified Alerting, create an ntfy webhook contact point, configure the notification policy with anti-noise settings, and set up a message template with runbook links.
+
+## What to Do
+
+### 1. Enable Unified Alerting in grafana.ini
+
+Add the `[unified_alerting]` section to the Grafana ConfigMap. Grafana 11+ has unified alerting enabled by default, but we should be explicit and configure the evaluation interval.
+
+### 2. Create Alerting Provisioning Files
+
+Grafana supports provisioning alert resources via YAML files in `/etc/grafana/provisioning/alerting/`. Create:
+
+- **Contact point** — ntfy webhook targeting `https://ntfy.ops.eblu.me/infra-alerts` via Caddy. The cluster-internal address `http://ntfy.ntfy.svc.cluster.local:80/infra-alerts` is not usable because Grafana and ntfy run on different clusters.
+- **Notification policy** — root policy with `group_wait: 1m`, `group_interval: 12h`, `repeat_interval: 24h`, grouped by `alertname` and `service`
+- **Message template** — format that includes alert name, summary, and a clickable runbook URL as an ntfy action button
+
+### 3. Mount Provisioning into Grafana
+
+Add the alerting provisioning ConfigMap to the Grafana deployment, mounted at `/etc/grafana/provisioning/alerting/`.
+
+### 4. Create the `infra-alerts` Topic
+
+ntfy topics are created on first publish — no explicit setup needed. But verify that the topic works by sending a test notification.
+
+### 5. Verify End-to-End
+
+- Grafana UI shows the ntfy contact point under Alerting → Contact Points
+- Notification policy shows the anti-noise settings
+- Test notification from Grafana reaches the ntfy iOS app
+
+## Key Details
+
+- Grafana runs on minikube (indri), ntfy runs on k3s (ringtail). The contact point URL must go through Caddy: `https://ntfy.ops.eblu.me/infra-alerts`
+- ntfy action buttons use the `X-Actions` header or JSON body format: `view, Open Runbook, <url>`
+- Grafana provisioning files are applied on startup and cannot be edited from the UI (which is what we want for GitOps)
+
+## Verification
+
+- [ ] Grafana starts with unified alerting enabled
+- [ ] Contact point `ntfy-infra` visible in Grafana UI
+- [ ] Notification policy shows correct group/repeat intervals
+- [ ] Test notification arrives on iOS via ntfy app
+- [ ] Test notification includes a clickable runbook link
+
+## Related
+
+- [[deploy-infra-alerting]] — Parent goal
+- [[first-alert-and-runbook]] — Next: create the first real alert
diff --git a/docs/how-to/alerts/deploy-infra-alerting.md b/docs/how-to/alerts/deploy-infra-alerting.md
new file mode 100644
index 0000000..7c2e7f0
--- /dev/null
+++ b/docs/how-to/alerts/deploy-infra-alerting.md
@@ -0,0 +1,81 @@
+---
+title: Deploy Infrastructure Alerting Pipeline
+modified: 2026-03-22
+status: active
+branch: mikado/deploy-infra-alerting
+requires:
+  - refactor-services-check-to-query-alerts
+tags:
+  - how-to
+  - alerting
+  - observability
+---
+
+# Deploy Infrastructure Alerting Pipeline
+
+Replace the manual `mise run services-check` approach with Grafana Unified Alerting backed by ntfy push notifications, so infrastructure problems page once and include actionable runbook links.
+
+## Architecture
+
+```
+Prometheus (metrics) ──┐
+                       ├──▶ Grafana Alert Rules ──▶ ntfy webhook ──▶ iOS push
+Loki (logs) ──────────┘          │
+                                 │
+                          Notification Policy
+                          (group_wait: 1m,
+                           group_interval: 12h,
+                           repeat_interval: 24h)
+```
+
+## Design Decisions
+
+| Decision | Choice | Rationale |
+|----------|--------|-----------|
+| **Alert engine** | Grafana Unified Alerting | Already deployed, no new service needed |
+| **Notification** | ntfy webhook contact point | Already deployed on ringtail, iOS app works |
+| **Anti-noise** | 24h repeat interval | Page once per day max per alert group |
+| **Runbooks** | `docs/how-to/alerts/<name>.md` | Clickable link in every notification |
+| **Provisioning** | Grafana provisioning YAML (GitOps) | Alerts defined in repo, not just UI |
+| **Topic** | `infra-alerts` (separate from `frigate-alerts`) | Different severity/audience |
+
+## Alerting Policy
+
+- Each alert fires **once** and does not re-notify for 24 hours
+- A "resolved" notification is sent when the condition clears
+- Every alert annotation includes `runbook_url` linking to its how-to doc
+- The ntfy message template renders the runbook URL as a clickable action button
+- Alerts are grouped by service to avoid notification storms
+
+## Migration Path
+
+1. Stand up the pipeline: Grafana alerting config, ntfy contact point, notification policy, message template
+2. Create the first alert + runbook as proof of concept (e.g., a blackbox probe failure)
+3. Port services-check health checks to Grafana alert rules, one by one, each with a runbook
+4. Refactor services-check to query the Grafana alerting API instead of doing its own probes
+
+## What services-check Covers Today
+
+These checks will be migrated to alert rules:
+
+| Category | Checks | Data Source |
+|----------|--------|-------------|
+| Local services (indri) | forgejo, alloy, borgmatic, zot via brew/launchctl | Need new probes or textfile metrics |
+| Metrics textfiles | freshness of `.prom` files | Existing node_textfile metrics |
+| K8s cluster health | minikube API, k3s API | kube-state-metrics |
+| HTTP endpoints | ~12 services via Caddy | Alloy blackbox exporter (already exists) |
+| Ringtail | SSH, tailscale, k3s health | Need new probes |
+| K3s pods | ntfy, authentik, frigate, etc. | kube-state-metrics on ringtail |
+| Public services | docs, cv, forge via Fly.io | Alloy on Fly.io or external probe |
+| PostgreSQL | CNPG readiness | CNPG metrics (already scraped) |
+| ArgoCD sync | app sync/health status | ArgoCD metrics or API |
+
+## Related
+
+- [[configure-grafana-alerting-pipeline]] — Foundation: contact point, policy, template
+- [[first-alert-and-runbook]] — Proof of concept alert
+- [[port-services-check-alerts]] — Systematic migration
+- [[refactor-services-check-to-query-alerts]] — Final integration
+- [[observability]] — Current observability stack
+- [[ntfy]] — Push notification service
+- [[grafana]] — Dashboard and alerting platform
diff --git a/docs/how-to/alerts/first-alert-and-runbook.md b/docs/how-to/alerts/first-alert-and-runbook.md
new file mode 100644
index 0000000..bec5aaa
--- /dev/null
+++ b/docs/how-to/alerts/first-alert-and-runbook.md
@@ -0,0 +1,70 @@
+---
+title: First Alert and Runbook
+modified: 2026-03-22
+status: active
+requires:
+  - configure-grafana-alerting-pipeline
+tags:
+  - how-to
+  - alerting
+---
+
+# First Alert and Runbook
+
+Create one end-to-end alert as proof of concept — an alert rule that fires, delivers a notification to ntfy with a runbook link, and has a corresponding runbook doc.
+
+## What to Do
+
+### 1. Choose the First Alert
+
+The best candidate is a **blackbox probe failure** because:
+- Alloy's blackbox exporter already probes 5 services (miniflux, kiwix, transmission, devpi, argocd) at 30s intervals
+- The metric `probe_success` is already in Prometheus
+- It maps directly to what services-check does (HTTP health checks)
+- A single alert rule with a `service` label can cover all probed services
+
+### 2. Create the Alert Rule
+
+Provision via YAML in the alerting provisioning ConfigMap. The rule should:
+- Query `probe_success == 0` from Prometheus
+- Fire after the condition persists for 2 minutes (avoid flapping)
+- Include labels: `severity: warning`, `service: {{ $labels.instance }}`
+- Include annotations: `summary`, `runbook_url` pointing to the runbook doc
+
+### 3. Create the Runbook
+
+Write `docs/how-to/alerts/runbook-service-probe-failure.md` as a how-to doc explaining:
+- What the alert means
+- How to check which service is down
+- Common causes and resolution steps
+- How to silence the alert if the downtime is planned
+
+### 4. Verify End-to-End
+
+- Stop one of the probed services (e.g., scale miniflux to 0)
+- Wait for the alert to fire (~2 minutes)
+- Confirm ntfy notification arrives with correct summary and runbook link
+- Click the runbook link and verify it reaches docs.eblu.me
+- Scale the service back up
+- Confirm "resolved" notification arrives
+- Confirm no repeat notification during the 24h window
+
+## Key Details
+
+- Grafana alert rules can be provisioned as YAML files alongside contact points and notification policies
+- The blackbox probe metrics from Alloy use the job name `blackbox` and include an `instance` label with the service name
+- The runbook URL format: `https://docs.eblu.me/how-to/alerts/runbook-service-probe-failure`
+
+## Verification
+
+- [ ] Alert rule appears in Grafana UI under Alerting → Alert Rules
+- [ ] Simulated failure triggers ntfy notification within ~3 minutes
+- [ ] Notification includes service name, summary, and clickable runbook link
+- [ ] Resolution triggers a "resolved" notification
+- [ ] No repeat notification within 24h window
+
+## Related
+
+- [[configure-grafana-alerting-pipeline]] — Prerequisite: pipeline must be working
+- [[deploy-infra-alerting]] — Parent goal
+- [[port-services-check-alerts]] — Next: port remaining checks
diff --git a/docs/how-to/alerts/port-services-check-alerts.md b/docs/how-to/alerts/port-services-check-alerts.md
new file mode 100644
index 0000000..807c340
--- /dev/null
+++ b/docs/how-to/alerts/port-services-check-alerts.md
@@ -0,0 +1,77 @@
+---
+title: Port services-check Alerts to Grafana
+modified: 2026-03-22
+status: active
+requires:
+  - first-alert-and-runbook
+tags:
+  - how-to
+  - alerting
+---
+
+# Port services-check Alerts to Grafana
+
+Systematically migrate the health checks from `mise run services-check` to Grafana alert rules, each with a corresponding runbook. After this card, the alerting system covers everything services-check does today.
+
+## What to Do
+
+### 1. Inventory and Prioritize
+
+Map each services-check probe to a data source and alert rule. Some checks already have metrics in Prometheus; others need new instrumentation.
+
+**Already have metrics (easy):**
+- HTTP endpoint probes → Alloy blackbox exporter (`probe_success`)
+- PostgreSQL health → CNPG metrics (`cnpg_pg_replication_streaming`, `cnpg_collector_up`)
+- K8s pod health → kube-state-metrics (`kube_pod_status_phase`)
+- ArgoCD sync status → ArgoCD metrics (`argocd_app_info` with sync/health labels)
+
+**Need new probes or metrics:**
+- Local indri services (forgejo, alloy, borgmatic, zot via brew/launchctl) → Alloy host textfile or new probes
+- Metrics textfile freshness → `node_textfile_mtime_seconds` (already collected by Alloy on indri)
+- Ringtail SSH/tailscale health → Alloy blackbox on ringtail or cross-cluster probe
+- Public services (docs, cv, forge via Fly.io) → Alloy on Fly.io or Grafana synthetic monitoring
+
+### 2. Add Missing Probes
+
+Extend Alloy configurations where needed:
+- **Alloy on indri:** Add blackbox targets for forgejo, zot (local HTTP endpoints)
+- **Alloy on ringtail:** Add blackbox targets for ringtail-local services
+- **Consider:** Whether public endpoint probing belongs in Fly.io Alloy or a separate prober
+
+### 3. Create Alert Rules
+
+For each check category, create provisioned Grafana alert rules. Group related checks into alert rule groups (e.g., "indri-services", "k8s-health", "public-endpoints").
+
+### 4. Create Runbooks
+
+One runbook per alert type in `docs/how-to/alerts/runbook-<name>.md`. Each runbook should cover:
+- What the alert means
+- Diagnostic steps
+- Common fixes
+- How to silence for planned maintenance
+
+### 5. Remove from services-check
+
+As each check is ported, remove it from the services-check script (or mark it as "now handled by alerting"). The goal is that services-check shrinks as alerting grows.
+
+## Key Details
+
+- Don't try to port everything in one session — this card may span multiple work cycles within the C2 chain
+- Prioritize checks that have caught real problems in the past
+- Some checks (like ArgoCD sync status table) may remain in services-check as a human-readable summary even after alerting covers the failure cases
+- The Alloy blackbox exporter on k8s already covers 5 services; extending it to more is straightforward
+
+## Verification
+
+- [ ] All HTTP endpoint checks from services-check have corresponding alert rules
+- [ ] Pod health checks have corresponding alert rules
+- [ ] PostgreSQL health has a corresponding alert rule
+- [ ] Each alert rule has a runbook doc in `docs/how-to/alerts/`
+- [ ] Test at least 2-3 failure scenarios end-to-end
+- [ ] services-check script has been updated to reflect ported checks
+
+## Related
+
+- [[first-alert-and-runbook]] — Prerequisite: established the pattern
+- [[deploy-infra-alerting]] — Parent goal
+- [[refactor-services-check-to-query-alerts]] — Next: make services-check query alerts
diff --git a/docs/how-to/alerts/refactor-services-check-to-query-alerts.md b/docs/how-to/alerts/refactor-services-check-to-query-alerts.md
new file mode 100644
index 0000000..640bcff
--- /dev/null
+++ b/docs/how-to/alerts/refactor-services-check-to-query-alerts.md
@@ -0,0 +1,56 @@
+---
+title: Refactor services-check to Query Alerts
+modified: 2026-03-22
+status: active
+requires:
+  - port-services-check-alerts
+tags:
+  - how-to
+  - alerting
+---
+
+# Refactor services-check to Query Alerts
+
+Change `mise run services-check` from doing its own health probes to querying the Grafana alerting API for currently firing alerts. The script becomes a CLI view into the same alerting system that sends ntfy notifications.
+
+## What to Do
+
+### 1. Query the Grafana Alerting API
+
+Grafana exposes alert state via:
+- `GET /api/v1/provisioning/alert-rules` — all configured rules
+- `GET /api/prometheus/grafana/api/v1/alerts` — currently firing alerts (Prometheus-compatible format)
+
+The second endpoint is simpler: it returns only active alerts with labels and annotations, similar to Prometheus's `/api/v1/alerts`.
+
+### 2. Rewrite services-check
+
+The new services-check should:
+1. Query the Grafana alerting API for firing alerts
+2. Display them in a table with service name, alert name, duration, and runbook link
+3. If no alerts are firing, print a green "all clear" message
+4. Exit 0 if no alerts, exit 1 if any are firing
+5. Optionally keep a few checks that don't map to alerting (e.g., the ArgoCD sync status table as a summary view)
+
+### 3. Handle Authentication
+
+services-check will need a Grafana API token or service account token. Options:
+- Use the existing Grafana admin credentials from 1Password (`op read`)
+- Create a dedicated read-only service account in Grafana
+
+### 4. Preserve the ArgoCD Summary
+
+The ArgoCD sync/health table in services-check is a useful quick view even when nothing is alerting. Consider keeping it as a separate section that always displays, independent of the alert query.
+
+## Verification
+
+- [ ] `mise run services-check` queries Grafana instead of doing direct probes
+- [ ] Firing alerts are displayed with service name, alert name, and runbook link
+- [ ] Exit code reflects alert state (0 = clear, 1 = firing)
+- [ ] Fails gracefully when Grafana is unreachable (clear error message, not a crash)
+- [ ] ArgoCD summary table still works
+
+## Related
+
+- [[port-services-check-alerts]] — Prerequisite: alerts must exist to query
+- [[deploy-infra-alerting]] — Parent goal

From 261f20601a289a12e8df680dc75324fbe4ed9d4a Mon Sep 17 00:00:00 2001
From: Erich Blume <blume.erich@gmail.com>
Date: Sun, 22 Mar 2026 10:35:36 -0700
Subject: [PATCH 02/18] C2(deploy-infra-alerting): impl configure grafana
 alerting pipeline

- Enable unified alerting in grafana.ini
- Create alerting.yaml provisioning file with:
  - ntfy-infra webhook contact point (POST to ntfy.ops.eblu.me/infra-alerts)
  - Notification policy: group_wait 1m, group_interval 12h, repeat_interval 24h
  - Message templates for title and runbook links
- Mount alerting provisioning into Grafana deployment
- Add alerting.yaml to kustomization configMapGenerator

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---
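Reviewer note: the policy in this patch groups by alertname and
service, so every distinct (alertname, service) pair becomes its own
notification group with its own group_wait / repeat_interval timers.
The sketch below is a rough model of that bucketing, not Grafana's
implementation. The `instance` label and the sample alerts are
hypothetical.

```python
from collections import defaultdict

# Mirrors group_by in the provisioned notification policy.
GROUP_BY = ("alertname", "service")


def group_alerts(alerts):
    """Bucket alert label sets by the values of the group_by labels."""
    groups = defaultdict(list)
    for labels in alerts:
        # A missing label is treated as an empty value for grouping purposes.
        key = tuple(labels.get(name, "") for name in GROUP_BY)
        groups[key].append(labels)
    return dict(groups)


if __name__ == "__main__":
    sample = [
        {"alertname": "ServiceProbeFailure", "service": "miniflux", "instance": "a"},
        {"alertname": "ServiceProbeFailure", "service": "miniflux", "instance": "b"},
        {"alertname": "ServiceProbeFailure", "service": "kiwix", "instance": "c"},
    ]
    for key, members in group_alerts(sample).items():
        print(key, len(members))
```

With this grouping, two services failing at once produce two separate
notifications, while repeated failures of one service stay inside a
single group.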
 argocd/manifests/grafana/alerting.yaml      | 42 +++++++++++++++++++++
 argocd/manifests/grafana/deployment.yaml    |  3 ++
 argocd/manifests/grafana/grafana.ini        |  5 +++
 argocd/manifests/grafana/kustomization.yaml |  1 +
 4 files changed, 51 insertions(+)
 create mode 100644 argocd/manifests/grafana/alerting.yaml

diff --git a/argocd/manifests/grafana/alerting.yaml b/argocd/manifests/grafana/alerting.yaml
new file mode 100644
index 0000000..3ac33b0
--- /dev/null
+++ b/argocd/manifests/grafana/alerting.yaml
@@ -0,0 +1,42 @@
+apiVersion: 1
+
+contactPoints:
+  - orgId: 1
+    name: ntfy-infra
+    receivers:
+      - uid: ntfy-infra-webhook
+        type: webhook
+        settings:
+          url: https://ntfy.ops.eblu.me/infra-alerts
+          httpMethod: POST
+          title: >-
+            {{ template "ntfy-infra.title" . }}
+          message: >-
+            {{ template "ntfy-infra.message" . }}
+          maxAlerts: "0"
+        disableResolveMessage: false
+
+policies:
+  - orgId: 1
+    receiver: ntfy-infra
+    group_by:
+      - alertname
+      - service
+    group_wait: 1m
+    group_interval: 12h
+    repeat_interval: 24h
+
+templates:
+  - orgId: 1
+    name: ntfy-infra
+    template: |
+      {{ define "ntfy-infra.title" -}}
+      [{{ .Status | toUpper }}] {{ .CommonLabels.alertname }}
+      {{- end }}
+
+      {{ define "ntfy-infra.message" -}}
+      {{ range .Alerts -}}
+      {{ .Annotations.summary }}
+      {{ if .Annotations.runbook_url }}Runbook: {{ .Annotations.runbook_url }}{{ end }}
+      {{ end -}}
+      {{- end }}
diff --git a/argocd/manifests/grafana/deployment.yaml b/argocd/manifests/grafana/deployment.yaml
index 61a2f88..5fbb8eb 100644
--- a/argocd/manifests/grafana/deployment.yaml
+++ b/argocd/manifests/grafana/deployment.yaml
@@ -277,6 +277,9 @@ spec:
             - name: config
               mountPath: /etc/grafana/provisioning/datasources/datasources.yaml
               subPath: datasources.yaml
+            - name: config
+              mountPath: /etc/grafana/provisioning/alerting/alerting.yaml
+              subPath: alerting.yaml
             - name: storage
               mountPath: /var/lib/grafana
             - name: sc-dashboard-volume
diff --git a/argocd/manifests/grafana/grafana.ini b/argocd/manifests/grafana/grafana.ini
index 61cdd7e..a0a6db8 100644
--- a/argocd/manifests/grafana/grafana.ini
+++ b/argocd/manifests/grafana/grafana.ini
@@ -30,3 +30,8 @@ allow_embedding = false
 
 [server]
 root_url = https://grafana.ops.eblu.me
+
+[unified_alerting]
+enabled = true
+evaluation_timeout = 30s
+min_interval = 10s
diff --git a/argocd/manifests/grafana/kustomization.yaml b/argocd/manifests/grafana/kustomization.yaml
index c052bf9..3aeaa26 100644
--- a/argocd/manifests/grafana/kustomization.yaml
+++ b/argocd/manifests/grafana/kustomization.yaml
@@ -25,6 +25,7 @@ configMapGenerator:
     files:
       - grafana.ini
       - datasources.yaml
+      - alerting.yaml
     options:
       labels:
         app.kubernetes.io/name: grafana

From c1acc808d5d943cc33fc94c82ab130b7bc360970 Mon Sep 17 00:00:00 2001
From: Erich Blume <blume.erich@gmail.com>
Date: Sun, 22 Mar 2026 10:51:02 -0700
Subject: [PATCH 03/18] C2(deploy-infra-alerting): close
 configure-grafana-alerting-pipeline

Pipeline verified:
- Grafana unified alerting enabled with provisioned contact point and policy
- ntfy webhook contact point delivering to infra-alerts topic
- Notification policy: group_wait 1m, group_interval 12h, repeat_interval 24h
- iOS push notifications confirmed working

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---
 docs/how-to/alerts/configure-grafana-alerting-pipeline.md | 1 -
 1 file changed, 1 deletion(-)

diff --git a/docs/how-to/alerts/configure-grafana-alerting-pipeline.md b/docs/how-to/alerts/configure-grafana-alerting-pipeline.md
index 2c6999b..eb90128 100644
--- a/docs/how-to/alerts/configure-grafana-alerting-pipeline.md
+++ b/docs/how-to/alerts/configure-grafana-alerting-pipeline.md
@@ -1,7 +1,6 @@
 ---
 title: Configure Grafana Alerting Pipeline
 modified: 2026-03-22
-status: active
 tags:
   - how-to
   - alerting

From 549c57ab82127aaa1e6ee4b6db57f5e8bf504355 Mon Sep 17 00:00:00 2001
From: Erich Blume <blume.erich@gmail.com>
Date: Sun, 22 Mar 2026 10:57:23 -0700
Subject: [PATCH 04/18] C2(deploy-infra-alerting): impl add first alert rule
 and runbook

- Add ServiceProbeFailure alert rule to Grafana alerting provisioning
  - Queries probe_success metric from Alloy blackbox exporter
  - Extracts service name from job label via label_replace
  - Fires after 2 minutes of failure, noDataState=Alerting
  - Annotations include summary with service name and runbook URL
- Add runbook at docs/how-to/alerts/runbook-service-probe-failure.md
  - Covers all 5 probed services (miniflux, kiwix, transmission, devpi, argocd)
  - Diagnostic steps, common causes, silencing instructions
- Add alerting section to observability.md reference doc

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---
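Reviewer note: the rule derives the `service` label from `job` with
label_replace. The sketch below mimics that step in plain Python to
make the behaviour easy to check: the regex must match the whole source
value, and a series whose `job` does not match is passed through
unchanged. It only supports the `$1` placeholder with one capture
group, which is a simplification of the PromQL function.

```python
import re


def label_replace(labels, dst, replacement, src, regex):
    """Simplified model of PromQL label_replace for a single series."""
    # PromQL regexes are fully anchored, hence fullmatch.
    match = re.fullmatch(regex, labels.get(src, ""))
    if match is None:
        # No match: the series is returned without the new label.
        return dict(labels)
    out = dict(labels)
    out[dst] = replacement.replace("$1", match.group(1))
    return out


if __name__ == "__main__":
    series = {"job": "integrations/blackbox/miniflux"}
    print(label_replace(series, "service", "$1", "job", "integrations/blackbox/(.*)"))
```

One consequence worth keeping in mind: a probe whose job name falls
outside the `integrations/blackbox/` prefix would reach the alert
without a `service` label, so the summary annotation would render with
an empty service name.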
 argocd/manifests/grafana/alerting.yaml        | 49 ++++++++++++
 .../alerts/runbook-service-probe-failure.md   | 75 +++++++++++++++++++
 docs/reference/operations/observability.md    |  7 +-
 3 files changed, 130 insertions(+), 1 deletion(-)
 create mode 100644 docs/how-to/alerts/runbook-service-probe-failure.md

diff --git a/argocd/manifests/grafana/alerting.yaml b/argocd/manifests/grafana/alerting.yaml
index 3ac33b0..3fe4b1c 100644
--- a/argocd/manifests/grafana/alerting.yaml
+++ b/argocd/manifests/grafana/alerting.yaml
@@ -26,6 +26,55 @@ policies:
     group_interval: 12h
     repeat_interval: 24h
 
+groups:
+  - orgId: 1
+    name: service-health
+    folder: Infrastructure Alerts
+    interval: 30s
+    rules:
+      - uid: service-probe-failure
+        title: ServiceProbeFailure
+        condition: B
+        for: 2m
+        noDataState: Alerting
+        execErrState: Alerting
+        annotations:
+          summary: >-
+            {{ index $labels "service" }} health check is failing
+          runbook_url: https://docs.eblu.me/how-to/alerts/runbook-service-probe-failure
+        labels:
+          severity: warning
+        data:
+          - refId: A
+            datasourceUid: prometheus
+            relativeTimeRange:
+              from: 300
+              to: 0
+            model:
+              expr: >-
+                label_replace(probe_success, "service",
+                "$1", "job", "integrations/blackbox/(.*)")
+              interval: ""
+              refId: A
+          - refId: B
+            datasourceUid: "__expr__"
+            relativeTimeRange:
+              from: 0
+              to: 0
+            model:
+              type: threshold
+              expression: A
+              conditions:
+                - evaluator:
+                    type: lt
+                    params:
+                      - 1
+                  operator:
+                    type: and
+                  reducer:
+                    type: last
+              refId: B
+
 templates:
   - orgId: 1
     name: ntfy-infra
diff --git a/docs/how-to/alerts/runbook-service-probe-failure.md b/docs/how-to/alerts/runbook-service-probe-failure.md
new file mode 100644
index 0000000..575606e
--- /dev/null
+++ b/docs/how-to/alerts/runbook-service-probe-failure.md
@@ -0,0 +1,75 @@
+---
+title: "Runbook: Service Probe Failure"
+modified: 2026-03-22
+tags:
+  - how-to
+  - alerting
+  - runbook
+---
+
+# Runbook: Service Probe Failure
+
+**Alert name:** `ServiceProbeFailure`
+
+A blackbox HTTP health check has failed for 2+ minutes, meaning a service is not responding to its health endpoint.
+
+## Affected Services
+
+This alert covers services probed by the Alloy blackbox exporter on indri's minikube cluster:
+
+| Service | Health Endpoint |
+|---------|----------------|
+| miniflux | `/healthcheck` |
+| kiwix | `/` |
+| transmission | `/transmission/web/` |
+| devpi | `/+api` |
+| argocd | `/healthz` |
+
+The failing service is identified by the `service` label in the alert (extracted from the `job` label).
+
+## Diagnostic Steps
+
+1. **Check which service is down** — the alert label `service` tells you. You can also run:
+   ```fish
+   kubectl get pods -n <namespace> --context=minikube-indri
+   ```
+
+2. **Check pod status** — look for CrashLoopBackOff, OOMKilled, or pending pods:
+   ```fish
+   kubectl describe pod -n <namespace> <pod-name> --context=minikube-indri
+   ```
+
+3. **Check pod logs**:
+   ```fish
+   kubectl logs -n <namespace> <pod-name> --context=minikube-indri --tail=50
+   ```
+
+4. **Check if minikube itself is healthy**:
+   ```fish
+   ssh indri 'minikube status'
+   ```
+
+5. **Check NFS mounts** (kiwix, transmission depend on sifaka NFS):
+   ```fish
+   ssh indri 'df -h | grep Volumes'
+   ```
+
+## Common Causes
+
+- **Pod crashed** — check logs, restart with `kubectl delete pod`
+- **NFS mount lost** — sifaka offline or AutoMounter not running. SSH to indri and check `/Volumes/`
+- **Resource exhaustion** — check `kubectl top pods -n <namespace>` for memory/CPU pressure
+- **Minikube paused/stopped** — `ssh indri 'minikube status'`, restart if needed
+
+## Silencing
+
+For planned maintenance, silence this alert in Grafana:
+1. Go to Alerting → Silences → Create Silence
+2. Match label `alertname = ServiceProbeFailure`
+3. Optionally match `service = <specific-service>` to silence only one
+4. Set duration for your maintenance window
+
+## Related
+
+- [[deploy-infra-alerting]] — Alerting pipeline overview
+- [[configure-grafana-alerting-pipeline]] — Pipeline configuration
diff --git a/docs/reference/operations/observability.md b/docs/reference/operations/observability.md
index 5890147..852f5d3 100644
--- a/docs/reference/operations/observability.md
+++ b/docs/reference/operations/observability.md
@@ -1,6 +1,6 @@
 ---
 title: Observability
-modified: 2026-02-07
+modified: 2026-03-22
 tags:
   - operations
 ---
@@ -16,3 +16,8 @@ Metrics, logs, traces, and dashboards for BlumeOps infrastructure.
 - [[tempo]] - Distributed tracing
 - [[alloy|Alloy]] - Metrics, log, and trace collection
 - [[grafana]] - Dashboards and visualization
+
+## Alerting
+
+- [[deploy-infra-alerting]] - Alerting pipeline (Grafana Unified Alerting → ntfy)
+- [[runbook-service-probe-failure]] - Service health check failure runbook

From 94413f73ba4c74d05d789076569d9a7db71b25a8 Mon Sep 17 00:00:00 2001
From: Erich Blume <blume.erich@gmail.com>
Date: Sun, 22 Mar 2026 11:00:21 -0700
Subject: [PATCH 05/18] C2(deploy-infra-alerting): impl fix alert rule
 multi-series evaluation
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Add reduce step between Prometheus query and threshold to preserve
per-service labels. Without it, Grafana can't distinguish the 5
probe_success series and errors with "duplicate results with labels {}".

Chain: A (prometheus query) → B (reduce last) → C (threshold < 1)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---
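Reviewer note: the A -> B -> C chain can be pictured as two small
per-series steps. The sketch below approximates "reduce last" with
non-numeric samples dropped, followed by "threshold < 1", keyed by
service name so each series keeps its identity. Omitting a series that
has no numeric samples is a simplification made here and may differ
from what Grafana does in that case.

```python
import math


def reduce_last(series):
    """Per series, keep the last finite sample (approximates reducer=last, mode=dropNN)."""
    reduced = {}
    for service, samples in series.items():
        numeric = [v for v in samples if v is not None and math.isfinite(v)]
        if numeric:
            reduced[service] = numeric[-1]
    return reduced


def threshold_lt(values, limit):
    """Per series, True when the reduced value is below the limit (alert condition)."""
    return {service: value < limit for service, value in values.items()}


if __name__ == "__main__":
    # Hypothetical probe_success samples over the query window.
    window = {
        "miniflux": [1.0, 1.0, 0.0],
        "kiwix": [1.0, float("nan"), 1.0],
    }
    print(threshold_lt(reduce_last(window), 1))
```

Because both steps return one value per service, the condition is
evaluated once per probe instead of once for an unlabeled set.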
 argocd/manifests/grafana/alerting.yaml | 20 +++++++++++++++-----
 1 file changed, 15 insertions(+), 5 deletions(-)

diff --git a/argocd/manifests/grafana/alerting.yaml b/argocd/manifests/grafana/alerting.yaml
index 3fe4b1c..a05ddee 100644
--- a/argocd/manifests/grafana/alerting.yaml
+++ b/argocd/manifests/grafana/alerting.yaml
@@ -34,7 +34,7 @@ groups:
     rules:
       - uid: service-probe-failure
         title: ServiceProbeFailure
-        condition: B
+        condition: C
         for: 2m
         noDataState: Alerting
         execErrState: Alerting
@@ -62,8 +62,20 @@ groups:
               from: 0
               to: 0
             model:
-              type: threshold
+              type: reduce
               expression: A
+              reducer: last
+              settings:
+                mode: dropNN
+              refId: B
+          - refId: C
+            datasourceUid: "__expr__"
+            relativeTimeRange:
+              from: 0
+              to: 0
+            model:
+              type: threshold
+              expression: B
               conditions:
                 - evaluator:
                     type: lt
@@ -71,9 +83,7 @@ groups:
                       - 1
                   operator:
                     type: and
-                  reducer:
-                    type: last
-              refId: B
+              refId: C
 
 templates:
   - orgId: 1

From 4c0bd0055f3855c19544e1f96296020d97651f23 Mon Sep 17 00:00:00 2001
From: Erich Blume <blume.erich@gmail.com>
Date: Sun, 22 Mar 2026 11:59:26 -0700
Subject: [PATCH 06/18] C2(deploy-infra-alerting): impl use custom payload for
 ntfy-native JSON

Use the correct provisioning field name for Grafana webhook custom
payloads: settings.payload.template (not payloadTemplate).

Found by reading the Go source (grafana/alerting receivers/webhook/v1/config.go):
  Payload *CustomPayload `json:"payload,omitempty"`
  CustomPayload.Template string `json:"template,omitempty"`

The template uses coll.Dict, coll.Append, and data.ToJSON to produce
ntfy-native JSON with topic, title, message, priority, and action
buttons linking to runbooks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---
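Reviewer note: the template in this patch assembles one JSON object per
notification. The Python sketch below builds the same shape (topic,
title, message, priority, actions) so the expected body is easy to
inspect outside Grafana. The function name and the flat per-alert dicts
are hypothetical stand-ins for the template data.

```python
import json


def ntfy_payload(status, alertname, alerts, topic="infra-alerts", priority=3):
    """Build an ntfy JSON body mirroring the ntfy-infra.payload template."""
    # One summary per line, matching the printf "%s%s\n" accumulation.
    message = "".join(f"{a.get('summary', '')}\n" for a in alerts)
    # One "view" action per alert that carries a runbook URL.
    actions = [
        {"action": "view", "label": "Open Runbook", "url": a["runbook_url"]}
        for a in alerts
        if a.get("runbook_url")
    ]
    return json.dumps({
        "topic": topic,
        "title": f"[{status.upper()}] {alertname}",
        "message": message,
        "priority": priority,
        "actions": actions,
    })


if __name__ == "__main__":
    print(ntfy_payload(
        "firing",
        "ServiceProbeFailure",
        [{
            "summary": "miniflux health check is failing",
            "runbook_url": "https://docs.eblu.me/how-to/alerts/runbook-service-probe-failure",
        }],
    ))
```

ntfy's documentation describes a limit of three action buttons per
message, so if a group ever carries more than three alerts with runbook
URLs, deduplicating the URLs before building `actions` may be worth
considering.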
 argocd/manifests/grafana/alerting.yaml | 31 ++++++++++++++------------
 1 file changed, 17 insertions(+), 14 deletions(-)

diff --git a/argocd/manifests/grafana/alerting.yaml b/argocd/manifests/grafana/alerting.yaml
index a05ddee..a190039 100644
--- a/argocd/manifests/grafana/alerting.yaml
+++ b/argocd/manifests/grafana/alerting.yaml
@@ -7,13 +7,12 @@ contactPoints:
       - uid: ntfy-infra-webhook
         type: webhook
         settings:
-          url: https://ntfy.ops.eblu.me/infra-alerts
+          url: https://ntfy.ops.eblu.me
           httpMethod: POST
-          title: >-
-            {{ template "ntfy-infra.title" . }}
-          message: >-
-            {{ template "ntfy-infra.message" . }}
           maxAlerts: "0"
+          payload:
+            template: >-
+              {{ template "ntfy-infra.payload" . }}
         disableResolveMessage: false
 
 policies:
@@ -89,13 +88,17 @@ templates:
   - orgId: 1
     name: ntfy-infra
     template: |
-      {{ define "ntfy-infra.title" -}}
-      [{{ .Status | toUpper }}] {{ .CommonLabels.alertname }}
-      {{- end }}
-
-      {{ define "ntfy-infra.message" -}}
-      {{ range .Alerts -}}
-      {{ .Annotations.summary }}
-      {{ if .Annotations.runbook_url }}Runbook: {{ .Annotations.runbook_url }}{{ end }}
-      {{ end -}}
+      {{ define "ntfy-infra.payload" -}}
+      {{- $msg := "" -}}
+      {{- range .Alerts -}}
+        {{- $msg = (printf "%s%s\n" $msg .Annotations.summary) -}}
+      {{- end -}}
+      {{- $title := (printf "[%s] %s" (.Status | toUpper) .CommonLabels.alertname) -}}
+      {{- $actions := coll.Slice -}}
+      {{- range .Alerts -}}
+        {{- if .Annotations.runbook_url -}}
+          {{- $actions = coll.Append (coll.Dict "action" "view" "label" "Open Runbook" "url" .Annotations.runbook_url) $actions -}}
+        {{- end -}}
+      {{- end -}}
+      {{- coll.Dict "topic" "infra-alerts" "title" $title "message" $msg "priority" 3 "actions" $actions | data.ToJSON -}}
       {{- end }}

From e33b0bc184b93a5acac68a2b877613229ae6d26b Mon Sep 17 00:00:00 2001
From: Erich Blume <blume.erich@gmail.com>
Date: Sun, 22 Mar 2026 12:05:43 -0700
Subject: [PATCH 07/18] C2(deploy-infra-alerting): close
 first-alert-and-runbook

End-to-end alerting pipeline verified:
- ServiceProbeFailure alert rule evaluates 5 blackbox probes
- Grafana custom payload produces ntfy-native JSON (topic, title,
  message, priority, actions)
- Firing notification arrives on iOS with clean formatting
- "Open Runbook" action button links to docs.eblu.me runbook
- Resolved notification delivered on recovery

Key discovery: Grafana webhook custom payload provisioning field is
settings.payload.template (nested object), not payloadTemplate
(flat string). Found by reading grafana/alerting Go source.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---
 docs/how-to/alerts/first-alert-and-runbook.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/how-to/alerts/first-alert-and-runbook.md b/docs/how-to/alerts/first-alert-and-runbook.md
index bec5aaa..71b86bf 100644
--- a/docs/how-to/alerts/first-alert-and-runbook.md
+++ b/docs/how-to/alerts/first-alert-and-runbook.md
@@ -1,7 +1,6 @@
 ---
 title: First Alert and Runbook
 modified: 2026-03-22
-status: active
 requires:
   - configure-grafana-alerting-pipeline
 tags:
@@ -68,3 +67,4 @@ Write `docs/how-to/alerts/runbook-service-probe-failure.md` as a how-to doc expl
 - [[configure-grafana-alerting-pipeline]] — Prerequisite: pipeline must be working
 - [[deploy-infra-alerting]] — Parent goal
 - [[port-services-check-alerts]] — Next: port remaining checks
+- [[runbook-service-probe-failure]] — The runbook created for this alert

From 8e6a803076cddd89d8d5e0867e3319415800a80f Mon Sep 17 00:00:00 2001
From: Erich Blume <blume.erich@gmail.com>
Date: Sun, 22 Mar 2026 12:11:12 -0700
Subject: [PATCH 08/18] C2(deploy-infra-alerting): impl add probes and alert
 rules for services-check coverage

Extend Alloy blackbox probes:
- Add prometheus, loki, grafana, teslamate, immich, navidrome
- Now probing 11 services (was 5), covering most HTTP checks from
  services-check

Add alert rules:
- PostgresClusterUnhealthy: cnpg_collector_up < 1 for 3m (critical)
- PodNotReady: kube_pod_status_ready{condition="true"} == 0 for 5m

Add runbooks:
- runbook-postgres-unhealthy.md
- runbook-pod-not-ready.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---
 argocd/manifests/alloy-k8s/config.alloy       |  37 ++++++
 argocd/manifests/grafana/alerting.yaml        | 114 ++++++++++++++++++
 docs/how-to/alerts/runbook-pod-not-ready.md   |  55 +++++++++
 .../alerts/runbook-postgres-unhealthy.md      |  63 ++++++++++
 docs/reference/operations/observability.md    |   2 +
 5 files changed, 271 insertions(+)
 create mode 100644 docs/how-to/alerts/runbook-pod-not-ready.md
 create mode 100644 docs/how-to/alerts/runbook-postgres-unhealthy.md
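
Each rule added here follows the same three-step shape: query (refId A), reduce to the last value (refId B), then threshold (refId C). The Python sketch below is a simplified model of that pipeline, not Grafana's implementation. The sample values and the `example-healthy` series are hypothetical, and treating `mode: dropNN` as "drop null, NaN, and infinite samples before reducing" is an assumption about its behavior.

```python
import math


def reduce_last(samples):
    """Return the last numeric sample, dropping None/NaN/Inf first."""
    numeric = [v for v in samples if v is not None and math.isfinite(v)]
    return numeric[-1] if numeric else None


def threshold_lt(value, limit):
    """True when the reduced value is below the limit (evaluator type: lt)."""
    return value is not None and value < limit


# Hypothetical samples for refId A, keyed by the "cluster" label.
query_a = {
    "blumeops-pg": [1.0, 1.0, 0.0],
    "example-healthy": [1.0, float("nan"), 1.0],
}

# refId B: reduce each series to a single number.
reduced_b = {name: reduce_last(samples) for name, samples in query_a.items()}
# refId C: compare each reduced value against the threshold of 1.
condition_c = {name: threshold_lt(value, 1) for name, value in reduced_b.items()}
print(condition_c)
```

In this model only the series whose last value is below 1 meets the condition; the `for:` duration on the rule then decides how long that must hold before a notification is sent.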

diff --git a/argocd/manifests/alloy-k8s/config.alloy b/argocd/manifests/alloy-k8s/config.alloy
index c169c93..667f735 100644
--- a/argocd/manifests/alloy-k8s/config.alloy
+++ b/argocd/manifests/alloy-k8s/config.alloy
@@ -169,6 +169,43 @@ prometheus.exporter.blackbox "services" {
     address = "http://argocd-server.argocd.svc.cluster.local:80/healthz"
     module  = "http_2xx"
   }
+
+  target {
+    name    = "prometheus"
+    address = "http://prometheus.monitoring.svc.cluster.local:9090/-/healthy"
+    module  = "http_2xx"
+  }
+
+  target {
+    name    = "loki"
+    address = "http://loki.monitoring.svc.cluster.local:3100/ready"
+    module  = "http_2xx"
+  }
+
+  target {
+    name    = "grafana"
+    address = "http://grafana.monitoring.svc.cluster.local:3000/api/health"
+    module  = "http_2xx"
+  }
+
+  target {
+    name    = "teslamate"
+    address = "http://teslamate.teslamate.svc.cluster.local:4000/"
+    module  = "http_2xx"
+  }
+
+  target {
+    name    = "immich"
+    address = "http://immich-server.immich.svc.cluster.local:2283/api/server/ping"
+    module  = "http_2xx"
+  }
+
+  target {
+    name    = "navidrome"
+    address = "http://navidrome.navidrome.svc.cluster.local:4533/"
+    module  = "http_2xx"
+  }
+
 }
 
 // Scrape blackbox probe results
diff --git a/argocd/manifests/grafana/alerting.yaml b/argocd/manifests/grafana/alerting.yaml
index a190039..47f2ec6 100644
--- a/argocd/manifests/grafana/alerting.yaml
+++ b/argocd/manifests/grafana/alerting.yaml
@@ -84,6 +84,120 @@ groups:
                     type: and
               refId: C
 
+  - orgId: 1
+    name: database-health
+    folder: Infrastructure Alerts
+    interval: 60s
+    rules:
+      - uid: postgres-cluster-unhealthy
+        title: PostgresClusterUnhealthy
+        condition: C
+        for: 3m
+        noDataState: Alerting
+        execErrState: Alerting
+        annotations:
+          summary: >-
+            PostgreSQL cluster {{ index $labels "cluster" }} is unhealthy
+          runbook_url: https://docs.eblu.me/how-to/alerts/runbook-postgres-unhealthy
+        labels:
+          severity: critical
+          service: postgresql
+        data:
+          - refId: A
+            datasourceUid: prometheus
+            relativeTimeRange:
+              from: 300
+              to: 0
+            model:
+              expr: cnpg_collector_up
+              interval: ""
+              refId: A
+          - refId: B
+            datasourceUid: "__expr__"
+            relativeTimeRange:
+              from: 0
+              to: 0
+            model:
+              type: reduce
+              expression: A
+              reducer: last
+              settings:
+                mode: dropNN
+              refId: B
+          - refId: C
+            datasourceUid: "__expr__"
+            relativeTimeRange:
+              from: 0
+              to: 0
+            model:
+              type: threshold
+              expression: B
+              conditions:
+                - evaluator:
+                    type: lt
+                    params:
+                      - 1
+                  operator:
+                    type: and
+              refId: C
+
+  - orgId: 1
+    name: pod-health
+    folder: Infrastructure Alerts
+    interval: 60s
+    rules:
+      - uid: pod-not-ready
+        title: PodNotReady
+        condition: C
+        for: 5m
+        noDataState: NoData
+        execErrState: Alerting
+        annotations:
+          summary: >-
+            Pod {{ index $labels "pod" }} in {{ index $labels "namespace" }} is not ready
+          runbook_url: https://docs.eblu.me/how-to/alerts/runbook-pod-not-ready
+        labels:
+          severity: warning
+        data:
+          - refId: A
+            datasourceUid: prometheus
+            relativeTimeRange:
+              from: 300
+              to: 0
+            model:
+              expr: >-
+                kube_pod_status_ready{condition="true"} == 0
+              interval: ""
+              refId: A
+          - refId: B
+            datasourceUid: "__expr__"
+            relativeTimeRange:
+              from: 0
+              to: 0
+            model:
+              type: reduce
+              expression: A
+              reducer: last
+              settings:
+                mode: dropNN
+              refId: B
+          - refId: C
+            datasourceUid: "__expr__"
+            relativeTimeRange:
+              from: 0
+              to: 0
+            model:
+              type: threshold
+              expression: B
+              conditions:
+                - evaluator:
+                    type: lt
+                    params:
+                      - 1
+                  operator:
+                    type: and
+              refId: C
+
 templates:
   - orgId: 1
     name: ntfy-infra
diff --git a/docs/how-to/alerts/runbook-pod-not-ready.md b/docs/how-to/alerts/runbook-pod-not-ready.md
new file mode 100644
index 0000000..49dd35e
--- /dev/null
+++ b/docs/how-to/alerts/runbook-pod-not-ready.md
@@ -0,0 +1,55 @@
+---
+title: "Runbook: Pod Not Ready"
+modified: 2026-03-22
+tags:
+  - how-to
+  - alerting
+  - runbook
+---
+
+# Runbook: Pod Not Ready
+
+**Alert name:** `PodNotReady`
+
+A Kubernetes pod has been in a not-ready state for 5+ minutes.
+
+## Diagnostic Steps
+
+1. **Identify the pod** from the alert labels (`pod`, `namespace`):
+   ```fish
+   kubectl describe pod <pod> -n <namespace> --context=minikube-indri
+   ```
+
+2. **Check events** — look for scheduling failures, image pull errors, or probe failures:
+   ```fish
+   kubectl get events -n <namespace> --context=minikube-indri --sort-by='.lastTimestamp' | tail -20
+   ```
+
+3. **Check logs**:
+   ```fish
+   kubectl logs <pod> -n <namespace> --context=minikube-indri --tail=50
+   ```
+
+4. **Check node resources**:
+   ```fish
+   kubectl top nodes --context=minikube-indri
+   kubectl top pods -n <namespace> --context=minikube-indri
+   ```
+
+## Common Causes
+
+- **CrashLoopBackOff** — app is crashing on startup, check logs
+- **ImagePullBackOff** — container image not found or registry unreachable
+- **Pending** — insufficient resources (CPU/memory), or PVC not bound
+- **Readiness probe failing** — service is running but not healthy
+- **NFS mount issue** — services depending on sifaka (kiwix, transmission, navidrome, jellyfin) will fail if NFS is down
+
+## Silencing
+
+1. Grafana → Alerting → Silences → Create Silence
+2. Match `alertname = PodNotReady`
+3. Optionally match `namespace = <namespace>` to silence a specific service
+
+## Related
+
+- [[deploy-infra-alerting]] — Alerting pipeline overview
diff --git a/docs/how-to/alerts/runbook-postgres-unhealthy.md b/docs/how-to/alerts/runbook-postgres-unhealthy.md
new file mode 100644
index 0000000..2910851
--- /dev/null
+++ b/docs/how-to/alerts/runbook-postgres-unhealthy.md
@@ -0,0 +1,63 @@
+---
+title: "Runbook: PostgreSQL Cluster Unhealthy"
+modified: 2026-03-22
+tags:
+  - how-to
+  - alerting
+  - runbook
+---
+
+# Runbook: PostgreSQL Cluster Unhealthy
+
+**Alert name:** `PostgresClusterUnhealthy`
+
+The CNPG collector metrics endpoint is down, indicating the PostgreSQL cluster is not responding.
+
+## Affected Services
+
+The `blumeops-pg` CNPG cluster on indri's minikube runs databases for:
+- TeslaMate
+- Authentik (cross-cluster from ringtail)
+- Immich
+- Grafana dashboards (TeslaMate datasource)
+
+## Diagnostic Steps
+
+1. **Check CNPG cluster status**:
+   ```fish
+   kubectl get cluster blumeops-pg -n databases --context=minikube-indri
+   kubectl get pods -n databases -l cnpg.io/cluster=blumeops-pg --context=minikube-indri
+   ```
+
+2. **Check pod logs**:
+   ```fish
+   kubectl logs -n databases -l cnpg.io/cluster=blumeops-pg --context=minikube-indri --tail=30
+   ```
+
+3. **Check whether PostgreSQL accepts connections** (`pg_isready`):
+   ```fish
+   pg_isready -h pg.ops.eblu.me -p 5432
+   ```
+
+4. **Check PVC storage**:
+   ```fish
+   kubectl get pvc -n databases --context=minikube-indri
+   ```
+
+## Common Causes
+
+- **Pod crash** — OOM, disk full, or configuration error
+- **PVC storage full** — check with `kubectl exec` into the pod and `df -h`
+- **Minikube issue** — if the node is under memory pressure, CNPG pods may be evicted
+- **Network** — Caddy L4 proxy (`pg.ops.eblu.me`) may be misconfigured
+
+## Silencing
+
+For planned database maintenance:
+1. Grafana → Alerting → Silences → Create Silence
+2. Match `alertname = PostgresClusterUnhealthy`
+
+## Related
+
+- [[postgresql]] — CNPG cluster reference
+- [[deploy-infra-alerting]] — Alerting pipeline overview
diff --git a/docs/reference/operations/observability.md b/docs/reference/operations/observability.md
index 852f5d3..1aae7b9 100644
--- a/docs/reference/operations/observability.md
+++ b/docs/reference/operations/observability.md
@@ -21,3 +21,5 @@ Metrics, logs, traces, and dashboards for BlumeOps infrastructure.
 
 - [[deploy-infra-alerting]] - Alerting pipeline (Grafana Unified Alerting → ntfy)
 - [[runbook-service-probe-failure]] - Service health check failure runbook
+- [[runbook-postgres-unhealthy]] - PostgreSQL cluster health runbook
+- [[runbook-pod-not-ready]] - Pod not ready runbook

From 02e07aeb41c6ab427a9335dcb4f110371b10db8e Mon Sep 17 00:00:00 2001
From: Erich Blume <blume.erich@gmail.com>
Date: Sun, 22 Mar 2026 12:19:33 -0700
Subject: [PATCH 09/18] C2(deploy-infra-alerting): impl exclude Job-owned pods
 from PodNotReady

CronJob pods (e.g., zim-watcher) are expected to complete and become
not-ready. Exclude them with `unless on (namespace, pod) kube_pod_owner{owner_kind="Job"}`.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---
 argocd/manifests/grafana/alerting.yaml | 2 ++
 1 file changed, 2 insertions(+)
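
The `unless on (namespace, pod)` clause behaves like a set difference keyed on the listed labels: a left-hand series is dropped when any right-hand series has the same `namespace` and `pod`. A minimal Python sketch of that matching, using hypothetical pod and namespace names (`unless_on` is an illustrative helper, not a real API):

```python
def unless_on(left, right, keys):
    """Keep left-hand series with no right-hand series sharing the `keys` labels."""

    def match_key(labels):
        # A label missing from a series is treated as the empty string.
        return tuple(labels.get(k, "") for k in keys)

    excluded = {match_key(labels) for labels in right}
    return [labels for labels in left if match_key(labels) not in excluded]


# Hypothetical label sets, for illustration only.
not_ready = [
    {"namespace": "example-ns", "pod": "zim-watcher-12345-abcde"},
    {"namespace": "example-ns", "pod": "web-0"},
]
job_owned = [
    {"namespace": "example-ns", "pod": "zim-watcher-12345-abcde", "owner_kind": "Job"},
]

remaining = unless_on(not_ready, job_owned, ("namespace", "pod"))
print(remaining)
```

Extra labels on the right-hand side, such as `owner_kind`, do not affect the match because only the `on (...)` labels are compared.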

diff --git a/argocd/manifests/grafana/alerting.yaml b/argocd/manifests/grafana/alerting.yaml
index 47f2ec6..26370a3 100644
--- a/argocd/manifests/grafana/alerting.yaml
+++ b/argocd/manifests/grafana/alerting.yaml
@@ -167,6 +167,8 @@ groups:
             model:
               expr: >-
                 kube_pod_status_ready{condition="true"} == 0
+                unless on (namespace, pod)
+                kube_pod_owner{owner_kind="Job"}
               interval: ""
               refId: A
           - refId: B

From b2b0d6efa7715b695f3e601c151f42efcb68d1c5 Mon Sep 17 00:00:00 2001
From: Erich Blume <blume.erich@gmail.com>
Date: Sun, 22 Mar 2026 13:35:24 -0700
Subject: [PATCH 10/18] C2(deploy-infra-alerting): impl fix PodNotReady
 noDataState to OK

When no pods are unhealthy, the query returns no series, so the rule
enters the NoData state. With noDataState set to NoData, Grafana sends
a notification with empty labels ("Pod in is not ready"). Change it to
OK, since an empty result here means no pod is failing readiness.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---
 argocd/manifests/grafana/alerting.yaml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
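
The effect of `noDataState` can be modeled as a fallback that applies only when the query returns no series. The sketch below is a deliberately simplified model, assuming the state names `Normal`, `Alerting`, and `NoData`; it ignores the `for:` pending period and error handling, and `rule_state` is a hypothetical helper.

```python
def rule_state(condition_by_series, no_data_state):
    """Return a simplified rule state for one evaluation.

    condition_by_series maps a series name to whether its condition is met.
    An empty mapping stands for a query that returned no series.
    """
    if not condition_by_series:
        # No series at all: the configured noDataState decides the outcome.
        return "Normal" if no_data_state == "OK" else no_data_state
    return "Alerting" if any(condition_by_series.values()) else "Normal"


# A filtered query such as `... == 0` returns nothing when all pods are ready.
before = rule_state({}, "NoData")
after = rule_state({}, "OK")
print(before, after)
```

With a query that filters inside PromQL, "no series" is the healthy case, which is why `OK` is the fitting fallback for this rule.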

diff --git a/argocd/manifests/grafana/alerting.yaml b/argocd/manifests/grafana/alerting.yaml
index 26370a3..3714bf1 100644
--- a/argocd/manifests/grafana/alerting.yaml
+++ b/argocd/manifests/grafana/alerting.yaml
@@ -150,7 +150,7 @@ groups:
         title: PodNotReady
         condition: C
         for: 5m
-        noDataState: NoData
+        noDataState: OK
         execErrState: Alerting
         annotations:
           summary: >-

From 2fa536e547764c1d16cc9537146df0de843a92ff Mon Sep 17 00:00:00 2001
From: Erich Blume <blume.erich@gmail.com>
Date: Sun, 22 Mar 2026 13:43:16 -0700
Subject: [PATCH 11/18] C2(deploy-infra-alerting): impl add textfile staleness
 and Frigate alerts

- TextfileStale: fires when a .prom textfile on indri hasn't been
  updated in over 1 hour (node_textfile_mtime_seconds) and stays stale
  for 15m. Covers borgmatic, zot, minikube, jellyfin exporters.
- FrigateCameraDown: fires when frigate_camera_fps stays below 1 for 5m.
- Add runbooks for both alerts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---
 argocd/manifests/grafana/alerting.yaml        | 115 ++++++++++++++++++
 .../alerts/runbook-frigate-camera-down.md     |  39 ++++++
 docs/how-to/alerts/runbook-textfile-stale.md  |  58 +++++++++
 docs/reference/operations/observability.md    |   2 +
 4 files changed, 214 insertions(+)
 create mode 100644 docs/how-to/alerts/runbook-frigate-camera-down.md
 create mode 100644 docs/how-to/alerts/runbook-textfile-stale.md
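
The staleness check in `TextfileStale` is the same arithmetic as `now - mtime > 3600`. As a rough local equivalent for debugging, the Python sketch below lists `.prom` files older than a cutoff. `stale_textfiles` is a hypothetical helper, and the demo runs against a temporary directory rather than the real textfile directory on indri.

```python
import os
import tempfile
import time
from pathlib import Path


def stale_textfiles(directory, max_age_seconds=3600, now=None):
    """Return {filename: age_seconds} for .prom files older than the cutoff."""
    now = time.time() if now is None else now
    stale = {}
    for path in sorted(Path(directory).glob("*.prom")):
        age = now - path.stat().st_mtime
        if age > max_age_seconds:
            stale[path.name] = age
    return stale


# Demo: one fresh file and one file backdated by two hours.
with tempfile.TemporaryDirectory() as tmp:
    now = time.time()
    fresh = Path(tmp) / "zot.prom"
    old = Path(tmp) / "borgmatic.prom"
    fresh.write_text("# fresh\n")
    old.write_text("# old\n")
    os.utime(old, (now - 7200, now - 7200))
    result = stale_textfiles(tmp, now=now)

print(sorted(result))
```

Pointing `directory` at the path from the runbook would give a quick view of which exporter stopped writing, without waiting for the alert to evaluate.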

diff --git a/argocd/manifests/grafana/alerting.yaml b/argocd/manifests/grafana/alerting.yaml
index 3714bf1..dfcc5a3 100644
--- a/argocd/manifests/grafana/alerting.yaml
+++ b/argocd/manifests/grafana/alerting.yaml
@@ -84,6 +84,121 @@ groups:
                     type: and
               refId: C
 
+  - orgId: 1
+    name: textfile-freshness
+    folder: Infrastructure Alerts
+    interval: 60s
+    rules:
+      - uid: textfile-stale
+        title: TextfileStale
+        condition: C
+        for: 15m
+        noDataState: Alerting
+        execErrState: Alerting
+        annotations:
+          summary: >-
+            Metrics textfile {{ index $labels "file" }} has not been updated in over 1 hour
+          runbook_url: https://docs.eblu.me/how-to/alerts/runbook-textfile-stale
+        labels:
+          severity: warning
+          service: indri-metrics
+        data:
+          - refId: A
+            datasourceUid: prometheus
+            relativeTimeRange:
+              from: 300
+              to: 0
+            model:
+              expr: >-
+                time() - node_textfile_mtime_seconds > 3600
+              interval: ""
+              refId: A
+          - refId: B
+            datasourceUid: "__expr__"
+            relativeTimeRange:
+              from: 0
+              to: 0
+            model:
+              type: reduce
+              expression: A
+              reducer: last
+              settings:
+                mode: dropNN
+              refId: B
+          - refId: C
+            datasourceUid: "__expr__"
+            relativeTimeRange:
+              from: 0
+              to: 0
+            model:
+              type: threshold
+              expression: B
+              conditions:
+                - evaluator:
+                    type: gt
+                    params:
+                      - 0
+                  operator:
+                    type: and
+              refId: C
+
+  - orgId: 1
+    name: frigate-health
+    folder: Infrastructure Alerts
+    interval: 60s
+    rules:
+      - uid: frigate-camera-down
+        title: FrigateCameraDown
+        condition: C
+        for: 5m
+        noDataState: Alerting
+        execErrState: Alerting
+        annotations:
+          summary: >-
+            Frigate camera {{ index $labels "camera_name" }} has 0 FPS
+          runbook_url: https://docs.eblu.me/how-to/alerts/runbook-frigate-camera-down
+        labels:
+          severity: warning
+          service: frigate
+        data:
+          - refId: A
+            datasourceUid: prometheus
+            relativeTimeRange:
+              from: 300
+              to: 0
+            model:
+              expr: frigate_camera_fps
+              interval: ""
+              refId: A
+          - refId: B
+            datasourceUid: "__expr__"
+            relativeTimeRange:
+              from: 0
+              to: 0
+            model:
+              type: reduce
+              expression: A
+              reducer: last
+              settings:
+                mode: dropNN
+              refId: B
+          - refId: C
+            datasourceUid: "__expr__"
+            relativeTimeRange:
+              from: 0
+              to: 0
+            model:
+              type: threshold
+              expression: B
+              conditions:
+                - evaluator:
+                    type: lt
+                    params:
+                      - 1
+                  operator:
+                    type: and
+              refId: C
+
   - orgId: 1
     name: database-health
     folder: Infrastructure Alerts
diff --git a/docs/how-to/alerts/runbook-frigate-camera-down.md b/docs/how-to/alerts/runbook-frigate-camera-down.md
new file mode 100644
index 0000000..ea04e79
--- /dev/null
+++ b/docs/how-to/alerts/runbook-frigate-camera-down.md
@@ -0,0 +1,39 @@
+---
+title: "Runbook: Frigate Camera Down"
+modified: 2026-03-22
+tags:
+  - how-to
+  - alerting
+  - runbook
+---
+
+# Runbook: Frigate Camera Down
+
+**Alert name:** `FrigateCameraDown`
+
+A Frigate camera has reported 0 FPS for 5+ minutes, meaning the camera feed is not being received.
+
+## Diagnostic Steps
+
+1. **Check Frigate UI** — https://nvr.ops.eblu.me — look at the camera thumbnail and status
+2. **Check Frigate API stats**:
+   ```fish
+   curl -s https://nvr.ops.eblu.me/api/stats | python3 -m json.tool
+   ```
+3. **Check Frigate pod logs** on ringtail:
+   ```fish
+   kubectl logs -n frigate -l app=frigate --context=k3s-ringtail --tail=30
+   ```
+4. **Check the camera itself** — verify it's powered on and network-connected. Try accessing the RTSP stream directly.
+
+## Common Causes
+
+- **Camera offline** — power outage, network issue, or camera crash
+- **NFS mount lost** — Frigate storage on sifaka; if the NFS mount drops, recording stops and FPS may drop
+- **Frigate pod restart** — during restart, camera FPS briefly drops to 0
+- **RTSP stream timeout** — camera firmware issue; power cycle the camera
+
+## Related
+
+- [[frigate]] — Frigate NVR reference
+- [[deploy-infra-alerting]] — Alerting pipeline overview
diff --git a/docs/how-to/alerts/runbook-textfile-stale.md b/docs/how-to/alerts/runbook-textfile-stale.md
new file mode 100644
index 0000000..2a70adf
--- /dev/null
+++ b/docs/how-to/alerts/runbook-textfile-stale.md
@@ -0,0 +1,58 @@
+---
+title: "Runbook: Textfile Stale"
+modified: 2026-03-22
+tags:
+  - how-to
+  - alerting
+  - runbook
+---
+
+# Runbook: Textfile Stale
+
+**Alert name:** `TextfileStale`
+
+A Prometheus textfile collector `.prom` file on indri has not been updated for over 1 hour, indicating the metrics exporter script has stopped running.
+
+## Affected Textfiles
+
+| File | LaunchAgent | What it monitors |
+|------|-------------|------------------|
+| `borgmatic.prom` | `mcquack.eblume.borgmatic` | Backup status |
+| `zot.prom` | `mcquack.eblume.zot` | Container registry |
+| `minikube.prom` | `mcquack.minikube-metrics` | Minikube cluster status |
+| `jellyfin.prom` | `mcquack.eblume.jellyfin-metrics` | Media server |
+
+## Diagnostic Steps
+
+1. **Check which file is stale** — the `file` label in the alert tells you. Verify on indri:
+   ```fish
+   ssh indri 'ls -la /opt/homebrew/var/node_exporter/textfile/'
+   ```
+
+2. **Check if the LaunchAgent is running**:
+   ```fish
+   ssh indri 'launchctl list | grep mcquack'
+   ```
+
+3. **Check LaunchAgent logs** (plist defines stdout/stderr paths):
+   ```fish
+   ssh indri 'cat ~/Library/Logs/mcquack/<agent-name>.log'
+   ```
+
+4. **Try running the exporter manually**:
+   ```fish
+   ssh indri 'cat ~/Library/LaunchAgents/mcquack.<agent>.plist'
+   # Find the ProgramArguments, run them manually
+   ```
+
+## Common Causes
+
+- **LaunchAgent not loaded** — `launchctl load ~/Library/LaunchAgents/mcquack.<agent>.plist`
+- **Script error** — the exporter script crashed; check logs
+- **Permissions** — the textfile directory is not writable
+- **Indri reboot** — some LaunchAgents may not auto-start
+
+## Related
+
+- [[alloy]] — Collects textfile metrics via `prometheus.exporter.unix`
+- [[deploy-infra-alerting]] — Alerting pipeline overview
diff --git a/docs/reference/operations/observability.md b/docs/reference/operations/observability.md
index 1aae7b9..9d4a7a0 100644
--- a/docs/reference/operations/observability.md
+++ b/docs/reference/operations/observability.md
@@ -23,3 +23,5 @@ Metrics, logs, traces, and dashboards for BlumeOps infrastructure.
 - [[runbook-service-probe-failure]] - Service health check failure runbook
 - [[runbook-postgres-unhealthy]] - PostgreSQL cluster health runbook
 - [[runbook-pod-not-ready]] - Pod not ready runbook
+- [[runbook-textfile-stale]] - Metrics textfile freshness runbook
+- [[runbook-frigate-camera-down]] - Frigate camera health runbook

From 957ee90fa24c63b859afc2011d410bfc8df449e6 Mon Sep 17 00:00:00 2001
From: Erich Blume <blume.erich@gmail.com>
Date: Sun, 22 Mar 2026 13:45:34 -0700
Subject: [PATCH 12/18] C2(deploy-infra-alerting): impl add ArgoCD scrape and
 sync alert

- Add ArgoCD metrics scrape target to Prometheus (argocd-metrics:8082)
- Add ArgoCDAppOutOfSync alert: fires when argocd_app_info has
  sync_status != Synced for 30 minutes
- Add runbook with diagnostic steps and common fixes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---
 argocd/manifests/grafana/alerting.yaml        | 58 +++++++++++++++++
 argocd/manifests/prometheus/prometheus.yml    |  8 +++
 .../alerts/runbook-argocd-out-of-sync.md      | 65 +++++++++++++++++++
 docs/reference/operations/observability.md    |  1 +
 4 files changed, 132 insertions(+)
 create mode 100644 docs/how-to/alerts/runbook-argocd-out-of-sync.md
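
The runbook's branch-revision check reads `spec.source.targetRevision` from the `argocd app get -o json` output. A slightly fuller Python sketch of the same idea is shown below; it assumes the JSON follows the usual Application layout (`metadata.name`, `spec.source.targetRevision`, `status.sync.status`), and both helper names and the sample app are hypothetical.

```python
def summarize_app(app):
    """Extract name, target revision, and sync status from Application JSON."""
    source = app.get("spec", {}).get("source", {})
    sync = app.get("status", {}).get("sync", {})
    return {
        "name": app.get("metadata", {}).get("name", ""),
        "target_revision": source.get("targetRevision", ""),
        "sync_status": sync.get("status", "Unknown"),
    }


def needs_attention(summary, expected_revision="main"):
    """True when the app is not Synced or tracks an unexpected revision."""
    return (
        summary["sync_status"] != "Synced"
        or summary["target_revision"] != expected_revision
    )


# Hypothetical Application document, for illustration only.
sample_app = {
    "metadata": {"name": "example-app"},
    "spec": {"source": {"targetRevision": "feature/example-branch"}},
    "status": {"sync": {"status": "OutOfSync"}},
}

summary = summarize_app(sample_app)
print(summary, needs_attention(summary))
```

In practice the dictionary would come from parsing the CLI's JSON output with `json.load`; the sample above only stands in for that input.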

diff --git a/argocd/manifests/grafana/alerting.yaml b/argocd/manifests/grafana/alerting.yaml
index dfcc5a3..dcb6762 100644
--- a/argocd/manifests/grafana/alerting.yaml
+++ b/argocd/manifests/grafana/alerting.yaml
@@ -315,6 +315,64 @@ groups:
                     type: and
               refId: C
 
+  - orgId: 1
+    name: argocd-health
+    folder: Infrastructure Alerts
+    interval: 60s
+    rules:
+      - uid: argocd-app-out-of-sync
+        title: ArgoCDAppOutOfSync
+        condition: C
+        for: 30m
+        noDataState: OK
+        execErrState: Alerting
+        annotations:
+          summary: >-
+            ArgoCD app {{ index $labels "name" }} is {{ index $labels "sync_status" }}
+          runbook_url: https://docs.eblu.me/how-to/alerts/runbook-argocd-out-of-sync
+        labels:
+          severity: warning
+          service: argocd
+        data:
+          - refId: A
+            datasourceUid: prometheus
+            relativeTimeRange:
+              from: 300
+              to: 0
+            model:
+              expr: >-
+                argocd_app_info{sync_status!="Synced"}
+              interval: ""
+              refId: A
+          - refId: B
+            datasourceUid: "__expr__"
+            relativeTimeRange:
+              from: 0
+              to: 0
+            model:
+              type: reduce
+              expression: A
+              reducer: last
+              settings:
+                mode: dropNN
+              refId: B
+          - refId: C
+            datasourceUid: "__expr__"
+            relativeTimeRange:
+              from: 0
+              to: 0
+            model:
+              type: threshold
+              expression: B
+              conditions:
+                - evaluator:
+                    type: gt
+                    params:
+                      - 0
+                  operator:
+                    type: and
+              refId: C
+
 templates:
   - orgId: 1
     name: ntfy-infra
diff --git a/argocd/manifests/prometheus/prometheus.yml b/argocd/manifests/prometheus/prometheus.yml
index 2d2dbcf..f96ce12 100644
--- a/argocd/manifests/prometheus/prometheus.yml
+++ b/argocd/manifests/prometheus/prometheus.yml
@@ -80,6 +80,14 @@ scrape_configs:
       - target_label: cluster
         replacement: indri
 
+  # ArgoCD application metrics
+  - job_name: "argocd"
+    static_configs:
+      - targets: ["argocd-metrics.argocd.svc.cluster.local:8082"]
+    metric_relabel_configs:
+      - target_label: cluster
+        replacement: indri
+
   # Frigate NVR metrics (via Caddy on indri — Frigate runs on ringtail)
   - job_name: "frigate"
     scheme: https
diff --git a/docs/how-to/alerts/runbook-argocd-out-of-sync.md b/docs/how-to/alerts/runbook-argocd-out-of-sync.md
new file mode 100644
index 0000000..753b336
--- /dev/null
+++ b/docs/how-to/alerts/runbook-argocd-out-of-sync.md
@@ -0,0 +1,65 @@
+---
+title: "Runbook: ArgoCD App Out of Sync"
+modified: 2026-03-22
+tags:
+  - how-to
+  - alerting
+  - runbook
+---
+
+# Runbook: ArgoCD App Out of Sync
+
+**Alert name:** `ArgoCDAppOutOfSync`
+
+An ArgoCD application has been out of sync for 30+ minutes. This means the live state in Kubernetes differs from what's declared in Git.
+
+## Diagnostic Steps
+
+1. **Check which app is out of sync** — the `name` label in the alert tells you:
+   ```fish
+   argocd app get <app-name>
+   ```
+
+2. **View the diff**:
+   ```fish
+   argocd app diff <app-name>
+   ```
+
+3. **Check if it's a branch revision issue** — during C1/C2 work, apps may be pointed at a feature branch. After merge, they need to be reset to main:
+   ```fish
+   argocd app get <app-name> -o json | python3 -c "import json,sys; print(json.load(sys.stdin)['spec']['source']['targetRevision'])"
+   ```
+
+4. **Check ArgoCD UI** — https://argocd.ops.eblu.me — look for sync errors or degraded status.
+
+## Common Causes
+
+- **Forgot to sync after push** — ArgoCD uses manual sync; changes require explicit `argocd app sync`
+- **Branch revision not reset after PR merge** — app still points at a deleted branch
+- **Kustomize/manifest error** — invalid YAML or unsatisfiable resource requirements
+- **Pruning needed** — old ConfigMaps from `configMapGenerator` need pruning
+
+## Resolution
+
+```fish
+# Simple sync
+argocd app sync <app-name>
+
+# If pruning is needed
+argocd app sync <app-name> --prune
+
+# If stuck on a deleted branch
+argocd app set <app-name> --revision main
+argocd app sync <app-name>
+```
+
+## Silencing
+
+During active C1/C2 development, apps may intentionally be out of sync:
+1. Grafana → Alerting → Silences → Create Silence
+2. Match `alertname = ArgoCDAppOutOfSync` and `name = <app-name>`
+
+## Related
+
+- [[argocd]] — ArgoCD reference
+- [[deploy-infra-alerting]] — Alerting pipeline overview
diff --git a/docs/reference/operations/observability.md b/docs/reference/operations/observability.md
index 9d4a7a0..35136d5 100644
--- a/docs/reference/operations/observability.md
+++ b/docs/reference/operations/observability.md
@@ -25,3 +25,4 @@ Metrics, logs, traces, and dashboards for BlumeOps infrastructure.
 - [[runbook-pod-not-ready]] - Pod not ready runbook
 - [[runbook-textfile-stale]] - Metrics textfile freshness runbook
 - [[runbook-frigate-camera-down]] - Frigate camera health runbook
+- [[runbook-argocd-out-of-sync]] - ArgoCD sync status runbook

From d9ab004479e387ce34ef7a20f2295efe3e9be187 Mon Sep 17 00:00:00 2001
From: Erich Blume <blume.erich@gmail.com>
Date: Sun, 22 Mar 2026 14:05:25 -0700
Subject: [PATCH 13/18] C2(deploy-infra-alerting): impl fix Grafana probe port
 (80 not 3000)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Grafana's k8s Service maps port 80 → container port 3000. The
blackbox probe was targeting port 3000 directly on the Service
ClusterIP, which doesn't work — connection refused.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---
 argocd/manifests/alloy-k8s/config.alloy | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
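
The port fix follows from how a Service exposes ports: clients connect to the Service `port`, and traffic is forwarded to the pod's `targetPort`. A toy Python model of that lookup, using the 80 to 3000 mapping described in the commit message (`resolve_service_port` is a hypothetical helper, not a Kubernetes API):

```python
# Service port table as described in the commit message: 80 -> 3000.
grafana_service_ports = [{"port": 80, "targetPort": 3000}]


def resolve_service_port(ports, requested_port):
    """Return the container port a Service port forwards to, or None."""
    for entry in ports:
        if entry["port"] == requested_port:
            return entry["targetPort"]
    # The Service does not expose this port, so a connection cannot succeed.
    return None


via_port_80 = resolve_service_port(grafana_service_ports, 80)
via_port_3000 = resolve_service_port(grafana_service_ports, 3000)
print(via_port_80, via_port_3000)
```

Port 3000 exists only on the container, so a probe addressed to the Service name has to use port 80.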

diff --git a/argocd/manifests/alloy-k8s/config.alloy b/argocd/manifests/alloy-k8s/config.alloy
index 667f735..a716ddc 100644
--- a/argocd/manifests/alloy-k8s/config.alloy
+++ b/argocd/manifests/alloy-k8s/config.alloy
@@ -184,7 +184,7 @@ prometheus.exporter.blackbox "services" {
 
   target {
     name    = "grafana"
-    address = "http://grafana.monitoring.svc.cluster.local:3000/api/health"
+    address = "http://grafana.monitoring.svc.cluster.local:80/api/health"
     module  = "http_2xx"
   }
 

From da452e2bf579290bca483bb4c0b09c6139b7db9c Mon Sep 17 00:00:00 2001
From: Erich Blume <blume.erich@gmail.com>
Date: Sun, 22 Mar 2026 14:10:05 -0700
Subject: [PATCH 14/18] C2(deploy-infra-alerting): impl fix TextfileStale
 noDataState to OK

Same pattern as PodNotReady: when no textfiles are stale, the query
returns no data. noDataState=Alerting incorrectly treats this as a
problem. Changed to OK.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---
 argocd/manifests/grafana/alerting.yaml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/argocd/manifests/grafana/alerting.yaml b/argocd/manifests/grafana/alerting.yaml
index dcb6762..fb5b197 100644
--- a/argocd/manifests/grafana/alerting.yaml
+++ b/argocd/manifests/grafana/alerting.yaml
@@ -93,7 +93,7 @@ groups:
         title: TextfileStale
         condition: C
         for: 15m
-        noDataState: Alerting
+        noDataState: OK
         execErrState: Alerting
         annotations:
           summary: >-

From cdd85c7ac92947abe535176d6013a73215369d1f Mon Sep 17 00:00:00 2001
From: Erich Blume <blume.erich@gmail.com>
Date: Sun, 22 Mar 2026 14:13:20 -0700
Subject: [PATCH 15/18] C2(deploy-infra-alerting): impl fix TextfileStale to
 always return data
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Query every textfile's age (time() - node_textfile_mtime_seconds) and
apply the > 3600s threshold in the alert condition, instead of
filtering with > 3600 in the query, which returns no series when
everything is fresh.

This means:
- Fresh textfiles: query returns low values, threshold not met → OK
- Stale textfiles: query returns high values, threshold met → Alerting
- Missing textfiles: series vanishes, noDataState=Alerting → Alerting
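
The three cases above can be sketched as follows. This is a
simplified model, not Grafana's evaluator: it collapses all series
into a single state, and the function names and sample ages are
assumptions made for illustration. It contrasts the old shape
(filter in the query) with the new one (threshold in the condition).

```python
from typing import Dict

STALE_AFTER_SECONDS = 3600  # threshold used by the rule in this commit


def evaluate_filtered(ages: Dict[str, float], no_data_state: str) -> str:
    """Old shape: the query keeps only stale series, so 'all fresh' looks like no data."""
    stale = {name: age for name, age in ages.items() if age > STALE_AFTER_SECONDS}
    if not stale:
        return no_data_state
    return "Alerting"


def evaluate_thresholded(ages: Dict[str, float], no_data_state: str = "Alerting") -> str:
    """New shape: the query returns every age; the condition applies the threshold."""
    if not ages:
        # Series vanished entirely, e.g. the textfile is missing.
        return no_data_state
    if any(age > STALE_AFTER_SECONDS for age in ages.values()):
        return "Alerting"
    return "OK"


# Hypothetical ages in seconds, keyed by textfile name.
fresh = {"borgmatic.prom": 120.0, "zot.prom": 300.0}
stale = {"borgmatic.prom": 120.0, "zot.prom": 7200.0}
missing: Dict[str, float] = {}
```

With the filtered query, noDataState has to pick between a false
alarm on fresh files (Alerting) and silence on missing files (OK).
With the thresholded condition, "no data" only happens when the
series is really gone, so Alerting is the safe choice again.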

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---
 argocd/manifests/grafana/alerting.yaml | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/argocd/manifests/grafana/alerting.yaml b/argocd/manifests/grafana/alerting.yaml
index fb5b197..c0f0496 100644
--- a/argocd/manifests/grafana/alerting.yaml
+++ b/argocd/manifests/grafana/alerting.yaml
@@ -93,7 +93,7 @@ groups:
         title: TextfileStale
         condition: C
         for: 15m
-        noDataState: OK
+        noDataState: Alerting
         execErrState: Alerting
         annotations:
           summary: >-
@@ -110,7 +110,7 @@ groups:
               to: 0
             model:
               expr: >-
-                time() - node_textfile_mtime_seconds > 3600
+                time() - node_textfile_mtime_seconds
               interval: ""
               refId: A
           - refId: B
@@ -137,7 +137,7 @@ groups:
                 - evaluator:
                     type: gt
                     params:
-                      - 0
+                      - 3600
                   operator:
                     type: and
               refId: C

From 52eed44542de0f73efee759b6b609014a3d869ed Mon Sep 17 00:00:00 2001
From: Erich Blume <blume.erich@gmail.com>
Date: Sun, 22 Mar 2026 14:21:42 -0700
Subject: [PATCH 16/18] C2(deploy-infra-alerting): impl refactor services-check
 to query alerts

Replace covered checks with Grafana alerting API queries:
- ServiceProbeFailure: 11 HTTP endpoints
- TextfileStale: metrics textfile freshness
- FrigateCameraDown: camera FPS
- PodNotReady: pod readiness (both clusters)
- PostgresClusterUnhealthy: database health
- ArgoCDAppOutOfSync: ArgoCD sync status

Uncovered checks remain as direct probes (SSH, launchctl, public
endpoints, k8s API, frigate storage, some HTTP endpoints).

Firing or pending alerts display their summary and runbook link.
Grafana credentials are fetched from 1Password; if the credentials
or the alerting API are unavailable, covered checks report NO DATA
instead of failing.
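
As a rough standalone sketch of the matching that check_alert does,
the function below filters a Prometheus-style alerts payload by
alertname and an optional label. The payload shape
(data.alerts[].labels/annotations/state) mirrors what the script
reads; firing_alerts and SAMPLE are hypothetical, and the sample
values are made up for illustration.

```python
import json
from typing import List, Optional, Tuple


def firing_alerts(payload: str, alertname: str,
                  label: Optional[Tuple[str, str]] = None) -> List[Tuple[str, str]]:
    """Return (summary, runbook_url) for each matching alert that is Alerting or Pending."""
    alerts = json.loads(payload).get("data", {}).get("alerts", [])
    matches = []
    for alert in alerts:
        labels = alert.get("labels", {})
        if labels.get("alertname") != alertname:
            continue
        # Optional extra filter, e.g. ("service", "grafana").
        if label is not None and labels.get(label[0]) != label[1]:
            continue
        if alert.get("state") in ("Alerting", "Pending"):
            annotations = alert.get("annotations", {})
            matches.append((annotations.get("summary", ""),
                            annotations.get("runbook_url", "")))
    return matches


# Hypothetical payload in the shape the script expects.
SAMPLE = json.dumps({"data": {"alerts": [
    {"labels": {"alertname": "ServiceProbeFailure", "service": "grafana"},
     "annotations": {
         "summary": "grafana health check is failing",
         "runbook_url": "https://docs.eblu.me/how-to/runbooks/runbook-service-probe-failure"},
     "state": "Alerting"},
    {"labels": {"alertname": "ServiceProbeFailure", "service": "loki"},
     "annotations": {}, "state": "Normal"},
]}})

grafana_hits = firing_alerts(SAMPLE, "ServiceProbeFailure", ("service", "grafana"))
loki_hits = firing_alerts(SAMPLE, "ServiceProbeFailure", ("service", "loki"))
```

Passing the alert name and label as arguments, rather than
interpolating them into the Python source, avoids quoting problems
if a value ever contains a quote character.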

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---
 mise-tasks/services-check | 159 +++++++++++++++++++++++++++-----------
 1 file changed, 114 insertions(+), 45 deletions(-)

diff --git a/mise-tasks/services-check b/mise-tasks/services-check
index 94ced03..9ba2c8e 100755
--- a/mise-tasks/services-check
+++ b/mise-tasks/services-check
@@ -6,6 +6,7 @@ set -euo pipefail
 # Colors for output
 RED='\033[0;31m'
 GREEN='\033[0;32m'
+YELLOW='\033[0;33m'
 NC='\033[0m' # No Color
 
 FAILED=0
@@ -36,11 +37,88 @@ check_http() {
     fi
 }
 
+# ============== Grafana Alerting API ==============
+
+GRAFANA_URL="https://grafana.ops.eblu.me"
+GRAFANA_CREDS=""
+
+fetch_alerts() {
+    if [ -z "$GRAFANA_CREDS" ]; then
+        local pass
+        pass=$(op read 'op://vg6xf6vvfmoh5hqjjhlhbeoaie/oxkcr3xtxnewy7noep2izvyr6y/password' 2>/dev/null) || true
+        if [ -n "$pass" ]; then
+            GRAFANA_CREDS=$(printf '%s' "admin:$pass" | base64 | tr -d '\n')
+        fi
+    fi
+
+    if [ -z "$GRAFANA_CREDS" ]; then
+        echo ""
+        return
+    fi
+
+    curl -sf --max-time 10 \
+        -H "Authorization: Basic $GRAFANA_CREDS" \
+        "$GRAFANA_URL/api/prometheus/grafana/api/v1/alerts" 2>/dev/null || echo ""
+}
+
+# Fetch all alerts once
+ALERTS_JSON=$(fetch_alerts)
+
+check_alert() {
+    local name="$1"
+    local alertname="$2"
+    # Optional: filter by a label key=value
+    local filter_key="${3:-}"
+    local filter_value="${4:-}"
+
+    printf "%-24s " "$name..."
+
+    if [ -z "$ALERTS_JSON" ]; then
+        echo -e "${YELLOW}NO DATA${NC} (can't reach Grafana alerting API)"
+        return
+    fi
+
+    local firing
+    firing=$(echo "$ALERTS_JSON" | python3 -c "
+import json, sys
+try:
+    data = json.load(sys.stdin)
+except ValueError:
+    sys.exit(1)
+alerts = data.get('data', {}).get('alerts', [])
+for a in alerts:
+    if a['labels'].get('alertname') != '$alertname':
+        continue
+    if '$filter_key' and a['labels'].get('$filter_key') != '$filter_value':
+        continue
+    if a['state'] in ('Alerting', 'Pending'):
+        url = a.get('annotations', {}).get('runbook_url', '')
+        summary = a.get('annotations', {}).get('summary', '')
+        print(f'{summary}|{url}')
+" 2>/dev/null)
+
+    if [ -z "$firing" ]; then
+        echo -e "${GREEN}OK${NC}"
+    else
+        local summary runbook
+        summary=$(echo "$firing" | head -1 | cut -d'|' -f1)
+        runbook=$(echo "$firing" | head -1 | cut -d'|' -f2)
+        echo -e "${RED}FIRING${NC}"
+        if [ -n "$summary" ]; then
+            echo -e "                           $summary"
+        fi
+        if [ -n "$runbook" ]; then
+            echo -e "                           Runbook: $runbook"
+        fi
+        FAILED=1
+    fi
+}
+
 echo "Checking services..."
 echo "===================="
 echo ""
 
-# Local services on indri
+# Local services on indri (not yet covered by alerting)
 echo "Local services on indri:"
 check_service "forgejo (brew)" "ssh indri 'brew services list | grep forgejo | grep started'"
 check_service "alloy" "ssh indri 'launchctl list mcquack.eblume.alloy | grep -v \"^-\"'"
@@ -52,43 +130,47 @@ check_service "minikube-metrics" "ssh indri 'launchctl list mcquack.minikube-met
 check_service "jellyfin-metrics" "ssh indri 'launchctl list mcquack.eblume.jellyfin-metrics | grep -v \"^-\"'"
 
 echo ""
-echo "Metrics textfiles:"
-check_service "borgmatic.prom" "ssh indri 'test -f /opt/homebrew/var/node_exporter/textfile/borgmatic.prom'"
-check_service "zot.prom" "ssh indri 'test -f /opt/homebrew/var/node_exporter/textfile/zot.prom'"
-check_service "minikube.prom" "ssh indri 'test -f /opt/homebrew/var/node_exporter/textfile/minikube.prom'"
-check_service "jellyfin.prom" "ssh indri 'test -f /opt/homebrew/var/node_exporter/textfile/jellyfin.prom'"
+echo "Metrics textfiles (via alerting):"
+check_alert "textfile-freshness" "TextfileStale"
 
 echo ""
-echo "Kubernetes cluster:"
+echo "Kubernetes cluster (not yet covered by alerting):"
 check_service "minikube" "ssh indri 'minikube status --format={{.Host}} | grep -q Running'"
 check_service "k8s-apiserver (indri)" "ssh indri 'kubectl get --raw /healthz'"
 check_service "k8s-apiserver (remote)" "kubectl --kubeconfig=$HOME/.kube/minikube-indri/config.yml --context=minikube-indri get --raw /healthz"
 
 echo ""
-echo "HTTP endpoints (via Caddy):"
-check_http "Prometheus" "https://prometheus.ops.eblu.me/-/healthy"
-check_http "Loki" "https://loki.ops.eblu.me/ready"
-check_http "Grafana" "https://grafana.ops.eblu.me/api/health"
-check_http "ArgoCD" "https://argocd.ops.eblu.me/healthz"
+echo "HTTP endpoints (via alerting):"
+check_alert "Prometheus" "ServiceProbeFailure" "service" "prometheus"
+check_alert "Loki" "ServiceProbeFailure" "service" "loki"
+check_alert "Grafana" "ServiceProbeFailure" "service" "grafana"
+check_alert "ArgoCD" "ServiceProbeFailure" "service" "argocd"
+check_alert "Kiwix" "ServiceProbeFailure" "service" "kiwix"
+check_alert "Miniflux" "ServiceProbeFailure" "service" "miniflux"
+check_alert "TeslaMate" "ServiceProbeFailure" "service" "teslamate"
+check_alert "Devpi" "ServiceProbeFailure" "service" "devpi"
+check_alert "Transmission" "ServiceProbeFailure" "service" "transmission"
+check_alert "Immich" "ServiceProbeFailure" "service" "immich"
+check_alert "Navidrome" "ServiceProbeFailure" "service" "navidrome"
+
+echo ""
+echo "HTTP endpoints (not yet covered by alerting):"
 check_http "Forgejo" "https://forge.eblu.me/"
 check_http "Zot Registry" "https://registry.ops.eblu.me/v2/_catalog"
-check_http "Kiwix" "https://kiwix.ops.eblu.me/"
-check_http "Miniflux" "https://feed.ops.eblu.me/healthcheck"
-check_http "TeslaMate" "https://tesla.ops.eblu.me/"
-check_http "Devpi" "https://pypi.ops.eblu.me/+api"
-check_http "Transmission" "https://torrent.ops.eblu.me/"
-check_http "Immich" "https://photos.ops.eblu.me/"
-check_http "Navidrome" "https://dj.ops.eblu.me/"
 check_http "CV" "https://cv.ops.eblu.me/"
 check_http "Ntfy" "https://ntfy.ops.eblu.me/v1/health"
 check_http "Authentik" "https://authentik.ops.eblu.me/-/health/live/"
 check_http "Frigate" "https://nvr.ops.eblu.me/api/version"
-check_service "frigate-camera-fps" "curl -sf --max-time 5 https://nvr.ops.eblu.me/api/stats | jq -e '.cameras | to_entries | all(.value.camera_fps > 0)'"
-check_service "frigate-storage" "curl -sf --max-time 5 https://nvr.ops.eblu.me/api/stats | jq -e '.service.storage | to_entries | map(select(.key | startswith(\"/media\"))) | length > 0 and all(.[]; .value.free > 0)'"
 check_http "JobSync" "https://jobsync.ops.eblu.me/"
 
 echo ""
-echo "Ringtail (NixOS):"
+echo "Frigate (via alerting):"
+check_alert "camera-fps" "FrigateCameraDown"
+echo "Frigate (not yet covered by alerting):"
+check_service "frigate-storage" "curl -sf --max-time 5 https://nvr.ops.eblu.me/api/stats | jq -e '.service.storage | to_entries | map(select(.key | startswith(\"/media\"))) | length > 0 and all(.[]; .value.free > 0)'"
+
+echo ""
+echo "Ringtail (not yet covered by alerting):"
 check_service "ssh" "ssh -o ConnectTimeout=5 ringtail true"
 check_service "tailscale" "ssh ringtail 'tailscale status --self --json' | jq -e '.Self.Online' > /dev/null"
 check_service "k3s" "ssh ringtail 'KUBECONFIG=/etc/rancher/k3s/k3s.yaml k3s kubectl get nodes --no-headers | grep -q Ready'"
@@ -96,43 +178,30 @@ check_service "k3s-apiserver (remote)" "kubectl --context=k3s-ringtail get --raw
 check_service "forgejo-runner" "ssh ringtail 'systemctl is-active gitea-runner-nix_container_builder.service'"
 
 echo ""
-echo "Ringtail k3s pods:"
-check_service "ntfy" "kubectl --context=k3s-ringtail -n ntfy get pods -l app=ntfy -o jsonpath='{.items[0].status.phase}' | grep -q Running"
-check_service "authentik" "kubectl --context=k3s-ringtail -n authentik get pods -l component=server -o jsonpath='{.items[0].status.phase}' | grep -q Running"
-check_service "frigate" "kubectl --context=k3s-ringtail -n frigate get pods -l app=frigate -o jsonpath='{.items[0].status.phase}' | grep -q Running"
-check_service "frigate-notify" "kubectl --context=k3s-ringtail -n frigate get pods -l app=frigate-notify -o jsonpath='{.items[0].status.phase}' | grep -q Running"
-check_service "nvidia-device-plugin" "kubectl --context=k3s-ringtail -n nvidia-device-plugin get pods -l app=nvidia-device-plugin -o jsonpath='{.items[0].status.phase}' | grep -q Running"
-check_service "jobsync" "kubectl --context=k3s-ringtail -n jobsync get pods -l app=jobsync -o jsonpath='{.items[0].status.phase}' | grep -q Running"
+echo "Pod health (via alerting):"
+check_alert "pod-readiness" "PodNotReady"
 
 echo ""
-echo "Public services (via Fly.io):"
+echo "Database (via alerting):"
+check_alert "PostgreSQL" "PostgresClusterUnhealthy"
+
+echo ""
+echo "Public services (not yet covered by alerting):"
 check_http "Docs (public)" "https://docs.eblu.me/"
 check_http "CV (public)" "https://cv.eblu.me/"
 check_http "Forge (public)" "https://forge.eblu.me/"
 check_http "Fly.io healthz" "https://blumeops-proxy.fly.dev/healthz"
 
 echo ""
-echo "Database:"
-check_service "PostgreSQL (k8s)" "pg_isready -h pg.ops.eblu.me -p 5432"
-
-echo ""
-echo "Indri minikube pods:"
-check_service "prometheus-0" "kubectl --context=minikube-indri -n monitoring get pod prometheus-0 -o jsonpath='{.status.phase}' | grep -q Running"
-check_service "loki-0" "kubectl --context=minikube-indri -n monitoring get pod loki-0 -o jsonpath='{.status.phase}' | grep -q Running"
-check_service "grafana" "kubectl --context=minikube-indri -n monitoring get pods -l app.kubernetes.io/name=grafana -o jsonpath='{.items[0].status.phase}' | grep -q Running"
-check_service "miniflux" "kubectl --context=minikube-indri -n miniflux get pods -l app=miniflux -o jsonpath='{.items[0].status.phase}' | grep -q Running"
-check_service "teslamate" "kubectl --context=minikube-indri -n teslamate get pods -l app=teslamate -o jsonpath='{.items[0].status.phase}' | grep -q Running"
-check_service "blumeops-pg" "kubectl --context=minikube-indri -n databases get pods -l cnpg.io/cluster=blumeops-pg -o jsonpath='{.items[0].status.phase}' | grep -q Running"
-
-echo ""
-echo "ArgoCD app sync status:"
+echo "ArgoCD app sync status (via alerting):"
+check_alert "argocd-sync" "ArgoCDAppOutOfSync"
+# Keep the detailed table as a summary view
 printf "%-20s %-12s %-12s %s\n" "NAME" "SYNC" "HEALTH" "TARGET"
 while read -r name sync health target; do
     if [[ "$sync" == "Synced" ]]; then
         printf "%-20s ${GREEN}%-12s${NC} %-12s %s\n" "$name" "$sync" "$health" "$target"
     elif [[ "$sync" == "OutOfSync" ]]; then
         printf "%-20s ${RED}%-12s${NC} %-12s %s\n" "$name" "$sync" "$health" "$target"
-        FAILED=1
     else
         printf "%-20s %-12s %-12s %s\n" "$name" "$sync" "$health" "$target"
     fi

From 2e2a33d7ca30e38a380638f93dffe39a4a6743d1 Mon Sep 17 00:00:00 2001
From: Erich Blume <blume.erich@gmail.com>
Date: Sun, 22 Mar 2026 14:23:42 -0700
Subject: [PATCH 17/18] C2(deploy-infra-alerting): close
 port-services-check-alerts

6 alert rules covering services-check probes:
- ServiceProbeFailure (11 HTTP probes via Alloy blackbox)
- PodNotReady (kube-state-metrics, both clusters)
- PostgresClusterUnhealthy (CNPG collector)
- TextfileStale (node_textfile_mtime_seconds)
- FrigateCameraDown (frigate_camera_fps)
- ArgoCDAppOutOfSync (argocd_app_info)

6 runbooks in docs/how-to/alerts/, one per rule.

Not ported to dedicated rules: local indri services (brew/launchctl),
ringtail SSH/tailscale, public Fly.io endpoints, k8s API health,
frigate storage. A failure in any of these is expected to surface
through the downstream alerts above, and services-check still probes
them directly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---
 docs/how-to/alerts/port-services-check-alerts.md | 1 -
 1 file changed, 1 deletion(-)

diff --git a/docs/how-to/alerts/port-services-check-alerts.md b/docs/how-to/alerts/port-services-check-alerts.md
index 807c340..c2ea6ad 100644
--- a/docs/how-to/alerts/port-services-check-alerts.md
+++ b/docs/how-to/alerts/port-services-check-alerts.md
@@ -1,7 +1,6 @@
 ---
 title: Port services-check Alerts to Grafana
 modified: 2026-03-22
-status: active
 requires:
   - first-alert-and-runbook
 tags:

From 67883950c31aec6a0abaea43c9223d6102ef9723 Mon Sep 17 00:00:00 2001
From: Erich Blume <blume.erich@gmail.com>
Date: Sun, 22 Mar 2026 14:24:47 -0700
Subject: [PATCH 18/18] C2(deploy-infra-alerting): finalize rewrite cards as
 historical docs

Remove all Mikado frontmatter (status, branch, requires) from chain
cards. Rename docs/how-to/alerts/ to docs/how-to/runbooks/ and update
all runbook_url references. Add changelog fragment.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---
 argocd/manifests/grafana/alerting.yaml               | 12 ++++++------
 .../mikado-deploy-infra-alerting.feature.md          |  1 +
 .../configure-grafana-alerting-pipeline.md           |  0
 .../{alerts => runbooks}/deploy-infra-alerting.md    |  6 +-----
 .../{alerts => runbooks}/first-alert-and-runbook.md  |  6 ++----
 .../port-services-check-alerts.md                    |  6 ++----
 .../refactor-services-check-to-query-alerts.md       |  3 ---
 .../runbook-argocd-out-of-sync.md                    |  0
 .../runbook-frigate-camera-down.md                   |  0
 .../{alerts => runbooks}/runbook-pod-not-ready.md    |  0
 .../runbook-postgres-unhealthy.md                    |  0
 .../runbook-service-probe-failure.md                 |  0
 .../{alerts => runbooks}/runbook-textfile-stale.md   |  0
 13 files changed, 12 insertions(+), 22 deletions(-)
 create mode 100644 docs/changelog.d/mikado-deploy-infra-alerting.feature.md
 rename docs/how-to/{alerts => runbooks}/configure-grafana-alerting-pipeline.md (100%)
 rename docs/how-to/{alerts => runbooks}/deploy-infra-alerting.md (94%)
 rename docs/how-to/{alerts => runbooks}/first-alert-and-runbook.md (91%)
 rename docs/how-to/{alerts => runbooks}/port-services-check-alerts.md (94%)
 rename docs/how-to/{alerts => runbooks}/refactor-services-check-to-query-alerts.md (97%)
 rename docs/how-to/{alerts => runbooks}/runbook-argocd-out-of-sync.md (100%)
 rename docs/how-to/{alerts => runbooks}/runbook-frigate-camera-down.md (100%)
 rename docs/how-to/{alerts => runbooks}/runbook-pod-not-ready.md (100%)
 rename docs/how-to/{alerts => runbooks}/runbook-postgres-unhealthy.md (100%)
 rename docs/how-to/{alerts => runbooks}/runbook-service-probe-failure.md (100%)
 rename docs/how-to/{alerts => runbooks}/runbook-textfile-stale.md (100%)

diff --git a/argocd/manifests/grafana/alerting.yaml b/argocd/manifests/grafana/alerting.yaml
index c0f0496..abc4c0f 100644
--- a/argocd/manifests/grafana/alerting.yaml
+++ b/argocd/manifests/grafana/alerting.yaml
@@ -40,7 +40,7 @@ groups:
         annotations:
           summary: >-
             {{ index $labels "service" }} health check is failing
-          runbook_url: https://docs.eblu.me/how-to/alerts/runbook-service-probe-failure
+          runbook_url: https://docs.eblu.me/how-to/runbooks/runbook-service-probe-failure
         labels:
           severity: warning
         data:
@@ -98,7 +98,7 @@ groups:
         annotations:
           summary: >-
             Metrics textfile {{ index $labels "file" }} has not been updated in over 1 hour
-          runbook_url: https://docs.eblu.me/how-to/alerts/runbook-textfile-stale
+          runbook_url: https://docs.eblu.me/how-to/runbooks/runbook-textfile-stale
         labels:
           severity: warning
           service: indri-metrics
@@ -156,7 +156,7 @@ groups:
         annotations:
           summary: >-
             Frigate camera {{ index $labels "camera_name" }} has 0 FPS
-          runbook_url: https://docs.eblu.me/how-to/alerts/runbook-frigate-camera-down
+          runbook_url: https://docs.eblu.me/how-to/runbooks/runbook-frigate-camera-down
         labels:
           severity: warning
           service: frigate
@@ -213,7 +213,7 @@ groups:
         annotations:
           summary: >-
             PostgreSQL cluster {{ index $labels "cluster" }} is unhealthy
-          runbook_url: https://docs.eblu.me/how-to/alerts/runbook-postgres-unhealthy
+          runbook_url: https://docs.eblu.me/how-to/runbooks/runbook-postgres-unhealthy
         labels:
           severity: critical
           service: postgresql
@@ -270,7 +270,7 @@ groups:
         annotations:
           summary: >-
             Pod {{ index $labels "pod" }} in {{ index $labels "namespace" }} is not ready
-          runbook_url: https://docs.eblu.me/how-to/alerts/runbook-pod-not-ready
+          runbook_url: https://docs.eblu.me/how-to/runbooks/runbook-pod-not-ready
         labels:
           severity: warning
         data:
@@ -329,7 +329,7 @@ groups:
         annotations:
           summary: >-
             ArgoCD app {{ index $labels "name" }} is {{ index $labels "sync_status" }}
-          runbook_url: https://docs.eblu.me/how-to/alerts/runbook-argocd-out-of-sync
+          runbook_url: https://docs.eblu.me/how-to/runbooks/runbook-argocd-out-of-sync
         labels:
           severity: warning
           service: argocd
diff --git a/docs/changelog.d/mikado-deploy-infra-alerting.feature.md b/docs/changelog.d/mikado-deploy-infra-alerting.feature.md
new file mode 100644
index 0000000..7106014
--- /dev/null
+++ b/docs/changelog.d/mikado-deploy-infra-alerting.feature.md
@@ -0,0 +1 @@
+Deploy infrastructure alerting pipeline using Grafana Unified Alerting with ntfy push notifications. 6 alert rules with runbooks covering service health, pod readiness, PostgreSQL, textfile freshness, Frigate cameras, and ArgoCD sync status. services-check now queries the alerting API for covered checks.
diff --git a/docs/how-to/alerts/configure-grafana-alerting-pipeline.md b/docs/how-to/runbooks/configure-grafana-alerting-pipeline.md
similarity index 100%
rename from docs/how-to/alerts/configure-grafana-alerting-pipeline.md
rename to docs/how-to/runbooks/configure-grafana-alerting-pipeline.md
diff --git a/docs/how-to/alerts/deploy-infra-alerting.md b/docs/how-to/runbooks/deploy-infra-alerting.md
similarity index 94%
rename from docs/how-to/alerts/deploy-infra-alerting.md
rename to docs/how-to/runbooks/deploy-infra-alerting.md
index 7c2e7f0..e02523d 100644
--- a/docs/how-to/alerts/deploy-infra-alerting.md
+++ b/docs/how-to/runbooks/deploy-infra-alerting.md
@@ -1,10 +1,6 @@
 ---
 title: Deploy Infrastructure Alerting Pipeline
 modified: 2026-03-22
-status: active
-branch: mikado/deploy-infra-alerting
-requires:
-  - refactor-services-check-to-query-alerts
 tags:
   - how-to
   - alerting
@@ -35,7 +31,7 @@ Loki (logs) ──────────┘          │
 | **Alert engine** | Grafana Unified Alerting | Already deployed, no new service needed |
 | **Notification** | ntfy webhook contact point | Already deployed on ringtail, iOS app works |
 | **Anti-noise** | 24h repeat interval | Page once per day max per alert group |
-| **Runbooks** | `docs/how-to/alerts/<name>.md` | Clickable link in every notification |
+| **Runbooks** | `docs/how-to/runbooks/<name>.md` | Clickable link in every notification |
 | **Provisioning** | Grafana provisioning YAML (GitOps) | Alerts defined in repo, not just UI |
 | **Topic** | `infra-alerts` (separate from `frigate-alerts`) | Different severity/audience |
 
diff --git a/docs/how-to/alerts/first-alert-and-runbook.md b/docs/how-to/runbooks/first-alert-and-runbook.md
similarity index 91%
rename from docs/how-to/alerts/first-alert-and-runbook.md
rename to docs/how-to/runbooks/first-alert-and-runbook.md
index 71b86bf..6ce13bf 100644
--- a/docs/how-to/alerts/first-alert-and-runbook.md
+++ b/docs/how-to/runbooks/first-alert-and-runbook.md
@@ -1,8 +1,6 @@
 ---
 title: First Alert and Runbook
 modified: 2026-03-22
-requires:
-  - configure-grafana-alerting-pipeline
 tags:
   - how-to
   - alerting
@@ -32,7 +30,7 @@ Provision via YAML in the alerting provisioning ConfigMap. The rule should:
 
 ### 3. Create the Runbook
 
-Write `docs/how-to/alerts/runbook-service-probe-failure.md` as a how-to doc explaining:
+Write `docs/how-to/runbooks/runbook-service-probe-failure.md` as a how-to doc explaining:
 - What the alert means
 - How to check which service is down
 - Common causes and resolution steps
@@ -52,7 +50,7 @@ Write `docs/how-to/alerts/runbook-service-probe-failure.md` as a how-to doc expl
 
 - Grafana alert rules can be provisioned as YAML files alongside contact points and notification policies
 - The blackbox probe metrics from Alloy use the job name `blackbox` and include an `instance` label with the service name
-- The runbook URL format: `https://docs.eblu.me/how-to/alerts/runbook-service-probe-failure`
+- The runbook URL format: `https://docs.eblu.me/how-to/runbooks/runbook-service-probe-failure`
 
 ## Verification
 
diff --git a/docs/how-to/alerts/port-services-check-alerts.md b/docs/how-to/runbooks/port-services-check-alerts.md
similarity index 94%
rename from docs/how-to/alerts/port-services-check-alerts.md
rename to docs/how-to/runbooks/port-services-check-alerts.md
index c2ea6ad..4420f58 100644
--- a/docs/how-to/alerts/port-services-check-alerts.md
+++ b/docs/how-to/runbooks/port-services-check-alerts.md
@@ -1,8 +1,6 @@
 ---
 title: Port services-check Alerts to Grafana
 modified: 2026-03-22
-requires:
-  - first-alert-and-runbook
 tags:
   - how-to
   - alerting
@@ -43,7 +41,7 @@ For each check category, create provisioned Grafana alert rules. Group related c
 
 ### 4. Create Runbooks
 
-One runbook per alert type in `docs/how-to/alerts/runbook-<name>.md`. Each runbook should cover:
+One runbook per alert type in `docs/how-to/runbooks/runbook-<name>.md`. Each runbook should cover:
 - What the alert means
 - Diagnostic steps
 - Common fixes
@@ -65,7 +63,7 @@ As each check is ported, remove it from the services-check script (or mark it as
 - [ ] All HTTP endpoint checks from services-check have corresponding alert rules
 - [ ] Pod health checks have corresponding alert rules
 - [ ] PostgreSQL health has a corresponding alert rule
-- [ ] Each alert rule has a runbook doc in `docs/how-to/alerts/`
+- [ ] Each alert rule has a runbook doc in `docs/how-to/runbooks/`
 - [ ] Test at least 2-3 failure scenarios end-to-end
 - [ ] services-check script has been updated to reflect ported checks
 
diff --git a/docs/how-to/alerts/refactor-services-check-to-query-alerts.md b/docs/how-to/runbooks/refactor-services-check-to-query-alerts.md
similarity index 97%
rename from docs/how-to/alerts/refactor-services-check-to-query-alerts.md
rename to docs/how-to/runbooks/refactor-services-check-to-query-alerts.md
index 640bcff..244be1f 100644
--- a/docs/how-to/alerts/refactor-services-check-to-query-alerts.md
+++ b/docs/how-to/runbooks/refactor-services-check-to-query-alerts.md
@@ -1,9 +1,6 @@
 ---
 title: Refactor services-check to Query Alerts
 modified: 2026-03-22
-status: active
-requires:
-  - port-services-check-alerts
 tags:
   - how-to
   - alerting
diff --git a/docs/how-to/alerts/runbook-argocd-out-of-sync.md b/docs/how-to/runbooks/runbook-argocd-out-of-sync.md
similarity index 100%
rename from docs/how-to/alerts/runbook-argocd-out-of-sync.md
rename to docs/how-to/runbooks/runbook-argocd-out-of-sync.md
diff --git a/docs/how-to/alerts/runbook-frigate-camera-down.md b/docs/how-to/runbooks/runbook-frigate-camera-down.md
similarity index 100%
rename from docs/how-to/alerts/runbook-frigate-camera-down.md
rename to docs/how-to/runbooks/runbook-frigate-camera-down.md
diff --git a/docs/how-to/alerts/runbook-pod-not-ready.md b/docs/how-to/runbooks/runbook-pod-not-ready.md
similarity index 100%
rename from docs/how-to/alerts/runbook-pod-not-ready.md
rename to docs/how-to/runbooks/runbook-pod-not-ready.md
diff --git a/docs/how-to/alerts/runbook-postgres-unhealthy.md b/docs/how-to/runbooks/runbook-postgres-unhealthy.md
similarity index 100%
rename from docs/how-to/alerts/runbook-postgres-unhealthy.md
rename to docs/how-to/runbooks/runbook-postgres-unhealthy.md
diff --git a/docs/how-to/alerts/runbook-service-probe-failure.md b/docs/how-to/runbooks/runbook-service-probe-failure.md
similarity index 100%
rename from docs/how-to/alerts/runbook-service-probe-failure.md
rename to docs/how-to/runbooks/runbook-service-probe-failure.md
diff --git a/docs/how-to/alerts/runbook-textfile-stale.md b/docs/how-to/runbooks/runbook-textfile-stale.md
similarity index 100%
rename from docs/how-to/alerts/runbook-textfile-stale.md
rename to docs/how-to/runbooks/runbook-textfile-stale.md