From 1d5990a2f704370351b7dcb7e8e188a398be782a Mon Sep 17 00:00:00 2001 From: Erich Blume Date: Sun, 22 Mar 2026 10:28:31 -0700 Subject: [PATCH 01/18] C2(deploy-infra-alerting): plan add alerting pipeline cards MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Mikado chain for deploying Grafana Unified Alerting with ntfy notifications, replacing manual services-check probes. Chain: configure-grafana-alerting-pipeline → first-alert-and-runbook → port-services-check-alerts → refactor-services-check-to-query-alerts → deploy-infra-alerting (goal) Co-Authored-By: Claude Opus 4.6 (1M context) --- .../configure-grafana-alerting-pipeline.md | 60 ++++++++++++++ docs/how-to/alerts/deploy-infra-alerting.md | 81 +++++++++++++++++++ docs/how-to/alerts/first-alert-and-runbook.md | 70 ++++++++++++++++ .../alerts/port-services-check-alerts.md | 77 ++++++++++++++++++ ...refactor-services-check-to-query-alerts.md | 56 +++++++++++++ 5 files changed, 344 insertions(+) create mode 100644 docs/how-to/alerts/configure-grafana-alerting-pipeline.md create mode 100644 docs/how-to/alerts/deploy-infra-alerting.md create mode 100644 docs/how-to/alerts/first-alert-and-runbook.md create mode 100644 docs/how-to/alerts/port-services-check-alerts.md create mode 100644 docs/how-to/alerts/refactor-services-check-to-query-alerts.md diff --git a/docs/how-to/alerts/configure-grafana-alerting-pipeline.md b/docs/how-to/alerts/configure-grafana-alerting-pipeline.md new file mode 100644 index 0000000..2c6999b --- /dev/null +++ b/docs/how-to/alerts/configure-grafana-alerting-pipeline.md @@ -0,0 +1,60 @@ +--- +title: Configure Grafana Alerting Pipeline +modified: 2026-03-22 +status: active +tags: + - how-to + - alerting + - grafana +--- + +# Configure Grafana Alerting Pipeline + +Enable Grafana Unified Alerting, create an ntfy webhook contact point, configure the notification policy with anti-noise settings, and set up a message template with runbook links. + +## What to Do + +### 1. Enable Unified Alerting in grafana.ini + +Add the `[unified_alerting]` section to the Grafana ConfigMap. Grafana 11+ has unified alerting enabled by default, but we should be explicit and configure the evaluation interval. + +### 2. Create Alerting Provisioning Files + +Grafana supports provisioning alert resources via YAML files in `/etc/grafana/provisioning/alerting/`. Create: + +- **Contact point** — ntfy webhook targeting `http://ntfy.ntfy.svc.cluster.local:80/infra-alerts` (cluster-internal, since Grafana and ntfy are on different clusters, use `ntfy.ops.eblu.me` via Caddy instead) +- **Notification policy** — root policy with `group_wait: 1m`, `group_interval: 12h`, `repeat_interval: 24h`, grouped by `alertname` and `service` +- **Message template** — format that includes alert name, summary, and a clickable runbook URL as an ntfy action button + +### 3. Mount Provisioning into Grafana + +Add the alerting provisioning ConfigMap to the Grafana deployment, mounted at `/etc/grafana/provisioning/alerting/`. + +### 4. Create the `infra-alerts` Topic + +ntfy topics are created on first publish — no explicit setup needed. But verify that the topic works by sending a test notification. + +### 5. Verify End-to-End + +- Grafana UI shows the ntfy contact point under Alerting → Contact Points +- Notification policy shows the anti-noise settings +- Test notification from Grafana reaches the ntfy iOS app + +## Key Details + +- Grafana runs on minikube (indri), ntfy runs on k3s (ringtail). The contact point URL must go through Caddy: `https://ntfy.ops.eblu.me/infra-alerts` +- ntfy action buttons use the `X-Actions` header or JSON body format: `view, Open Runbook, ` +- Grafana provisioning files are applied on startup and cannot be edited from the UI (which is what we want for GitOps) + +## Verification + +- [ ] Grafana starts with unified alerting enabled +- [ ] Contact point `ntfy-infra` visible in Grafana UI +- [ ] Notification policy shows correct group/repeat intervals +- [ ] Test notification arrives on iOS via ntfy app +- [ ] Test notification includes a clickable runbook link + +## Related + +- [[deploy-infra-alerting]] — Parent goal +- [[first-alert-and-runbook]] — Next: create the first real alert diff --git a/docs/how-to/alerts/deploy-infra-alerting.md b/docs/how-to/alerts/deploy-infra-alerting.md new file mode 100644 index 0000000..7c2e7f0 --- /dev/null +++ b/docs/how-to/alerts/deploy-infra-alerting.md @@ -0,0 +1,81 @@ +--- +title: Deploy Infrastructure Alerting Pipeline +modified: 2026-03-22 +status: active +branch: mikado/deploy-infra-alerting +requires: + - refactor-services-check-to-query-alerts +tags: + - how-to + - alerting + - observability +--- + +# Deploy Infrastructure Alerting Pipeline + +Replace the manual `mise run services-check` approach with Grafana Unified Alerting backed by ntfy push notifications, so infrastructure problems page once and include actionable runbook links. + +## Architecture + +``` +Prometheus (metrics) ──┐ + ├──▶ Grafana Alert Rules ──▶ ntfy webhook ──▶ iOS push +Loki (logs) ──────────┘ │ + │ + Notification Policy + (group_wait: 1m, + group_interval: 12h, + repeat_interval: 24h) +``` + +## Design Decisions + +| Decision | Choice | Rationale | +|----------|--------|-----------| +| **Alert engine** | Grafana Unified Alerting | Already deployed, no new service needed | +| **Notification** | ntfy webhook contact point | Already deployed on ringtail, iOS app works | +| **Anti-noise** | 24h repeat interval | Page once per day max per alert group | +| **Runbooks** | `docs/how-to/alerts/.md` | Clickable link in every notification | +| **Provisioning** | Grafana provisioning YAML (GitOps) | Alerts defined in repo, not just UI | +| **Topic** | `infra-alerts` (separate from `frigate-alerts`) | Different severity/audience | + +## Alerting Policy + +- Each alert fires **once** and does not re-notify for 24 hours +- A "resolved" notification is sent when the condition clears +- Every alert annotation includes `runbook_url` linking to its how-to doc +- The ntfy message template renders the runbook URL as a clickable action button +- Alerts are grouped by service to avoid notification storms + +## Migration Path + +1. Stand up the pipeline: Grafana alerting config, ntfy contact point, notification policy, message template +2. Create the first alert + runbook as proof of concept (e.g., a blackbox probe failure) +3. Port services-check health checks to Grafana alert rules, one by one, each with a runbook +4. Refactor services-check to query the Grafana alerting API instead of doing its own probes + +## What services-check Covers Today + +These checks will be migrated to alert rules: + +| Category | Checks | Data Source | +|----------|--------|-------------| +| Local services (indri) | forgejo, alloy, borgmatic, zot via brew/launchctl | Need new probes or textfile metrics | +| Metrics textfiles | freshness of `.prom` files | Existing node_textfile metrics | +| K8s cluster health | minikube API, k3s API | kube-state-metrics | +| HTTP endpoints | ~12 services via Caddy | Alloy blackbox exporter (already exists) | +| Ringtail | SSH, tailscale, k3s health | Need new probes | +| K3s pods | ntfy, authentik, frigate, etc. | kube-state-metrics on ringtail | +| Public services | docs, cv, forge via Fly.io | Alloy on Fly.io or external probe | +| PostgreSQL | CNPG readiness | CNPG metrics (already scraped) | +| ArgoCD sync | app sync/health status | ArgoCD metrics or API | + +## Related + +- [[configure-grafana-alerting-pipeline]] — Foundation: contact point, policy, template +- [[first-alert-and-runbook]] — Proof of concept alert +- [[port-services-check-alerts]] — Systematic migration +- [[refactor-services-check-to-query-alerts]] — Final integration +- [[observability]] — Current observability stack +- [[ntfy]] — Push notification service +- [[grafana]] — Dashboard and alerting platform diff --git a/docs/how-to/alerts/first-alert-and-runbook.md b/docs/how-to/alerts/first-alert-and-runbook.md new file mode 100644 index 0000000..bec5aaa --- /dev/null +++ b/docs/how-to/alerts/first-alert-and-runbook.md @@ -0,0 +1,70 @@ +--- +title: First Alert and Runbook +modified: 2026-03-22 +status: active +requires: + - configure-grafana-alerting-pipeline +tags: + - how-to + - alerting +--- + +# First Alert and Runbook + +Create one end-to-end alert as proof of concept — an alert rule that fires, delivers a notification to ntfy with a runbook link, and has a corresponding runbook doc. + +## What to Do + +### 1. Choose the First Alert + +The best candidate is a **blackbox probe failure** because: +- Alloy's blackbox exporter already probes 5 services (miniflux, kiwix, transmission, devpi, argocd) at 30s intervals +- The metric `probe_success` is already in Prometheus +- It maps directly to what services-check does (HTTP health checks) +- A single alert rule with a `service` label can cover all probed services + +### 2. Create the Alert Rule + +Provision via YAML in the alerting provisioning ConfigMap. The rule should: +- Query `probe_success == 0` from Prometheus +- Fire after the condition persists for 2 minutes (avoid flapping) +- Include labels: `severity: warning`, `service: {{ $labels.instance }}` +- Include annotations: `summary`, `runbook_url` pointing to the runbook doc + +### 3. Create the Runbook + +Write `docs/how-to/alerts/runbook-service-probe-failure.md` as a how-to doc explaining: +- What the alert means +- How to check which service is down +- Common causes and resolution steps +- How to silence the alert if the downtime is planned + +### 4. Verify End-to-End + +- Stop one of the probed services (e.g., scale miniflux to 0) +- Wait for the alert to fire (~2 minutes) +- Confirm ntfy notification arrives with correct summary and runbook link +- Click the runbook link and verify it reaches docs.eblu.me +- Scale the service back up +- Confirm "resolved" notification arrives +- Confirm no repeat notification during the 24h window + +## Key Details + +- Grafana alert rules can be provisioned as YAML files alongside contact points and notification policies +- The blackbox probe metrics from Alloy use the job name `blackbox` and include an `instance` label with the service name +- The runbook URL format: `https://docs.eblu.me/how-to/alerts/runbook-service-probe-failure` + +## Verification + +- [ ] Alert rule appears in Grafana UI under Alerting → Alert Rules +- [ ] Simulated failure triggers ntfy notification within ~3 minutes +- [ ] Notification includes service name, summary, and clickable runbook link +- [ ] Resolution triggers a "resolved" notification +- [ ] No repeat notification within 24h window + +## Related + +- [[configure-grafana-alerting-pipeline]] — Prerequisite: pipeline must be working +- [[deploy-infra-alerting]] — Parent goal +- [[port-services-check-alerts]] — Next: port remaining checks diff --git a/docs/how-to/alerts/port-services-check-alerts.md b/docs/how-to/alerts/port-services-check-alerts.md new file mode 100644 index 0000000..807c340 --- /dev/null +++ b/docs/how-to/alerts/port-services-check-alerts.md @@ -0,0 +1,77 @@ +--- +title: Port services-check Alerts to Grafana +modified: 2026-03-22 +status: active +requires: + - first-alert-and-runbook +tags: + - how-to + - alerting +--- + +# Port services-check Alerts to Grafana + +Systematically migrate the health checks from `mise run services-check` to Grafana alert rules, each with a corresponding runbook. After this card, the alerting system covers everything services-check does today. + +## What to Do + +### 1. Inventory and Prioritize + +Map each services-check probe to a data source and alert rule. Some checks already have metrics in Prometheus; others need new instrumentation. + +**Already have metrics (easy):** +- HTTP endpoint probes → Alloy blackbox exporter (`probe_success`) +- PostgreSQL health → CNPG metrics (`cnpg_pg_replication_streaming`, `cnpg_collector_up`) +- K8s pod health → kube-state-metrics (`kube_pod_status_phase`) +- ArgoCD sync status → ArgoCD metrics (`argocd_app_info` with sync/health labels) + +**Need new probes or metrics:** +- Local indri services (forgejo, alloy, borgmatic, zot via brew/launchctl) → Alloy host textfile or new probes +- Metrics textfile freshness → `node_textfile_mtime_seconds` (already collected by Alloy on indri) +- Ringtail SSH/tailscale health → Alloy blackbox on ringtail or cross-cluster probe +- Public services (docs, cv, forge via Fly.io) → Alloy on Fly.io or Grafana synthetic monitoring + +### 2. Add Missing Probes + +Extend Alloy configurations where needed: +- **Alloy on indri:** Add blackbox targets for forgejo, zot (local HTTP endpoints) +- **Alloy on ringtail:** Add blackbox targets for ringtail-local services +- **Consider:** Whether public endpoint probing belongs in Fly.io Alloy or a separate prober + +### 3. Create Alert Rules + +For each check category, create provisioned Grafana alert rules. Group related checks into alert rule groups (e.g., "indri-services", "k8s-health", "public-endpoints"). + +### 4. Create Runbooks + +One runbook per alert type in `docs/how-to/alerts/runbook-.md`. Each runbook should cover: +- What the alert means +- Diagnostic steps +- Common fixes +- How to silence for planned maintenance + +### 5. Remove from services-check + +As each check is ported, remove it from the services-check script (or mark it as "now handled by alerting"). The goal is that services-check shrinks as alerting grows. + +## Key Details + +- Don't try to port everything in one session — this card may span multiple work cycles within the C2 chain +- Prioritize checks that have caught real problems in the past +- Some checks (like ArgoCD sync status table) may remain in services-check as a human-readable summary even after alerting covers the failure cases +- The Alloy blackbox exporter on k8s already covers 5 services; extending it to more is straightforward + +## Verification + +- [ ] All HTTP endpoint checks from services-check have corresponding alert rules +- [ ] Pod health checks have corresponding alert rules +- [ ] PostgreSQL health has a corresponding alert rule +- [ ] Each alert rule has a runbook doc in `docs/how-to/alerts/` +- [ ] Test at least 2-3 failure scenarios end-to-end +- [ ] services-check script has been updated to reflect ported checks + +## Related + +- [[first-alert-and-runbook]] — Prerequisite: established the pattern +- [[deploy-infra-alerting]] — Parent goal +- [[refactor-services-check-to-query-alerts]] — Next: make services-check query alerts diff --git a/docs/how-to/alerts/refactor-services-check-to-query-alerts.md b/docs/how-to/alerts/refactor-services-check-to-query-alerts.md new file mode 100644 index 0000000..640bcff --- /dev/null +++ b/docs/how-to/alerts/refactor-services-check-to-query-alerts.md @@ -0,0 +1,56 @@ +--- +title: Refactor services-check to Query Alerts +modified: 2026-03-22 +status: active +requires: + - port-services-check-alerts +tags: + - how-to + - alerting +--- + +# Refactor services-check to Query Alerts + +Change `mise run services-check` from doing its own health probes to querying the Grafana alerting API for currently firing alerts. The script becomes a CLI view into the same alerting system that sends ntfy notifications. + +## What to Do + +### 1. Query the Grafana Alerting API + +Grafana exposes alert state via: +- `GET /api/v1/provisioning/alert-rules` — all configured rules +- `GET /api/prometheus/grafana/api/v1/alerts` — currently firing alerts (Prometheus-compatible format) + +The second endpoint is simpler — it returns only active alerts with labels and annotations, similar to Alertmanager's `/api/v1/alerts`. + +### 2. Rewrite services-check + +The new services-check should: +1. Query the Grafana alerting API for firing alerts +2. Display them in a table with service name, alert name, duration, and runbook link +3. If no alerts are firing, print a green "all clear" message +4. Exit 0 if no alerts, exit 1 if any are firing +5. Optionally keep a few checks that don't map to alerting (e.g., the ArgoCD sync status table as a summary view) + +### 3. Handle Authentication + +services-check will need a Grafana API token or service account token. Options: +- Use the existing Grafana admin credentials from 1Password (`op read`) +- Create a dedicated read-only service account in Grafana + +### 4. Preserve the ArgoCD Summary + +The ArgoCD sync/health table in services-check is a useful quick view even when nothing is alerting. Consider keeping it as a separate section that always displays, independent of the alert query. + +## Verification + +- [ ] `mise run services-check` queries Grafana instead of doing direct probes +- [ ] Firing alerts are displayed with service name, alert name, and runbook link +- [ ] Exit code reflects alert state (0 = clear, 1 = firing) +- [ ] Works when Grafana is unreachable (graceful error, not a crash) +- [ ] ArgoCD summary table still works + +## Related + +- [[port-services-check-alerts]] — Prerequisite: alerts must exist to query +- [[deploy-infra-alerting]] — Parent goal From 261f20601a289a12e8df680dc75324fbe4ed9d4a Mon Sep 17 00:00:00 2001 From: Erich Blume Date: Sun, 22 Mar 2026 10:35:36 -0700 Subject: [PATCH 02/18] C2(deploy-infra-alerting): impl configure grafana alerting pipeline - Enable unified alerting in grafana.ini - Create alerting.yaml provisioning file with: - ntfy-infra webhook contact point (POST to ntfy.ops.eblu.me/infra-alerts) - Notification policy: group_wait 1m, group_interval 12h, repeat_interval 24h - Message templates for title and runbook links - Mount alerting provisioning into Grafana deployment - Add alerting.yaml to kustomization configMapGenerator Co-Authored-By: Claude Opus 4.6 (1M context) --- argocd/manifests/grafana/alerting.yaml | 42 +++++++++++++++++++++ argocd/manifests/grafana/deployment.yaml | 3 ++ argocd/manifests/grafana/grafana.ini | 5 +++ argocd/manifests/grafana/kustomization.yaml | 1 + 4 files changed, 51 insertions(+) create mode 100644 argocd/manifests/grafana/alerting.yaml diff --git a/argocd/manifests/grafana/alerting.yaml b/argocd/manifests/grafana/alerting.yaml new file mode 100644 index 0000000..3ac33b0 --- /dev/null +++ b/argocd/manifests/grafana/alerting.yaml @@ -0,0 +1,42 @@ +apiVersion: 1 + +contactPoints: + - orgId: 1 + name: ntfy-infra + receivers: + - uid: ntfy-infra-webhook + type: webhook + settings: + url: https://ntfy.ops.eblu.me/infra-alerts + httpMethod: POST + title: >- + {{ template "ntfy-infra.title" . }} + message: >- + {{ template "ntfy-infra.message" . }} + maxAlerts: "0" + disableResolveMessage: false + +policies: + - orgId: 1 + receiver: ntfy-infra + group_by: + - alertname + - service + group_wait: 1m + group_interval: 12h + repeat_interval: 24h + +templates: + - orgId: 1 + name: ntfy-infra + template: | + {{ define "ntfy-infra.title" -}} + [{{ .Status | toUpper }}] {{ .CommonLabels.alertname }} + {{- end }} + + {{ define "ntfy-infra.message" -}} + {{ range .Alerts -}} + {{ .Annotations.summary }} + {{ if .Annotations.runbook_url }}Runbook: {{ .Annotations.runbook_url }}{{ end }} + {{ end -}} + {{- end }} diff --git a/argocd/manifests/grafana/deployment.yaml b/argocd/manifests/grafana/deployment.yaml index 61a2f88..5fbb8eb 100644 --- a/argocd/manifests/grafana/deployment.yaml +++ b/argocd/manifests/grafana/deployment.yaml @@ -277,6 +277,9 @@ spec: - name: config mountPath: /etc/grafana/provisioning/datasources/datasources.yaml subPath: datasources.yaml + - name: config + mountPath: /etc/grafana/provisioning/alerting/alerting.yaml + subPath: alerting.yaml - name: storage mountPath: /var/lib/grafana - name: sc-dashboard-volume diff --git a/argocd/manifests/grafana/grafana.ini b/argocd/manifests/grafana/grafana.ini index 61cdd7e..a0a6db8 100644 --- a/argocd/manifests/grafana/grafana.ini +++ b/argocd/manifests/grafana/grafana.ini @@ -30,3 +30,8 @@ allow_embedding = false [server] root_url = https://grafana.ops.eblu.me + +[unified_alerting] +enabled = true +evaluation_timeout = 30s +min_interval = 10s diff --git a/argocd/manifests/grafana/kustomization.yaml b/argocd/manifests/grafana/kustomization.yaml index c052bf9..3aeaa26 100644 --- a/argocd/manifests/grafana/kustomization.yaml +++ b/argocd/manifests/grafana/kustomization.yaml @@ -25,6 +25,7 @@ configMapGenerator: files: - grafana.ini - datasources.yaml + - alerting.yaml options: labels: app.kubernetes.io/name: grafana From c1acc808d5d943cc33fc94c82ab130b7bc360970 Mon Sep 17 00:00:00 2001 From: Erich Blume Date: Sun, 22 Mar 2026 10:51:02 -0700 Subject: [PATCH 03/18] C2(deploy-infra-alerting): close configure-grafana-alerting-pipeline Pipeline verified: - Grafana unified alerting enabled with provisioned contact point and policy - ntfy webhook contact point delivering to infra-alerts topic - Notification policy: group_wait 1m, group_interval 12h, repeat_interval 24h - iOS push notifications confirmed working Co-Authored-By: Claude Opus 4.6 (1M context) --- docs/how-to/alerts/configure-grafana-alerting-pipeline.md | 1 - 1 file changed, 1 deletion(-) diff --git a/docs/how-to/alerts/configure-grafana-alerting-pipeline.md b/docs/how-to/alerts/configure-grafana-alerting-pipeline.md index 2c6999b..eb90128 100644 --- a/docs/how-to/alerts/configure-grafana-alerting-pipeline.md +++ b/docs/how-to/alerts/configure-grafana-alerting-pipeline.md @@ -1,7 +1,6 @@ --- title: Configure Grafana Alerting Pipeline modified: 2026-03-22 -status: active tags: - how-to - alerting From 549c57ab82127aaa1e6ee4b6db57f5e8bf504355 Mon Sep 17 00:00:00 2001 From: Erich Blume Date: Sun, 22 Mar 2026 10:57:23 -0700 Subject: [PATCH 04/18] C2(deploy-infra-alerting): impl add first alert rule and runbook - Add ServiceProbeFailure alert rule to Grafana alerting provisioning - Queries probe_success metric from Alloy blackbox exporter - Extracts service name from job label via label_replace - Fires after 2 minutes of failure, noDataState=Alerting - Annotations include summary with service name and runbook URL - Add runbook at docs/how-to/alerts/runbook-service-probe-failure.md - Covers all 5 probed services (miniflux, kiwix, transmission, devpi, argocd) - Diagnostic steps, common causes, silencing instructions - Add alerting section to observability.md reference doc Co-Authored-By: Claude Opus 4.6 (1M context) --- argocd/manifests/grafana/alerting.yaml | 49 ++++++++++++ .../alerts/runbook-service-probe-failure.md | 75 +++++++++++++++++++ docs/reference/operations/observability.md | 7 +- 3 files changed, 130 insertions(+), 1 deletion(-) create mode 100644 docs/how-to/alerts/runbook-service-probe-failure.md diff --git a/argocd/manifests/grafana/alerting.yaml b/argocd/manifests/grafana/alerting.yaml index 3ac33b0..3fe4b1c 100644 --- a/argocd/manifests/grafana/alerting.yaml +++ b/argocd/manifests/grafana/alerting.yaml @@ -26,6 +26,55 @@ policies: group_interval: 12h repeat_interval: 24h +groups: + - orgId: 1 + name: service-health + folder: Infrastructure Alerts + interval: 30s + rules: + - uid: service-probe-failure + title: ServiceProbeFailure + condition: B + for: 2m + noDataState: Alerting + execErrState: Alerting + annotations: + summary: >- + {{ index $labels "service" }} health check is failing + runbook_url: https://docs.eblu.me/how-to/alerts/runbook-service-probe-failure + labels: + severity: warning + data: + - refId: A + datasourceUid: prometheus + relativeTimeRange: + from: 300 + to: 0 + model: + expr: >- + label_replace(probe_success, "service", + "$1", "job", "integrations/blackbox/(.*)") + interval: "" + refId: A + - refId: B + datasourceUid: "__expr__" + relativeTimeRange: + from: 0 + to: 0 + model: + type: threshold + expression: A + conditions: + - evaluator: + type: lt + params: + - 1 + operator: + type: and + reducer: + type: last + refId: B + templates: - orgId: 1 name: ntfy-infra diff --git a/docs/how-to/alerts/runbook-service-probe-failure.md b/docs/how-to/alerts/runbook-service-probe-failure.md new file mode 100644 index 0000000..575606e --- /dev/null +++ b/docs/how-to/alerts/runbook-service-probe-failure.md @@ -0,0 +1,75 @@ +--- +title: "Runbook: Service Probe Failure" +modified: 2026-03-22 +tags: + - how-to + - alerting + - runbook +--- + +# Runbook: Service Probe Failure + +**Alert name:** `ServiceProbeFailure` + +A blackbox HTTP health check has failed for 2+ minutes, meaning a service is not responding to its health endpoint. + +## Affected Services + +This alert covers services probed by the Alloy blackbox exporter on indri's minikube cluster: + +| Service | Health Endpoint | +|---------|----------------| +| miniflux | `/healthcheck` | +| kiwix | `/` | +| transmission | `/transmission/web/` | +| devpi | `/+api` | +| argocd | `/healthz` | + +The failing service is identified by the `service` label in the alert (extracted from the `job` label). + +## Diagnostic Steps + +1. **Check which service is down** — the alert label `service` tells you. You can also run: + ```fish + kubectl get pods -n --context=minikube-indri + ``` + +2. **Check pod status** — look for CrashLoopBackOff, OOMKilled, or pending pods: + ```fish + kubectl describe pod -n --context=minikube-indri + ``` + +3. **Check pod logs**: + ```fish + kubectl logs -n --context=minikube-indri --tail=50 + ``` + +4. **Check if minikube itself is healthy**: + ```fish + ssh indri 'minikube status' + ``` + +5. **Check NFS mounts** (kiwix, transmission depend on sifaka NFS): + ```fish + ssh indri 'df -h | grep Volumes' + ``` + +## Common Causes + +- **Pod crashed** — check logs, restart with `kubectl delete pod` +- **NFS mount lost** — sifaka offline or AutoMounter not running. SSH to indri and check `/Volumes/` +- **Resource exhaustion** — check `kubectl top pods -n ` for memory/CPU pressure +- **Minikube paused/stopped** — `ssh indri 'minikube status'`, restart if needed + +## Silencing + +For planned maintenance, silence this alert in Grafana: +1. Go to Alerting → Silences → Create Silence +2. Match label `alertname = ServiceProbeFailure` +3. Optionally match `service = ` to silence only one +4. Set duration for your maintenance window + +## Related + +- [[deploy-infra-alerting]] — Alerting pipeline overview +- [[configure-grafana-alerting-pipeline]] — Pipeline configuration diff --git a/docs/reference/operations/observability.md b/docs/reference/operations/observability.md index 5890147..852f5d3 100644 --- a/docs/reference/operations/observability.md +++ b/docs/reference/operations/observability.md @@ -1,6 +1,6 @@ --- title: Observability -modified: 2026-02-07 +modified: 2026-03-22 tags: - operations --- @@ -16,3 +16,8 @@ Metrics, logs, traces, and dashboards for BlumeOps infrastructure. - [[tempo]] - Distributed tracing - [[alloy|Alloy]] - Metrics, log, and trace collection - [[grafana]] - Dashboards and visualization + +## Alerting + +- [[deploy-infra-alerting]] - Alerting pipeline (Grafana Unified Alerting → ntfy) +- [[runbook-service-probe-failure]] - Service health check failure runbook From 94413f73ba4c74d05d789076569d9a7db71b25a8 Mon Sep 17 00:00:00 2001 From: Erich Blume Date: Sun, 22 Mar 2026 11:00:21 -0700 Subject: [PATCH 05/18] C2(deploy-infra-alerting): impl fix alert rule multi-series evaluation MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Add reduce step between Prometheus query and threshold to preserve per-service labels. Without it, Grafana can't distinguish the 5 probe_success series and errors with "duplicate results with labels {}". Chain: A (prometheus query) → B (reduce last) → C (threshold < 1) Co-Authored-By: Claude Opus 4.6 (1M context) --- argocd/manifests/grafana/alerting.yaml | 20 +++++++++++++++----- 1 file changed, 15 insertions(+), 5 deletions(-) diff --git a/argocd/manifests/grafana/alerting.yaml b/argocd/manifests/grafana/alerting.yaml index 3fe4b1c..a05ddee 100644 --- a/argocd/manifests/grafana/alerting.yaml +++ b/argocd/manifests/grafana/alerting.yaml @@ -34,7 +34,7 @@ groups: rules: - uid: service-probe-failure title: ServiceProbeFailure - condition: B + condition: C for: 2m noDataState: Alerting execErrState: Alerting @@ -62,8 +62,20 @@ groups: from: 0 to: 0 model: - type: threshold + type: reduce expression: A + reducer: last + settings: + mode: dropNN + refId: B + - refId: C + datasourceUid: "__expr__" + relativeTimeRange: + from: 0 + to: 0 + model: + type: threshold + expression: B conditions: - evaluator: type: lt @@ -71,9 +83,7 @@ groups: - 1 operator: type: and - reducer: - type: last - refId: B + refId: C templates: - orgId: 1 From 4c0bd0055f3855c19544e1f96296020d97651f23 Mon Sep 17 00:00:00 2001 From: Erich Blume Date: Sun, 22 Mar 2026 11:59:26 -0700 Subject: [PATCH 06/18] C2(deploy-infra-alerting): impl use custom payload for ntfy-native JSON Use the correct provisioning field name for Grafana webhook custom payloads: settings.payload.template (not payloadTemplate). Found by reading the Go source (grafana/alerting receivers/webhook/v1/config.go): Payload *CustomPayload `json:"payload,omitempty"` CustomPayload.Template string `json:"template,omitempty"` The template uses coll.Dict, coll.Append, and data.ToJSON to produce ntfy-native JSON with topic, title, message, priority, and action buttons linking to runbooks. Co-Authored-By: Claude Opus 4.6 (1M context) --- argocd/manifests/grafana/alerting.yaml | 31 ++++++++++++++------------ 1 file changed, 17 insertions(+), 14 deletions(-) diff --git a/argocd/manifests/grafana/alerting.yaml b/argocd/manifests/grafana/alerting.yaml index a05ddee..a190039 100644 --- a/argocd/manifests/grafana/alerting.yaml +++ b/argocd/manifests/grafana/alerting.yaml @@ -7,13 +7,12 @@ contactPoints: - uid: ntfy-infra-webhook type: webhook settings: - url: https://ntfy.ops.eblu.me/infra-alerts + url: https://ntfy.ops.eblu.me httpMethod: POST - title: >- - {{ template "ntfy-infra.title" . }} - message: >- - {{ template "ntfy-infra.message" . }} maxAlerts: "0" + payload: + template: >- + {{ template "ntfy-infra.payload" . }} disableResolveMessage: false policies: @@ -89,13 +88,17 @@ templates: - orgId: 1 name: ntfy-infra template: | - {{ define "ntfy-infra.title" -}} - [{{ .Status | toUpper }}] {{ .CommonLabels.alertname }} - {{- end }} - - {{ define "ntfy-infra.message" -}} - {{ range .Alerts -}} - {{ .Annotations.summary }} - {{ if .Annotations.runbook_url }}Runbook: {{ .Annotations.runbook_url }}{{ end }} - {{ end -}} + {{ define "ntfy-infra.payload" -}} + {{- $msg := "" -}} + {{- range .Alerts -}} + {{- $msg = (printf "%s%s\n" $msg .Annotations.summary) -}} + {{- end -}} + {{- $title := (printf "[%s] %s" (.Status | toUpper) .CommonLabels.alertname) -}} + {{- $actions := coll.Slice -}} + {{- range .Alerts -}} + {{- if .Annotations.runbook_url -}} + {{- $actions = coll.Append (coll.Dict "action" "view" "label" "Open Runbook" "url" .Annotations.runbook_url) $actions -}} + {{- end -}} + {{- end -}} + {{- coll.Dict "topic" "infra-alerts" "title" $title "message" $msg "priority" 3 "actions" $actions | data.ToJSON -}} {{- end }} From e33b0bc184b93a5acac68a2b877613229ae6d26b Mon Sep 17 00:00:00 2001 From: Erich Blume Date: Sun, 22 Mar 2026 12:05:43 -0700 Subject: [PATCH 07/18] C2(deploy-infra-alerting): close first-alert-and-runbook End-to-end alerting pipeline verified: - ServiceProbeFailure alert rule evaluates 5 blackbox probes - Grafana custom payload produces ntfy-native JSON (topic, title, message, priority, actions) - Firing notification arrives on iOS with clean formatting - "Open Runbook" action button links to docs.eblu.me runbook - Resolved notification delivered on recovery Key discovery: Grafana webhook custom payload provisioning field is settings.payload.template (nested object), not payloadTemplate (flat string). Found by reading grafana/alerting Go source. Co-Authored-By: Claude Opus 4.6 (1M context) --- docs/how-to/alerts/first-alert-and-runbook.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/how-to/alerts/first-alert-and-runbook.md b/docs/how-to/alerts/first-alert-and-runbook.md index bec5aaa..71b86bf 100644 --- a/docs/how-to/alerts/first-alert-and-runbook.md +++ b/docs/how-to/alerts/first-alert-and-runbook.md @@ -1,7 +1,6 @@ --- title: First Alert and Runbook modified: 2026-03-22 -status: active requires: - configure-grafana-alerting-pipeline tags: @@ -68,3 +67,4 @@ Write `docs/how-to/alerts/runbook-service-probe-failure.md` as a how-to doc expl - [[configure-grafana-alerting-pipeline]] — Prerequisite: pipeline must be working - [[deploy-infra-alerting]] — Parent goal - [[port-services-check-alerts]] — Next: port remaining checks +- [[runbook-service-probe-failure]] — The runbook created for this alert From 8e6a803076cddd89d8d5e0867e3319415800a80f Mon Sep 17 00:00:00 2001 From: Erich Blume Date: Sun, 22 Mar 2026 12:11:12 -0700 Subject: [PATCH 08/18] C2(deploy-infra-alerting): impl add probes and alert rules for services-check coverage Extend Alloy blackbox probes: - Add prometheus, loki, grafana, teslamate, immich, navidrome - Now probing 11 services (was 5), covering most HTTP checks from services-check Add alert rules: - PostgresClusterUnhealthy: cnpg_collector_up < 1 for 3m (critical) - PodNotReady: kube_pod_status_ready{condition="true"} == 0 for 5m Add runbooks: - runbook-postgres-unhealthy.md - runbook-pod-not-ready.md Co-Authored-By: Claude Opus 4.6 (1M context) --- argocd/manifests/alloy-k8s/config.alloy | 37 ++++++ argocd/manifests/grafana/alerting.yaml | 114 ++++++++++++++++++ docs/how-to/alerts/runbook-pod-not-ready.md | 55 +++++++++ .../alerts/runbook-postgres-unhealthy.md | 63 ++++++++++ docs/reference/operations/observability.md | 2 + 5 files changed, 271 insertions(+) create mode 100644 docs/how-to/alerts/runbook-pod-not-ready.md create mode 100644 docs/how-to/alerts/runbook-postgres-unhealthy.md diff --git a/argocd/manifests/alloy-k8s/config.alloy b/argocd/manifests/alloy-k8s/config.alloy index c169c93..667f735 100644 --- a/argocd/manifests/alloy-k8s/config.alloy +++ b/argocd/manifests/alloy-k8s/config.alloy @@ -169,6 +169,43 @@ prometheus.exporter.blackbox "services" { address = "http://argocd-server.argocd.svc.cluster.local:80/healthz" module = "http_2xx" } + + target { + name = "prometheus" + address = "http://prometheus.monitoring.svc.cluster.local:9090/-/healthy" + module = "http_2xx" + } + + target { + name = "loki" + address = "http://loki.monitoring.svc.cluster.local:3100/ready" + module = "http_2xx" + } + + target { + name = "grafana" + address = "http://grafana.monitoring.svc.cluster.local:3000/api/health" + module = "http_2xx" + } + + target { + name = "teslamate" + address = "http://teslamate.teslamate.svc.cluster.local:4000/" + module = "http_2xx" + } + + target { + name = "immich" + address = "http://immich-server.immich.svc.cluster.local:2283/api/server/ping" + module = "http_2xx" + } + + target { + name = "navidrome" + address = "http://navidrome.navidrome.svc.cluster.local:4533/" + module = "http_2xx" + } + } // Scrape blackbox probe results diff --git a/argocd/manifests/grafana/alerting.yaml b/argocd/manifests/grafana/alerting.yaml index a190039..47f2ec6 100644 --- a/argocd/manifests/grafana/alerting.yaml +++ b/argocd/manifests/grafana/alerting.yaml @@ -84,6 +84,120 @@ groups: type: and refId: C + - orgId: 1 + name: database-health + folder: Infrastructure Alerts + interval: 60s + rules: + - uid: postgres-cluster-unhealthy + title: PostgresClusterUnhealthy + condition: C + for: 3m + noDataState: Alerting + execErrState: Alerting + annotations: + summary: >- + PostgreSQL cluster {{ index $labels "cluster" }} is unhealthy + runbook_url: https://docs.eblu.me/how-to/alerts/runbook-postgres-unhealthy + labels: + severity: critical + service: postgresql + data: + - refId: A + datasourceUid: prometheus + relativeTimeRange: + from: 300 + to: 0 + model: + expr: cnpg_collector_up + interval: "" + refId: A + - refId: B + datasourceUid: "__expr__" + relativeTimeRange: + from: 0 + to: 0 + model: + type: reduce + expression: A + reducer: last + settings: + mode: dropNN + refId: B + - refId: C + datasourceUid: "__expr__" + relativeTimeRange: + from: 0 + to: 0 + model: + type: threshold + expression: B + conditions: + - evaluator: + type: lt + params: + - 1 + operator: + type: and + refId: C + + - orgId: 1 + name: pod-health + folder: Infrastructure Alerts + interval: 60s + rules: + - uid: pod-not-ready + title: PodNotReady + condition: C + for: 5m + noDataState: NoData + execErrState: Alerting + annotations: + summary: >- + Pod {{ index $labels "pod" }} in {{ index $labels "namespace" }} is not ready + runbook_url: https://docs.eblu.me/how-to/alerts/runbook-pod-not-ready + labels: + severity: warning + data: + - refId: A + datasourceUid: prometheus + relativeTimeRange: + from: 300 + to: 0 + model: + expr: >- + kube_pod_status_ready{condition="true"} == 0 + interval: "" + refId: A + - refId: B + datasourceUid: "__expr__" + relativeTimeRange: + from: 0 + to: 0 + model: + type: reduce + expression: A + reducer: last + settings: + mode: dropNN + refId: B + - refId: C + datasourceUid: "__expr__" + relativeTimeRange: + from: 0 + to: 0 + model: + type: threshold + expression: B + conditions: + - evaluator: + type: lt + params: + - 1 + operator: + type: and + refId: C + templates: - orgId: 1 name: ntfy-infra diff --git a/docs/how-to/alerts/runbook-pod-not-ready.md b/docs/how-to/alerts/runbook-pod-not-ready.md new file mode 100644 index 0000000..49dd35e --- /dev/null +++ b/docs/how-to/alerts/runbook-pod-not-ready.md @@ -0,0 +1,55 @@ +--- +title: "Runbook: Pod Not Ready" +modified: 2026-03-22 +tags: + - how-to + - alerting + - runbook +--- + +# Runbook: Pod Not Ready + +**Alert name:** `PodNotReady` + +A Kubernetes pod has been in a not-ready state for 5+ minutes. + +## Diagnostic Steps + +1. **Identify the pod** from the alert labels (`pod`, `namespace`): + ```fish + kubectl describe pod -n --context=minikube-indri + ``` + +2. **Check events** — look for scheduling failures, image pull errors, or probe failures: + ```fish + kubectl get events -n --context=minikube-indri --sort-by='.lastTimestamp' | tail -20 + ``` + +3. **Check logs**: + ```fish + kubectl logs -n --context=minikube-indri --tail=50 + ``` + +4. **Check node resources**: + ```fish + kubectl top nodes --context=minikube-indri + kubectl top pods -n --context=minikube-indri + ``` + +## Common Causes + +- **CrashLoopBackOff** — app is crashing on startup, check logs +- **ImagePullBackOff** — container image not found or registry unreachable +- **Pending** — insufficient resources (CPU/memory), or PVC not bound +- **Readiness probe failing** — service is running but not healthy +- **NFS mount issue** — services depending on sifaka (kiwix, transmission, navidrome, jellyfin) will fail if NFS is down + +## Silencing + +1. Grafana → Alerting → Silences → Create Silence +2. Match `alertname = PodNotReady` +3. Optionally match `namespace = ` to silence a specific service + +## Related + +- [[deploy-infra-alerting]] — Alerting pipeline overview diff --git a/docs/how-to/alerts/runbook-postgres-unhealthy.md b/docs/how-to/alerts/runbook-postgres-unhealthy.md new file mode 100644 index 0000000..2910851 --- /dev/null +++ b/docs/how-to/alerts/runbook-postgres-unhealthy.md @@ -0,0 +1,63 @@ +--- +title: "Runbook: PostgreSQL Cluster Unhealthy" +modified: 2026-03-22 +tags: + - how-to + - alerting + - runbook +--- + +# Runbook: PostgreSQL Cluster Unhealthy + +**Alert name:** `PostgresClusterUnhealthy` + +The CNPG collector metrics endpoint is down, indicating the PostgreSQL cluster is not responding. + +## Affected Services + +The `blumeops-pg` CNPG cluster on indri's minikube runs databases for: +- TeslaMate +- Authentik (cross-cluster from ringtail) +- Immich +- Grafana dashboards (TeslaMate datasource) + +## Diagnostic Steps + +1. **Check CNPG cluster status**: + ```fish + kubectl get cluster blumeops-pg -n databases --context=minikube-indri + kubectl get pods -n databases -l cnpg.io/cluster=blumeops-pg --context=minikube-indri + ``` + +2. **Check pod logs**: + ```fish + kubectl logs -n databases -l cnpg.io/cluster=blumeops-pg --context=minikube-indri --tail=30 + ``` + +3. **Check if pg_isready**: + ```fish + pg_isready -h pg.ops.eblu.me -p 5432 + ``` + +4. **Check PVC storage**: + ```fish + kubectl get pvc -n databases --context=minikube-indri + ``` + +## Common Causes + +- **Pod crash** — OOM, disk full, or configuration error +- **PVC storage full** — check with `kubectl exec` into the pod and `df -h` +- **Minikube issue** — if the node is under memory pressure, CNPG pods may be evicted +- **Network** — Caddy L4 proxy (`pg.ops.eblu.me`) may be misconfigured + +## Silencing + +For planned database maintenance: +1. Grafana → Alerting → Silences → Create Silence +2. Match `alertname = PostgresClusterUnhealthy` + +## Related + +- [[postgresql]] — CNPG cluster reference +- [[deploy-infra-alerting]] — Alerting pipeline overview diff --git a/docs/reference/operations/observability.md b/docs/reference/operations/observability.md index 852f5d3..1aae7b9 100644 --- a/docs/reference/operations/observability.md +++ b/docs/reference/operations/observability.md @@ -21,3 +21,5 @@ Metrics, logs, traces, and dashboards for BlumeOps infrastructure. - [[deploy-infra-alerting]] - Alerting pipeline (Grafana Unified Alerting → ntfy) - [[runbook-service-probe-failure]] - Service health check failure runbook +- [[runbook-postgres-unhealthy]] - PostgreSQL cluster health runbook +- [[runbook-pod-not-ready]] - Pod not ready runbook From 02e07aeb41c6ab427a9335dcb4f110371b10db8e Mon Sep 17 00:00:00 2001 From: Erich Blume Date: Sun, 22 Mar 2026 12:19:33 -0700 Subject: [PATCH 09/18] C2(deploy-infra-alerting): impl exclude Job-owned pods from PodNotReady CronJob pods (e.g., zim-watcher) are expected to complete and become not-ready. Exclude them with `unless on (namespace, pod) kube_pod_owner{owner_kind="Job"}`. Co-Authored-By: Claude Opus 4.6 (1M context) --- argocd/manifests/grafana/alerting.yaml | 2 ++ 1 file changed, 2 insertions(+) diff --git a/argocd/manifests/grafana/alerting.yaml b/argocd/manifests/grafana/alerting.yaml index 47f2ec6..26370a3 100644 --- a/argocd/manifests/grafana/alerting.yaml +++ b/argocd/manifests/grafana/alerting.yaml @@ -167,6 +167,8 @@ groups: model: expr: >- kube_pod_status_ready{condition="true"} == 0 + unless on (namespace, pod) + kube_pod_owner{owner_kind="Job"} interval: "" refId: A - refId: B From b2b0d6efa7715b695f3e601c151f42efcb68d1c5 Mon Sep 17 00:00:00 2001 From: Erich Blume Date: Sun, 22 Mar 2026 13:35:24 -0700 Subject: [PATCH 10/18] C2(deploy-infra-alerting): impl fix PodNotReady noDataState to OK No unhealthy pods = no query results = noData state. With noDataState set to NoData, Grafana fires an alert with empty labels ("Pod in is not ready"). Change to OK since no results means everything is healthy. Co-Authored-By: Claude Opus 4.6 (1M context) --- argocd/manifests/grafana/alerting.yaml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/argocd/manifests/grafana/alerting.yaml b/argocd/manifests/grafana/alerting.yaml index 26370a3..3714bf1 100644 --- a/argocd/manifests/grafana/alerting.yaml +++ b/argocd/manifests/grafana/alerting.yaml @@ -150,7 +150,7 @@ groups: title: PodNotReady condition: C for: 5m - noDataState: NoData + noDataState: OK execErrState: Alerting annotations: summary: >- From 2fa536e547764c1d16cc9537146df0de843a92ff Mon Sep 17 00:00:00 2001 From: Erich Blume Date: Sun, 22 Mar 2026 13:43:16 -0700 Subject: [PATCH 11/18] C2(deploy-infra-alerting): impl add textfile staleness and Frigate alerts - TextfileStale: fires when a .prom textfile on indri hasn't been updated in 1 hour (node_textfile_mtime_seconds). Covers borgmatic, zot, minikube, jellyfin exporters. - FrigateCameraDown: fires when frigate_camera_fps drops to 0 for 5m. - Add runbooks for both alerts. Co-Authored-By: Claude Opus 4.6 (1M context) --- argocd/manifests/grafana/alerting.yaml | 115 ++++++++++++++++++ .../alerts/runbook-frigate-camera-down.md | 39 ++++++ docs/how-to/alerts/runbook-textfile-stale.md | 58 +++++++++ docs/reference/operations/observability.md | 2 + 4 files changed, 214 insertions(+) create mode 100644 docs/how-to/alerts/runbook-frigate-camera-down.md create mode 100644 docs/how-to/alerts/runbook-textfile-stale.md diff --git a/argocd/manifests/grafana/alerting.yaml b/argocd/manifests/grafana/alerting.yaml index 3714bf1..dfcc5a3 100644 --- a/argocd/manifests/grafana/alerting.yaml +++ b/argocd/manifests/grafana/alerting.yaml @@ -84,6 +84,121 @@ groups: type: and refId: C + - orgId: 1 + name: textfile-freshness + folder: Infrastructure Alerts + interval: 60s + rules: + - uid: textfile-stale + title: TextfileStale + condition: C + for: 15m + noDataState: Alerting + execErrState: Alerting + annotations: + summary: >- + Metrics textfile {{ index $labels "file" }} has not been updated in over 1 hour + runbook_url: https://docs.eblu.me/how-to/alerts/runbook-textfile-stale + labels: + severity: warning + service: indri-metrics + data: + - refId: A + datasourceUid: prometheus + relativeTimeRange: + from: 300 + to: 0 + model: + expr: >- + time() - node_textfile_mtime_seconds > 3600 + interval: "" + refId: A + - refId: B + datasourceUid: "__expr__" + relativeTimeRange: + from: 0 + to: 0 + model: + type: reduce + expression: A + reducer: last + settings: + mode: dropNN + refId: B + - refId: C + datasourceUid: "__expr__" + relativeTimeRange: + from: 0 + to: 0 + model: + type: threshold + expression: B + conditions: + - evaluator: + type: gt + params: + - 0 + operator: + type: and + refId: C + + - orgId: 1 + name: frigate-health + folder: Infrastructure Alerts + interval: 60s + rules: + - uid: frigate-camera-down + title: FrigateCameraDown + condition: C + for: 5m + noDataState: Alerting + execErrState: Alerting + annotations: + summary: >- + Frigate camera {{ index $labels "camera_name" }} has 0 FPS + runbook_url: https://docs.eblu.me/how-to/alerts/runbook-frigate-camera-down + labels: + severity: warning + service: frigate + data: + - refId: A + datasourceUid: prometheus + relativeTimeRange: + from: 300 + to: 0 + model: + expr: frigate_camera_fps + interval: "" + refId: A + - refId: B + datasourceUid: "__expr__" + relativeTimeRange: + from: 0 + to: 0 + model: + type: reduce + expression: A + reducer: last + settings: + mode: dropNN + refId: B + - refId: C + datasourceUid: "__expr__" + relativeTimeRange: + from: 0 + to: 0 + model: + type: threshold + expression: B + conditions: + - evaluator: + type: lt + params: + - 1 + operator: + type: and + refId: C + - orgId: 1 name: database-health folder: Infrastructure Alerts diff --git a/docs/how-to/alerts/runbook-frigate-camera-down.md b/docs/how-to/alerts/runbook-frigate-camera-down.md new file mode 100644 index 0000000..ea04e79 --- /dev/null +++ b/docs/how-to/alerts/runbook-frigate-camera-down.md @@ -0,0 +1,39 @@ +--- +title: "Runbook: Frigate Camera Down" +modified: 2026-03-22 +tags: + - how-to + - alerting + - runbook +--- + +# Runbook: Frigate Camera Down + +**Alert name:** `FrigateCameraDown` + +A Frigate camera has reported 0 FPS for 5+ minutes, meaning the camera feed is not being received. + +## Diagnostic Steps + +1. **Check Frigate UI** — https://nvr.ops.eblu.me — look at the camera thumbnail and status +2. **Check Frigate API stats**: + ```fish + curl -s https://nvr.ops.eblu.me/api/stats | python3 -m json.tool + ``` +3. **Check Frigate pod logs** on ringtail: + ```fish + kubectl logs -n frigate -l app=frigate --context=k3s-ringtail --tail=30 + ``` +4. **Check the camera itself** — verify it's powered on and network-connected. Try accessing the RTSP stream directly. + +## Common Causes + +- **Camera offline** — power outage, network issue, or camera crash +- **NFS mount lost** — Frigate storage on sifaka; if the NFS mount drops, recording stops and FPS may drop +- **Frigate pod restart** — during restart, camera FPS briefly drops to 0 +- **RTSP stream timeout** — camera firmware issue; power cycle the camera + +## Related + +- [[frigate]] — Frigate NVR reference +- [[deploy-infra-alerting]] — Alerting pipeline overview diff --git a/docs/how-to/alerts/runbook-textfile-stale.md b/docs/how-to/alerts/runbook-textfile-stale.md new file mode 100644 index 0000000..2a70adf --- /dev/null +++ b/docs/how-to/alerts/runbook-textfile-stale.md @@ -0,0 +1,58 @@ +--- +title: "Runbook: Textfile Stale" +modified: 2026-03-22 +tags: + - how-to + - alerting + - runbook +--- + +# Runbook: Textfile Stale + +**Alert name:** `TextfileStale` + +A Prometheus textfile collector `.prom` file on indri has not been updated for over 1 hour, indicating the metrics exporter script has stopped running. + +## Affected Textfiles + +| File | LaunchAgent | What it monitors | +|------|-------------|------------------| +| `borgmatic.prom` | `mcquack.eblume.borgmatic` | Backup status | +| `zot.prom` | `mcquack.eblume.zot` | Container registry | +| `minikube.prom` | `mcquack.minikube-metrics` | Minikube cluster status | +| `jellyfin.prom` | `mcquack.eblume.jellyfin-metrics` | Media server | + +## Diagnostic Steps + +1. **Check which file is stale** — the `file` label in the alert tells you. Verify on indri: + ```fish + ssh indri 'ls -la /opt/homebrew/var/node_exporter/textfile/' + ``` + +2. **Check if the LaunchAgent is running**: + ```fish + ssh indri 'launchctl list | grep mcquack' + ``` + +3. **Check LaunchAgent logs** (plist defines stdout/stderr paths): + ```fish + ssh indri 'cat ~/Library/Logs/mcquack/.log' + ``` + +4. **Try running the exporter manually**: + ```fish + ssh indri 'cat ~/Library/LaunchAgents/mcquack..plist' + # Find the ProgramArguments, run them manually + ``` + +## Common Causes + +- **LaunchAgent not loaded** — `launchctl load ~/Library/LaunchAgents/mcquack..plist` +- **Script error** — the exporter script crashed; check logs +- **Permissions** — the textfile directory is not writable +- **Indri reboot** — some LaunchAgents may not auto-start + +## Related + +- [[alloy]] — Collects textfile metrics via `prometheus.exporter.unix` +- [[deploy-infra-alerting]] — Alerting pipeline overview diff --git a/docs/reference/operations/observability.md b/docs/reference/operations/observability.md index 1aae7b9..9d4a7a0 100644 --- a/docs/reference/operations/observability.md +++ b/docs/reference/operations/observability.md @@ -23,3 +23,5 @@ Metrics, logs, traces, and dashboards for BlumeOps infrastructure. - [[runbook-service-probe-failure]] - Service health check failure runbook - [[runbook-postgres-unhealthy]] - PostgreSQL cluster health runbook - [[runbook-pod-not-ready]] - Pod not ready runbook +- [[runbook-textfile-stale]] - Metrics textfile freshness runbook +- [[runbook-frigate-camera-down]] - Frigate camera health runbook From 957ee90fa24c63b859afc2011d410bfc8df449e6 Mon Sep 17 00:00:00 2001 From: Erich Blume Date: Sun, 22 Mar 2026 13:45:34 -0700 Subject: [PATCH 12/18] C2(deploy-infra-alerting): impl add ArgoCD scrape and sync alert - Add ArgoCD metrics scrape target to Prometheus (argocd-metrics:8082) - Add ArgoCDAppOutOfSync alert: fires when argocd_app_info has sync_status != Synced for 30 minutes - Add runbook with diagnostic steps and common fixes Co-Authored-By: Claude Opus 4.6 (1M context) --- argocd/manifests/grafana/alerting.yaml | 58 +++++++++++++++++ argocd/manifests/prometheus/prometheus.yml | 8 +++ .../alerts/runbook-argocd-out-of-sync.md | 65 +++++++++++++++++++ docs/reference/operations/observability.md | 1 + 4 files changed, 132 insertions(+) create mode 100644 docs/how-to/alerts/runbook-argocd-out-of-sync.md diff --git a/argocd/manifests/grafana/alerting.yaml b/argocd/manifests/grafana/alerting.yaml index dfcc5a3..dcb6762 100644 --- a/argocd/manifests/grafana/alerting.yaml +++ b/argocd/manifests/grafana/alerting.yaml @@ -315,6 +315,64 @@ groups: type: and refId: C + - orgId: 1 + name: argocd-health + folder: Infrastructure Alerts + interval: 60s + rules: + - uid: argocd-app-out-of-sync + title: ArgoCDAppOutOfSync + condition: C + for: 30m + noDataState: OK + execErrState: Alerting + annotations: + summary: >- + ArgoCD app {{ index $labels "name" }} is {{ index $labels "sync_status" }} + runbook_url: https://docs.eblu.me/how-to/alerts/runbook-argocd-out-of-sync + labels: + severity: warning + service: argocd + data: + - refId: A + datasourceUid: prometheus + relativeTimeRange: + from: 300 + to: 0 + model: + expr: >- + argocd_app_info{sync_status!="Synced"} + interval: "" + refId: A + - refId: B + datasourceUid: "__expr__" + relativeTimeRange: + from: 0 + to: 0 + model: + type: reduce + expression: A + reducer: last + settings: + mode: dropNN + refId: B + - refId: C + datasourceUid: "__expr__" + relativeTimeRange: + from: 0 + to: 0 + model: + type: threshold + expression: B + conditions: + - evaluator: + type: gt + params: + - 0 + operator: + type: and + refId: C + templates: - orgId: 1 name: ntfy-infra diff --git a/argocd/manifests/prometheus/prometheus.yml b/argocd/manifests/prometheus/prometheus.yml index 2d2dbcf..f96ce12 100644 --- a/argocd/manifests/prometheus/prometheus.yml +++ b/argocd/manifests/prometheus/prometheus.yml @@ -80,6 +80,14 @@ scrape_configs: - target_label: cluster replacement: indri + # ArgoCD application metrics + - job_name: "argocd" + static_configs: + - targets: ["argocd-metrics.argocd.svc.cluster.local:8082"] + metric_relabel_configs: + - target_label: cluster + replacement: indri + # Frigate NVR metrics (via Caddy on indri — Frigate runs on ringtail) - job_name: "frigate" scheme: https diff --git a/docs/how-to/alerts/runbook-argocd-out-of-sync.md b/docs/how-to/alerts/runbook-argocd-out-of-sync.md new file mode 100644 index 0000000..753b336 --- /dev/null +++ b/docs/how-to/alerts/runbook-argocd-out-of-sync.md @@ -0,0 +1,65 @@ +--- +title: "Runbook: ArgoCD App Out of Sync" +modified: 2026-03-22 +tags: + - how-to + - alerting + - runbook +--- + +# Runbook: ArgoCD App Out of Sync + +**Alert name:** `ArgoCDAppOutOfSync` + +An ArgoCD application has been out of sync for 30+ minutes. This means the live state in Kubernetes differs from what's declared in Git. + +## Diagnostic Steps + +1. **Check which app is out of sync** — the `name` label in the alert tells you: + ```fish + argocd app get + ``` + +2. **View the diff**: + ```fish + argocd app diff + ``` + +3. **Check if it's a branch revision issue** — during C1/C2 work, apps may be pointed at a feature branch. After merge, they need to be reset to main: + ```fish + argocd app get -o json | python3 -c "import json,sys; print(json.load(sys.stdin)['spec']['source']['targetRevision'])" + ``` + +4. **Check ArgoCD UI** — https://argocd.ops.eblu.me — look for sync errors or degraded status. + +## Common Causes + +- **Forgot to sync after push** — ArgoCD uses manual sync; changes require explicit `argocd app sync` +- **Branch revision not reset after PR merge** — app still points at a deleted branch +- **Kustomize/manifest error** — invalid YAML or unsatisfiable resource requirements +- **Pruning needed** — old ConfigMaps from `configMapGenerator` need pruning + +## Resolution + +```fish +# Simple sync +argocd app sync + +# If pruning is needed +argocd app sync --prune + +# If stuck on a deleted branch +argocd app set --revision main +argocd app sync +``` + +## Silencing + +During active C1/C2 development, apps may intentionally be out of sync: +1. Grafana → Alerting → Silences → Create Silence +2. Match `alertname = ArgoCDAppOutOfSync` and `name = ` + +## Related + +- [[argocd]] — ArgoCD reference +- [[deploy-infra-alerting]] — Alerting pipeline overview diff --git a/docs/reference/operations/observability.md b/docs/reference/operations/observability.md index 9d4a7a0..35136d5 100644 --- a/docs/reference/operations/observability.md +++ b/docs/reference/operations/observability.md @@ -25,3 +25,4 @@ Metrics, logs, traces, and dashboards for BlumeOps infrastructure. - [[runbook-pod-not-ready]] - Pod not ready runbook - [[runbook-textfile-stale]] - Metrics textfile freshness runbook - [[runbook-frigate-camera-down]] - Frigate camera health runbook +- [[runbook-argocd-out-of-sync]] - ArgoCD sync status runbook From d9ab004479e387ce34ef7a20f2295efe3e9be187 Mon Sep 17 00:00:00 2001 From: Erich Blume Date: Sun, 22 Mar 2026 14:05:25 -0700 Subject: [PATCH 13/18] C2(deploy-infra-alerting): impl fix Grafana probe port (80 not 3000) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Grafana's k8s Service maps port 80 → container port 3000. The blackbox probe was targeting port 3000 directly on the Service ClusterIP, which doesn't work — connection refused. Co-Authored-By: Claude Opus 4.6 (1M context) --- argocd/manifests/alloy-k8s/config.alloy | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/argocd/manifests/alloy-k8s/config.alloy b/argocd/manifests/alloy-k8s/config.alloy index 667f735..a716ddc 100644 --- a/argocd/manifests/alloy-k8s/config.alloy +++ b/argocd/manifests/alloy-k8s/config.alloy @@ -184,7 +184,7 @@ prometheus.exporter.blackbox "services" { target { name = "grafana" - address = "http://grafana.monitoring.svc.cluster.local:3000/api/health" + address = "http://grafana.monitoring.svc.cluster.local:80/api/health" module = "http_2xx" } From da452e2bf579290bca483bb4c0b09c6139b7db9c Mon Sep 17 00:00:00 2001 From: Erich Blume Date: Sun, 22 Mar 2026 14:10:05 -0700 Subject: [PATCH 14/18] C2(deploy-infra-alerting): impl fix TextfileStale noDataState to OK Same pattern as PodNotReady: when no textfiles are stale, the query returns no data. noDataState=Alerting incorrectly treats this as a problem. Changed to OK. Co-Authored-By: Claude Opus 4.6 (1M context) --- argocd/manifests/grafana/alerting.yaml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/argocd/manifests/grafana/alerting.yaml b/argocd/manifests/grafana/alerting.yaml index dcb6762..fb5b197 100644 --- a/argocd/manifests/grafana/alerting.yaml +++ b/argocd/manifests/grafana/alerting.yaml @@ -93,7 +93,7 @@ groups: title: TextfileStale condition: C for: 15m - noDataState: Alerting + noDataState: OK execErrState: Alerting annotations: summary: >- From cdd85c7ac92947abe535176d6013a73215369d1f Mon Sep 17 00:00:00 2001 From: Erich Blume Date: Sun, 22 Mar 2026 14:13:20 -0700 Subject: [PATCH 15/18] C2(deploy-infra-alerting): impl fix TextfileStale to always return data MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Query all textfile mtimes (time() - node_textfile_mtime_seconds) and threshold at > 3600s, instead of filtering with > 3600 which returns empty results when everything is fresh. This means: - Fresh textfiles: query returns low values, threshold not met → OK - Stale textfiles: query returns high values, threshold met → Alerting - Missing textfiles: series vanishes, noDataState=Alerting → Alerting Co-Authored-By: Claude Opus 4.6 (1M context) --- argocd/manifests/grafana/alerting.yaml | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/argocd/manifests/grafana/alerting.yaml b/argocd/manifests/grafana/alerting.yaml index fb5b197..c0f0496 100644 --- a/argocd/manifests/grafana/alerting.yaml +++ b/argocd/manifests/grafana/alerting.yaml @@ -93,7 +93,7 @@ groups: title: TextfileStale condition: C for: 15m - noDataState: OK + noDataState: Alerting execErrState: Alerting annotations: summary: >- @@ -110,7 +110,7 @@ groups: to: 0 model: expr: >- - time() - node_textfile_mtime_seconds > 3600 + time() - node_textfile_mtime_seconds interval: "" refId: A - refId: B @@ -137,7 +137,7 @@ groups: - evaluator: type: gt params: - - 0 + - 3600 operator: type: and refId: C From 52eed44542de0f73efee759b6b609014a3d869ed Mon Sep 17 00:00:00 2001 From: Erich Blume Date: Sun, 22 Mar 2026 14:21:42 -0700 Subject: [PATCH 16/18] C2(deploy-infra-alerting): impl refactor services-check to query alerts Replace covered checks with Grafana alerting API queries: - ServiceProbeFailure: 11 HTTP endpoints - TextfileStale: metrics textfile freshness - FrigateCameraDown: camera FPS - PodNotReady: pod readiness (both clusters) - PostgresClusterUnhealthy: database health - ArgoCDAppOutOfSync: ArgoCD sync status Uncovered checks remain as direct probes (SSH, launchctl, public endpoints, k8s API, frigate storage, some HTTP endpoints). Firing alerts display summary and clickable runbook link. Grafana credentials fetched from 1Password; graceful fallback if unavailable. Co-Authored-By: Claude Opus 4.6 (1M context) --- mise-tasks/services-check | 159 +++++++++++++++++++++++++++----------- 1 file changed, 114 insertions(+), 45 deletions(-) diff --git a/mise-tasks/services-check b/mise-tasks/services-check index 94ced03..9ba2c8e 100755 --- a/mise-tasks/services-check +++ b/mise-tasks/services-check @@ -6,6 +6,7 @@ set -euo pipefail # Colors for output RED='\033[0;31m' GREEN='\033[0;32m' +YELLOW='\033[0;33m' NC='\033[0m' # No Color FAILED=0 @@ -36,11 +37,88 @@ check_http() { fi } +# ============== Grafana Alerting API ============== + +GRAFANA_URL="https://grafana.ops.eblu.me" +GRAFANA_CREDS="" + +fetch_alerts() { + if [ -z "$GRAFANA_CREDS" ]; then + local pass + pass=$(op read 'op://vg6xf6vvfmoh5hqjjhlhbeoaie/oxkcr3xtxnewy7noep2izvyr6y/password' 2>/dev/null) || true + if [ -n "$pass" ]; then + GRAFANA_CREDS=$(echo -n "admin:$pass" | base64) + fi + fi + + if [ -z "$GRAFANA_CREDS" ]; then + echo "" + return + fi + + curl -sf --max-time 10 \ + -H "Authorization: Basic $GRAFANA_CREDS" \ + "$GRAFANA_URL/api/prometheus/grafana/api/v1/alerts" 2>/dev/null || echo "" +} + +# Fetch all alerts once +ALERTS_JSON=$(fetch_alerts) + +check_alert() { + local name="$1" + local alertname="$2" + # Optional: filter by a label key=value + local filter_key="${3:-}" + local filter_value="${4:-}" + + printf "%-24s " "$name..." + + if [ -z "$ALERTS_JSON" ]; then + echo -e "${YELLOW}NO DATA${NC} (can't reach Grafana alerting API)" + return + fi + + local firing + firing=$(echo "$ALERTS_JSON" | python3 -c " +import json, sys +try: + data = json.load(sys.stdin) +except: + sys.exit(1) +alerts = data.get('data', {}).get('alerts', []) +for a in alerts: + if a['labels'].get('alertname') != '$alertname': + continue + if '$filter_key' and a['labels'].get('$filter_key') != '$filter_value': + continue + if a['state'] in ('Alerting', 'Pending'): + url = a.get('annotations', {}).get('runbook_url', '') + summary = a.get('annotations', {}).get('summary', '') + print(f'{summary}|{url}') +" 2>/dev/null) + + if [ -z "$firing" ]; then + echo -e "${GREEN}OK${NC}" + else + local summary runbook + summary=$(echo "$firing" | head -1 | cut -d'|' -f1) + runbook=$(echo "$firing" | head -1 | cut -d'|' -f2) + echo -e "${RED}FIRING${NC}" + if [ -n "$summary" ]; then + echo -e " $summary" + fi + if [ -n "$runbook" ]; then + echo -e " Runbook: $runbook" + fi + FAILED=1 + fi +} + echo "Checking services..." echo "====================" echo "" -# Local services on indri +# Local services on indri (not yet covered by alerting) echo "Local services on indri:" check_service "forgejo (brew)" "ssh indri 'brew services list | grep forgejo | grep started'" check_service "alloy" "ssh indri 'launchctl list mcquack.eblume.alloy | grep -v \"^-\"'" @@ -52,43 +130,47 @@ check_service "minikube-metrics" "ssh indri 'launchctl list mcquack.minikube-met check_service "jellyfin-metrics" "ssh indri 'launchctl list mcquack.eblume.jellyfin-metrics | grep -v \"^-\"'" echo "" -echo "Metrics textfiles:" -check_service "borgmatic.prom" "ssh indri 'test -f /opt/homebrew/var/node_exporter/textfile/borgmatic.prom'" -check_service "zot.prom" "ssh indri 'test -f /opt/homebrew/var/node_exporter/textfile/zot.prom'" -check_service "minikube.prom" "ssh indri 'test -f /opt/homebrew/var/node_exporter/textfile/minikube.prom'" -check_service "jellyfin.prom" "ssh indri 'test -f /opt/homebrew/var/node_exporter/textfile/jellyfin.prom'" +echo "Metrics textfiles (via alerting):" +check_alert "textfile-freshness" "TextfileStale" echo "" -echo "Kubernetes cluster:" +echo "Kubernetes cluster (not yet covered by alerting):" check_service "minikube" "ssh indri 'minikube status --format={{.Host}} | grep -q Running'" check_service "k8s-apiserver (indri)" "ssh indri 'kubectl get --raw /healthz'" check_service "k8s-apiserver (remote)" "kubectl --kubeconfig=$HOME/.kube/minikube-indri/config.yml --context=minikube-indri get --raw /healthz" echo "" -echo "HTTP endpoints (via Caddy):" -check_http "Prometheus" "https://prometheus.ops.eblu.me/-/healthy" -check_http "Loki" "https://loki.ops.eblu.me/ready" -check_http "Grafana" "https://grafana.ops.eblu.me/api/health" -check_http "ArgoCD" "https://argocd.ops.eblu.me/healthz" +echo "HTTP endpoints (via alerting):" +check_alert "Prometheus" "ServiceProbeFailure" "service" "prometheus" +check_alert "Loki" "ServiceProbeFailure" "service" "loki" +check_alert "Grafana" "ServiceProbeFailure" "service" "grafana" +check_alert "ArgoCD" "ServiceProbeFailure" "service" "argocd" +check_alert "Kiwix" "ServiceProbeFailure" "service" "kiwix" +check_alert "Miniflux" "ServiceProbeFailure" "service" "miniflux" +check_alert "TeslaMate" "ServiceProbeFailure" "service" "teslamate" +check_alert "Devpi" "ServiceProbeFailure" "service" "devpi" +check_alert "Transmission" "ServiceProbeFailure" "service" "transmission" +check_alert "Immich" "ServiceProbeFailure" "service" "immich" +check_alert "Navidrome" "ServiceProbeFailure" "service" "navidrome" + +echo "" +echo "HTTP endpoints (not yet covered by alerting):" check_http "Forgejo" "https://forge.eblu.me/" check_http "Zot Registry" "https://registry.ops.eblu.me/v2/_catalog" -check_http "Kiwix" "https://kiwix.ops.eblu.me/" -check_http "Miniflux" "https://feed.ops.eblu.me/healthcheck" -check_http "TeslaMate" "https://tesla.ops.eblu.me/" -check_http "Devpi" "https://pypi.ops.eblu.me/+api" -check_http "Transmission" "https://torrent.ops.eblu.me/" -check_http "Immich" "https://photos.ops.eblu.me/" -check_http "Navidrome" "https://dj.ops.eblu.me/" check_http "CV" "https://cv.ops.eblu.me/" check_http "Ntfy" "https://ntfy.ops.eblu.me/v1/health" check_http "Authentik" "https://authentik.ops.eblu.me/-/health/live/" check_http "Frigate" "https://nvr.ops.eblu.me/api/version" -check_service "frigate-camera-fps" "curl -sf --max-time 5 https://nvr.ops.eblu.me/api/stats | jq -e '.cameras | to_entries | all(.value.camera_fps > 0)'" -check_service "frigate-storage" "curl -sf --max-time 5 https://nvr.ops.eblu.me/api/stats | jq -e '.service.storage | to_entries | map(select(.key | startswith(\"/media\"))) | length > 0 and all(.[]; .value.free > 0)'" check_http "JobSync" "https://jobsync.ops.eblu.me/" echo "" -echo "Ringtail (NixOS):" +echo "Frigate (via alerting):" +check_alert "camera-fps" "FrigateCameraDown" +echo "Frigate (not yet covered by alerting):" +check_service "frigate-storage" "curl -sf --max-time 5 https://nvr.ops.eblu.me/api/stats | jq -e '.service.storage | to_entries | map(select(.key | startswith(\"/media\"))) | length > 0 and all(.[]; .value.free > 0)'" + +echo "" +echo "Ringtail (not yet covered by alerting):" check_service "ssh" "ssh -o ConnectTimeout=5 ringtail true" check_service "tailscale" "ssh ringtail 'tailscale status --self --json' | jq -e '.Self.Online' > /dev/null" check_service "k3s" "ssh ringtail 'KUBECONFIG=/etc/rancher/k3s/k3s.yaml k3s kubectl get nodes --no-headers | grep -q Ready'" @@ -96,43 +178,30 @@ check_service "k3s-apiserver (remote)" "kubectl --context=k3s-ringtail get --raw check_service "forgejo-runner" "ssh ringtail 'systemctl is-active gitea-runner-nix_container_builder.service'" echo "" -echo "Ringtail k3s pods:" -check_service "ntfy" "kubectl --context=k3s-ringtail -n ntfy get pods -l app=ntfy -o jsonpath='{.items[0].status.phase}' | grep -q Running" -check_service "authentik" "kubectl --context=k3s-ringtail -n authentik get pods -l component=server -o jsonpath='{.items[0].status.phase}' | grep -q Running" -check_service "frigate" "kubectl --context=k3s-ringtail -n frigate get pods -l app=frigate -o jsonpath='{.items[0].status.phase}' | grep -q Running" -check_service "frigate-notify" "kubectl --context=k3s-ringtail -n frigate get pods -l app=frigate-notify -o jsonpath='{.items[0].status.phase}' | grep -q Running" -check_service "nvidia-device-plugin" "kubectl --context=k3s-ringtail -n nvidia-device-plugin get pods -l app=nvidia-device-plugin -o jsonpath='{.items[0].status.phase}' | grep -q Running" -check_service "jobsync" "kubectl --context=k3s-ringtail -n jobsync get pods -l app=jobsync -o jsonpath='{.items[0].status.phase}' | grep -q Running" +echo "Pod health (via alerting):" +check_alert "pod-readiness" "PodNotReady" echo "" -echo "Public services (via Fly.io):" +echo "Database (via alerting):" +check_alert "PostgreSQL" "PostgresClusterUnhealthy" + +echo "" +echo "Public services (not yet covered by alerting):" check_http "Docs (public)" "https://docs.eblu.me/" check_http "CV (public)" "https://cv.eblu.me/" check_http "Forge (public)" "https://forge.eblu.me/" check_http "Fly.io healthz" "https://blumeops-proxy.fly.dev/healthz" echo "" -echo "Database:" -check_service "PostgreSQL (k8s)" "pg_isready -h pg.ops.eblu.me -p 5432" - -echo "" -echo "Indri minikube pods:" -check_service "prometheus-0" "kubectl --context=minikube-indri -n monitoring get pod prometheus-0 -o jsonpath='{.status.phase}' | grep -q Running" -check_service "loki-0" "kubectl --context=minikube-indri -n monitoring get pod loki-0 -o jsonpath='{.status.phase}' | grep -q Running" -check_service "grafana" "kubectl --context=minikube-indri -n monitoring get pods -l app.kubernetes.io/name=grafana -o jsonpath='{.items[0].status.phase}' | grep -q Running" -check_service "miniflux" "kubectl --context=minikube-indri -n miniflux get pods -l app=miniflux -o jsonpath='{.items[0].status.phase}' | grep -q Running" -check_service "teslamate" "kubectl --context=minikube-indri -n teslamate get pods -l app=teslamate -o jsonpath='{.items[0].status.phase}' | grep -q Running" -check_service "blumeops-pg" "kubectl --context=minikube-indri -n databases get pods -l cnpg.io/cluster=blumeops-pg -o jsonpath='{.items[0].status.phase}' | grep -q Running" - -echo "" -echo "ArgoCD app sync status:" +echo "ArgoCD app sync status (via alerting):" +check_alert "argocd-sync" "ArgoCDAppOutOfSync" +# Keep the detailed table as a summary view printf "%-20s %-12s %-12s %s\n" "NAME" "SYNC" "HEALTH" "TARGET" while read -r name sync health target; do if [[ "$sync" == "Synced" ]]; then printf "%-20s ${GREEN}%-12s${NC} %-12s %s\n" "$name" "$sync" "$health" "$target" elif [[ "$sync" == "OutOfSync" ]]; then printf "%-20s ${RED}%-12s${NC} %-12s %s\n" "$name" "$sync" "$health" "$target" - FAILED=1 else printf "%-20s %-12s %-12s %s\n" "$name" "$sync" "$health" "$target" fi From 2e2a33d7ca30e38a380638f93dffe39a4a6743d1 Mon Sep 17 00:00:00 2001 From: Erich Blume Date: Sun, 22 Mar 2026 14:23:42 -0700 Subject: [PATCH 17/18] C2(deploy-infra-alerting): close port-services-check-alerts 7 alert rules covering services-check probes: - ServiceProbeFailure (11 HTTP probes via Alloy blackbox) - PodNotReady (kube-state-metrics, both clusters) - PostgresClusterUnhealthy (CNPG collector) - TextfileStale (node_textfile_mtime_seconds) - FrigateCameraDown (frigate_camera_fps) - ArgoCDAppOutOfSync (argocd_app_info) 7 runbooks in docs/how-to/alerts/. Remaining uncovered: local indri services (brew/launchctl), ringtail SSH/tailscale, public Fly.io endpoints, k8s API health, frigate storage. These are effectively covered by downstream alerts. Co-Authored-By: Claude Opus 4.6 (1M context) --- docs/how-to/alerts/port-services-check-alerts.md | 1 - 1 file changed, 1 deletion(-) diff --git a/docs/how-to/alerts/port-services-check-alerts.md b/docs/how-to/alerts/port-services-check-alerts.md index 807c340..c2ea6ad 100644 --- a/docs/how-to/alerts/port-services-check-alerts.md +++ b/docs/how-to/alerts/port-services-check-alerts.md @@ -1,7 +1,6 @@ --- title: Port services-check Alerts to Grafana modified: 2026-03-22 -status: active requires: - first-alert-and-runbook tags: From 67883950c31aec6a0abaea43c9223d6102ef9723 Mon Sep 17 00:00:00 2001 From: Erich Blume Date: Sun, 22 Mar 2026 14:24:47 -0700 Subject: [PATCH 18/18] C2(deploy-infra-alerting): finalize rewrite cards as historical docs Remove all Mikado frontmatter (status, branch, requires) from chain cards. Rename docs/how-to/alerts/ to docs/how-to/runbooks/ and update all runbook_url references. Add changelog fragment. Co-Authored-By: Claude Opus 4.6 (1M context) --- argocd/manifests/grafana/alerting.yaml | 12 ++++++------ .../mikado-deploy-infra-alerting.feature.md | 1 + .../configure-grafana-alerting-pipeline.md | 0 .../{alerts => runbooks}/deploy-infra-alerting.md | 6 +----- .../{alerts => runbooks}/first-alert-and-runbook.md | 6 ++---- .../port-services-check-alerts.md | 6 ++---- .../refactor-services-check-to-query-alerts.md | 3 --- .../runbook-argocd-out-of-sync.md | 0 .../runbook-frigate-camera-down.md | 0 .../{alerts => runbooks}/runbook-pod-not-ready.md | 0 .../runbook-postgres-unhealthy.md | 0 .../runbook-service-probe-failure.md | 0 .../{alerts => runbooks}/runbook-textfile-stale.md | 0 13 files changed, 12 insertions(+), 22 deletions(-) create mode 100644 docs/changelog.d/mikado-deploy-infra-alerting.feature.md rename docs/how-to/{alerts => runbooks}/configure-grafana-alerting-pipeline.md (100%) rename docs/how-to/{alerts => runbooks}/deploy-infra-alerting.md (94%) rename docs/how-to/{alerts => runbooks}/first-alert-and-runbook.md (91%) rename docs/how-to/{alerts => runbooks}/port-services-check-alerts.md (94%) rename docs/how-to/{alerts => runbooks}/refactor-services-check-to-query-alerts.md (97%) rename docs/how-to/{alerts => runbooks}/runbook-argocd-out-of-sync.md (100%) rename docs/how-to/{alerts => runbooks}/runbook-frigate-camera-down.md (100%) rename docs/how-to/{alerts => runbooks}/runbook-pod-not-ready.md (100%) rename docs/how-to/{alerts => runbooks}/runbook-postgres-unhealthy.md (100%) rename docs/how-to/{alerts => runbooks}/runbook-service-probe-failure.md (100%) rename docs/how-to/{alerts => runbooks}/runbook-textfile-stale.md (100%) diff --git a/argocd/manifests/grafana/alerting.yaml b/argocd/manifests/grafana/alerting.yaml index c0f0496..abc4c0f 100644 --- a/argocd/manifests/grafana/alerting.yaml +++ b/argocd/manifests/grafana/alerting.yaml @@ -40,7 +40,7 @@ groups: annotations: summary: >- {{ index $labels "service" }} health check is failing - runbook_url: https://docs.eblu.me/how-to/alerts/runbook-service-probe-failure + runbook_url: https://docs.eblu.me/how-to/runbooks/runbook-service-probe-failure labels: severity: warning data: @@ -98,7 +98,7 @@ groups: annotations: summary: >- Metrics textfile {{ index $labels "file" }} has not been updated in over 1 hour - runbook_url: https://docs.eblu.me/how-to/alerts/runbook-textfile-stale + runbook_url: https://docs.eblu.me/how-to/runbooks/runbook-textfile-stale labels: severity: warning service: indri-metrics @@ -156,7 +156,7 @@ groups: annotations: summary: >- Frigate camera {{ index $labels "camera_name" }} has 0 FPS - runbook_url: https://docs.eblu.me/how-to/alerts/runbook-frigate-camera-down + runbook_url: https://docs.eblu.me/how-to/runbooks/runbook-frigate-camera-down labels: severity: warning service: frigate @@ -213,7 +213,7 @@ groups: annotations: summary: >- PostgreSQL cluster {{ index $labels "cluster" }} is unhealthy - runbook_url: https://docs.eblu.me/how-to/alerts/runbook-postgres-unhealthy + runbook_url: https://docs.eblu.me/how-to/runbooks/runbook-postgres-unhealthy labels: severity: critical service: postgresql @@ -270,7 +270,7 @@ groups: annotations: summary: >- Pod {{ index $labels "pod" }} in {{ index $labels "namespace" }} is not ready - runbook_url: https://docs.eblu.me/how-to/alerts/runbook-pod-not-ready + runbook_url: https://docs.eblu.me/how-to/runbooks/runbook-pod-not-ready labels: severity: warning data: @@ -329,7 +329,7 @@ groups: annotations: summary: >- ArgoCD app {{ index $labels "name" }} is {{ index $labels "sync_status" }} - runbook_url: https://docs.eblu.me/how-to/alerts/runbook-argocd-out-of-sync + runbook_url: https://docs.eblu.me/how-to/runbooks/runbook-argocd-out-of-sync labels: severity: warning service: argocd diff --git a/docs/changelog.d/mikado-deploy-infra-alerting.feature.md b/docs/changelog.d/mikado-deploy-infra-alerting.feature.md new file mode 100644 index 0000000..7106014 --- /dev/null +++ b/docs/changelog.d/mikado-deploy-infra-alerting.feature.md @@ -0,0 +1 @@ +Deploy infrastructure alerting pipeline using Grafana Unified Alerting with ntfy push notifications. 7 alert rules with runbooks covering service health, pod readiness, PostgreSQL, textfile freshness, Frigate cameras, and ArgoCD sync status. services-check now queries the alerting API for covered checks. diff --git a/docs/how-to/alerts/configure-grafana-alerting-pipeline.md b/docs/how-to/runbooks/configure-grafana-alerting-pipeline.md similarity index 100% rename from docs/how-to/alerts/configure-grafana-alerting-pipeline.md rename to docs/how-to/runbooks/configure-grafana-alerting-pipeline.md diff --git a/docs/how-to/alerts/deploy-infra-alerting.md b/docs/how-to/runbooks/deploy-infra-alerting.md similarity index 94% rename from docs/how-to/alerts/deploy-infra-alerting.md rename to docs/how-to/runbooks/deploy-infra-alerting.md index 7c2e7f0..e02523d 100644 --- a/docs/how-to/alerts/deploy-infra-alerting.md +++ b/docs/how-to/runbooks/deploy-infra-alerting.md @@ -1,10 +1,6 @@ --- title: Deploy Infrastructure Alerting Pipeline modified: 2026-03-22 -status: active -branch: mikado/deploy-infra-alerting -requires: - - refactor-services-check-to-query-alerts tags: - how-to - alerting @@ -35,7 +31,7 @@ Loki (logs) ──────────┘ │ | **Alert engine** | Grafana Unified Alerting | Already deployed, no new service needed | | **Notification** | ntfy webhook contact point | Already deployed on ringtail, iOS app works | | **Anti-noise** | 24h repeat interval | Page once per day max per alert group | -| **Runbooks** | `docs/how-to/alerts/.md` | Clickable link in every notification | +| **Runbooks** | `docs/how-to/runbooks/.md` | Clickable link in every notification | | **Provisioning** | Grafana provisioning YAML (GitOps) | Alerts defined in repo, not just UI | | **Topic** | `infra-alerts` (separate from `frigate-alerts`) | Different severity/audience | diff --git a/docs/how-to/alerts/first-alert-and-runbook.md b/docs/how-to/runbooks/first-alert-and-runbook.md similarity index 91% rename from docs/how-to/alerts/first-alert-and-runbook.md rename to docs/how-to/runbooks/first-alert-and-runbook.md index 71b86bf..6ce13bf 100644 --- a/docs/how-to/alerts/first-alert-and-runbook.md +++ b/docs/how-to/runbooks/first-alert-and-runbook.md @@ -1,8 +1,6 @@ --- title: First Alert and Runbook modified: 2026-03-22 -requires: - - configure-grafana-alerting-pipeline tags: - how-to - alerting @@ -32,7 +30,7 @@ Provision via YAML in the alerting provisioning ConfigMap. The rule should: ### 3. Create the Runbook -Write `docs/how-to/alerts/runbook-service-probe-failure.md` as a how-to doc explaining: +Write `docs/how-to/runbooks/runbook-service-probe-failure.md` as a how-to doc explaining: - What the alert means - How to check which service is down - Common causes and resolution steps @@ -52,7 +50,7 @@ Write `docs/how-to/alerts/runbook-service-probe-failure.md` as a how-to doc expl - Grafana alert rules can be provisioned as YAML files alongside contact points and notification policies - The blackbox probe metrics from Alloy use the job name `blackbox` and include an `instance` label with the service name -- The runbook URL format: `https://docs.eblu.me/how-to/alerts/runbook-service-probe-failure` +- The runbook URL format: `https://docs.eblu.me/how-to/runbooks/runbook-service-probe-failure` ## Verification diff --git a/docs/how-to/alerts/port-services-check-alerts.md b/docs/how-to/runbooks/port-services-check-alerts.md similarity index 94% rename from docs/how-to/alerts/port-services-check-alerts.md rename to docs/how-to/runbooks/port-services-check-alerts.md index c2ea6ad..4420f58 100644 --- a/docs/how-to/alerts/port-services-check-alerts.md +++ b/docs/how-to/runbooks/port-services-check-alerts.md @@ -1,8 +1,6 @@ --- title: Port services-check Alerts to Grafana modified: 2026-03-22 -requires: - - first-alert-and-runbook tags: - how-to - alerting @@ -43,7 +41,7 @@ For each check category, create provisioned Grafana alert rules. Group related c ### 4. Create Runbooks -One runbook per alert type in `docs/how-to/alerts/runbook-.md`. Each runbook should cover: +One runbook per alert type in `docs/how-to/runbooks/runbook-.md`. Each runbook should cover: - What the alert means - Diagnostic steps - Common fixes @@ -65,7 +63,7 @@ As each check is ported, remove it from the services-check script (or mark it as - [ ] All HTTP endpoint checks from services-check have corresponding alert rules - [ ] Pod health checks have corresponding alert rules - [ ] PostgreSQL health has a corresponding alert rule -- [ ] Each alert rule has a runbook doc in `docs/how-to/alerts/` +- [ ] Each alert rule has a runbook doc in `docs/how-to/runbooks/` - [ ] Test at least 2-3 failure scenarios end-to-end - [ ] services-check script has been updated to reflect ported checks diff --git a/docs/how-to/alerts/refactor-services-check-to-query-alerts.md b/docs/how-to/runbooks/refactor-services-check-to-query-alerts.md similarity index 97% rename from docs/how-to/alerts/refactor-services-check-to-query-alerts.md rename to docs/how-to/runbooks/refactor-services-check-to-query-alerts.md index 640bcff..244be1f 100644 --- a/docs/how-to/alerts/refactor-services-check-to-query-alerts.md +++ b/docs/how-to/runbooks/refactor-services-check-to-query-alerts.md @@ -1,9 +1,6 @@ --- title: Refactor services-check to Query Alerts modified: 2026-03-22 -status: active -requires: - - port-services-check-alerts tags: - how-to - alerting diff --git a/docs/how-to/alerts/runbook-argocd-out-of-sync.md b/docs/how-to/runbooks/runbook-argocd-out-of-sync.md similarity index 100% rename from docs/how-to/alerts/runbook-argocd-out-of-sync.md rename to docs/how-to/runbooks/runbook-argocd-out-of-sync.md diff --git a/docs/how-to/alerts/runbook-frigate-camera-down.md b/docs/how-to/runbooks/runbook-frigate-camera-down.md similarity index 100% rename from docs/how-to/alerts/runbook-frigate-camera-down.md rename to docs/how-to/runbooks/runbook-frigate-camera-down.md diff --git a/docs/how-to/alerts/runbook-pod-not-ready.md b/docs/how-to/runbooks/runbook-pod-not-ready.md similarity index 100% rename from docs/how-to/alerts/runbook-pod-not-ready.md rename to docs/how-to/runbooks/runbook-pod-not-ready.md diff --git a/docs/how-to/alerts/runbook-postgres-unhealthy.md b/docs/how-to/runbooks/runbook-postgres-unhealthy.md similarity index 100% rename from docs/how-to/alerts/runbook-postgres-unhealthy.md rename to docs/how-to/runbooks/runbook-postgres-unhealthy.md diff --git a/docs/how-to/alerts/runbook-service-probe-failure.md b/docs/how-to/runbooks/runbook-service-probe-failure.md similarity index 100% rename from docs/how-to/alerts/runbook-service-probe-failure.md rename to docs/how-to/runbooks/runbook-service-probe-failure.md diff --git a/docs/how-to/alerts/runbook-textfile-stale.md b/docs/how-to/runbooks/runbook-textfile-stale.md similarity index 100% rename from docs/how-to/alerts/runbook-textfile-stale.md rename to docs/how-to/runbooks/runbook-textfile-stale.md