C2: Deploy infrastructure alerting pipeline #303
Summary

Mikado chain to replace mise run services-check with Grafana Unified Alerting backed by ntfy push notifications.

Design: docs/how-to/alerts/

Chain (bottom-up):
- configure-grafana-alerting-pipeline: enable alerting, ntfy contact point, notification policy
- first-alert-and-runbook: end-to-end proof of concept with blackbox probe failure
- port-services-check-alerts: migrate all services-check probes to alert rules + runbooks
- refactor-services-check-to-query-alerts: rewrite services-check to query Grafana API
- deploy-infra-alerting: goal card

🤖 Generated with Claude Code
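
As a rough illustration of the refactor-services-check-to-query-alerts step, the sketch below shows one way a services-check replacement could read alert state from Grafana instead of probing services itself. It is not the code in this PR. It assumes Grafana's built-in Alertmanager-compatible endpoint (`/api/alertmanager/grafana/api/v2/alerts`) and a service account token; the function names, the output format, and the `severity` and `instance` labels are hypothetical.

```python
import json
from urllib.request import Request, urlopen

# Assumption: alerts are read from Grafana's built-in Alertmanager-compatible API.
ALERTS_PATH = "/api/alertmanager/grafana/api/v2/alerts"


def fetch_alerts(base_url, token):
    """Fetch current alerts as a list of dicts (Alertmanager v2 alert objects)."""
    req = Request(
        base_url.rstrip("/") + ALERTS_PATH,
        headers={"Authorization": f"Bearer {token}"},
    )
    with urlopen(req, timeout=10) as resp:
        return json.load(resp)


def summarize(alerts):
    """Return (exit_code, lines): exit code 1 if any alert is active, else 0."""
    active = [a for a in alerts if a.get("status", {}).get("state") == "active"]
    if not active:
        return 0, ["OK: no active alerts"]
    lines = []
    for alert in sorted(active, key=lambda a: a.get("labels", {}).get("alertname", "")):
        labels = alert.get("labels", {})
        # "severity" and "instance" are hypothetical label names for this sketch.
        lines.append(
            "FIRING {} [{}] {}".format(
                labels.get("alertname", "?"),
                labels.get("severity", "unknown"),
                labels.get("instance", "-"),
            )
        )
    return 1, lines
```

A caller would pass the result of `fetch_alerts(...)` to `summarize(...)`, print the lines, and exit with the returned code, so that silenced (suppressed) alerts do not fail the check.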
Mikado chain for deploying Grafana Unified Alerting with ntfy notifications, replacing manual services-check probes.

Chain: configure-grafana-alerting-pipeline → first-alert-and-runbook → port-services-check-alerts → refactor-services-check-to-query-alerts → deploy-infra-alerting (goal)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add reduce step between Prometheus query and threshold to preserve per-service labels. Without it, Grafana can't distinguish the 5 probe_success series and errors with "duplicate results with labels {}".

Chain: A (prometheus query) → B (reduce last) → C (threshold < 1)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Replace Grafana's default webhook JSON with ntfy-native JSON via payloadTemplate. The template produces:

{"topic":"infra-alerts","title":"[FIRING] ...","message":"...","actions":[...]}

This gives clean notifications instead of raw Grafana JSON blobs. Uses coll.Dict/data.ToJSON template functions (Grafana 12+).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

64ae12ad71 to 4c0bd0055f

Extend Alloy blackbox probes:
- Add prometheus, loki, grafana, teslamate, immich, navidrome
- Now probing 11 services (was 5), covering most HTTP checks from services-check

Add alert rules:
- PostgresClusterUnhealthy: cnpg_collector_up < 1 for 3m (critical)
- PodNotReady: kube_pod_status_ready{condition="true"} == 0 for 5m

Add runbooks:
- runbook-postgres-unhealthy.md
- runbook-pod-not-ready.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

CronJob pods (e.g., zim-watcher) are expected to complete and become not-ready. Exclude them with `unless on (namespace, pod) kube_pod_owner{owner_kind="Job"}`.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

No unhealthy pods = no query results = noData state. With noDataState set to NoData, Grafana fires an alert with empty labels ("Pod in is not ready"). Change to OK since no results means everything is healthy.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

c22f9db1c8 to 67883950c3
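
The A → B → C chain and the noData handling described in the commits above can be pictured with a small toy model. This is a simplified Python sketch of the idea, not Grafana's implementation: the function names are hypothetical, label sets are modelled as tuples of pairs, and the configured noDataState value is returned as-is rather than mapped to Grafana's internal state names.

```python
from typing import Dict, List, Tuple

Labels = Tuple[Tuple[str, str], ...]   # e.g. (("instance", "grafana"),)
Series = Dict[Labels, List[float]]     # query A: one sample list per label set


def reduce_last(series: Series) -> Dict[Labels, float]:
    """Step B: collapse each series to its last sample, keeping its labels."""
    return {labels: samples[-1] for labels, samples in series.items() if samples}


def threshold_below(values: Dict[Labels, float], limit: float) -> Dict[Labels, bool]:
    """Step C: a label set is firing when its reduced value is below the limit."""
    return {labels: value < limit for labels, value in values.items()}


def evaluate(series: Series, limit: float = 1.0, no_data_state: str = "OK") -> Dict[Labels, str]:
    """Run A -> B -> C; an empty result falls back to the configured no-data state."""
    reduced = reduce_last(series)
    if not reduced:
        # No series at all: a single result with empty labels, which is why
        # a NoData setting produced an alert reading "Pod in is not ready".
        return {(): no_data_state}
    return {
        labels: ("Alerting" if firing else "Normal")
        for labels, firing in threshold_below(reduced, limit).items()
    }
```

Because the reduce step keys each value by its label set, every probe_success series gets its own state; with an empty query result, `evaluate({})` yields one unlabeled entry whose state is whatever noDataState is set to.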